PR #811
openRecord: Complementary Training + Backoff N-gram Mixer — 0.4377 BPB
by quietsmile
val_bpb
0.4377
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.9MB
Training Techniques
Architecture
XSA
Uses XSA on the last 4 layers.
parameters: {"layers":4}
MLP3x
MLP with 3x hidden width and a squared LeakyReLU activation (negative slope 0.5).
parameters: null
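A minimal numpy sketch of this MLP block, assuming "LeakyReLU(0.5)^2" means the LeakyReLU output is squared (function names and shapes are illustrative, not the record's code):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: y = leaky_relu(x)^2.
    Negatives are scaled by `slope` before squaring, so
    leaky_relu_sq(-2) = (-1)^2 = 1."""
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x, w1, w2):
    """Two-layer MLP whose hidden width is 3x d_model
    (vs. the conventional 4x), with the squared-LeakyReLU nonlinearity."""
    return leaky_relu_sq(x @ w1) @ w2
```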
KV head count
Uses 4 KV heads with 8 attention heads.
parameters: {"heads":8,"kv_heads":4}
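Using fewer KV heads than query heads is grouped-query attention: with 8 query heads and 4 KV heads, each KV head serves 2 query heads. A numpy sketch of the mechanism (single-sequence shapes for brevity; not the record's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Causal grouped-query attention.
    q: (T, n_heads*d); k, v: (T, n_kv_heads*d) -- the KV projections
    are half the size, which shrinks the KV cache and weights."""
    T = q.shape[0]
    d = q.shape[1] // n_heads
    q = q.reshape(T, n_heads, d).transpose(1, 0, 2)     # (8, T, d)
    k = k.reshape(T, n_kv_heads, d).transpose(1, 0, 2)  # (4, T, d)
    v = v.reshape(T, n_kv_heads, d).transpose(1, 0, 2)
    rep = n_heads // n_kv_heads          # 2 query heads per KV head
    k = np.repeat(k, rep, axis=0)        # broadcast KV to 8 heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (8, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # causal mask
    scores = np.where(mask, -1e9, scores)
    out = softmax(scores) @ v                           # (8, T, d)
    return out.transpose(1, 0, 2).reshape(T, n_heads * d)
```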
Quantization
mixed int6
bits: 6
scope: model weights
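For reference, a symmetric per-tensor int6 round-trip (the record says "mixed" int6, so the actual scheme likely varies per tensor; this shows only the basic 6-bit quantize/dequantize step):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: codes live in [-32, 31].
    Per-tensor scale maps the largest magnitude to 31."""
    amax = np.abs(w).max()
    scale = amax / 31.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```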
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
EMA
parameters: {"decay":0.998}
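The EMA update itself is one line per parameter; with decay 0.998 the average has an effective horizon of roughly 500 steps. A sketch over a dict of weights:

```python
def ema_update(ema_params, params, decay=0.998):
    """One EMA step: ema <- decay * ema + (1 - decay) * w,
    applied to every named parameter. The EMA copy, not the raw
    weights, is what gets evaluated/shipped."""
    for name in params:
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * params[name]
    return ema_params
```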
Evaluation
stride-based eval
parameters: {"stride":128}
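Stride-based eval slides the context window forward 128 tokens at a time and scores only the tokens not covered by the previous window, so every token is scored exactly once with near-full left context. A sketch of the window schedule (context length 1024 is an assumed example value):

```python
def stride_eval_windows(n_tokens, context=1024, stride=128):
    """Return (start, end, n_scored) windows for strided evaluation.
    Each window scores only its last `n_scored` tokens (the first
    window scores everything), covering all n_tokens exactly once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```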
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":4,"freeze_blocks":2,"temperature":0.98}
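"Score-first" presumably means each chunk is scored before any update that has seen it, keeping the eval legal (no token's score reflects training on that token). A hypothetical sketch of that loop order, with plain gradient descent standing in for the record's AdamW and the temperature knob not modeled:

```python
def score_first_ttt(params, chunks, loss_and_grad, lr=5e-4, epochs=4,
                    freeze_blocks=2):
    """Score-first test-time training (assumed structure).
    For each chunk: record its loss under the current weights FIRST,
    then adapt the unfrozen parameters on it for `epochs` passes.
    The first `freeze_blocks` blocks are never updated."""
    frozen = {f"block{i}" for i in range(freeze_blocks)}
    scores = []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                  # scored before any update
        for _ in range(epochs):
            _, grads = loss_and_grad(params, chunk)
            for name, g in grads.items():
                if name not in frozen:       # freeze early blocks
                    params[name] -= lr * g
    return scores
```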
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Complementary training with bigram-weighted loss reweighting to focus learning on harder tokens.
parameters: {"complement_alpha":0.5}
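A sketch of one plausible form of the reweighting: tokens a cheap bigram model already predicts well are downweighted, so the transformer's gradient budget concentrates on the complement (the harder tokens). The exact weighting function is not shown in the record; the linear form below is an assumption:

```python
import numpy as np

def complementary_weights(bigram_probs, alpha=0.5):
    """Per-token loss weights from a bigram model's probability of the
    true token: weight 1.0 for tokens the bigram finds impossible,
    down to 1 - alpha for tokens it predicts with certainty."""
    return 1.0 - alpha * np.asarray(bigram_probs)

def reweighted_loss(token_losses, bigram_probs, alpha=0.5):
    """Weighted mean of per-token losses under complementary weights."""
    w = complementary_weights(bigram_probs, alpha)
    return float((w * np.asarray(token_losses)).sum() / w.sum())
```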
other
BackoffNgramMixer with orders 2-10 and entropy-adaptive alpha mixing.
parameters: {"ngram_order":10,"alpha_base":0.2,"alpha_range":0.55,"alpha_center":3}
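A sketch of the two ingredients as I read them: back off from the longest matching context (order 10) down to order 2, then mix the n-gram distribution into the model's with a weight that adapts to the n-gram distribution's entropy. The sigmoid schedule below, centered at `alpha_center`, is an assumption; only the parameter names and values come from the record:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def adaptive_alpha(h, alpha_base=0.2, alpha_range=0.55, alpha_center=3.0):
    """Assumed schedule: give the n-gram more weight when its
    distribution is confident (low entropy), decaying toward
    alpha_base as entropy rises past alpha_center bits."""
    return alpha_base + alpha_range / (1.0 + np.exp(h - alpha_center))

def backoff_probs(counts_by_order, context, vocab_size):
    """Back off from order 10 down to order 2: use the highest-order
    context with observed counts, else fall back to uniform."""
    for n in range(10, 1, -1):
        ctx = tuple(context[-(n - 1):])
        if ctx in counts_by_order.get(n, {}):
            c = counts_by_order[n][ctx]
            total = sum(c.values())
            return np.array([c.get(t, 0) / total for t in range(vocab_size)])
    return np.full(vocab_size, 1.0 / vocab_size)

def mix(p_model, p_ngram, **kw):
    """Convex mix of model and n-gram distributions with adaptive alpha."""
    a = adaptive_alpha(entropy(p_ngram), **kw)
    return (1.0 - a) * np.asarray(p_model) + a * np.asarray(p_ngram)
```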
Compression
lzma
level: null
Novel Contributions
- Complementary training with bigram-weighted loss reweighting
- BackoffNgramMixer with entropy-adaptive alpha mixing
- Legal score-first AdamW test-time training
- Stride=128 evaluation optimization