PR #731
Record: 1.0400 BPB -- Hedge Mixer + VRL + AdamW TTT + Polyak EMA
by pentxayc
val_bpb
1.0400
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,999,919 bytes
Training Techniques
Architecture
VRL
Value Residual Learning with a residual connection from layer 0's value output to all subsequent layers
parameters: null
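The VRL connection can be sketched as below. The mixing coefficient `lam` and the plain matrix value projections are illustrative assumptions; the PR does not state how the layer-0 values are blended in:

```python
import numpy as np

def attention_values_with_vrl(x, v_projs, lam=0.5):
    """Toy Value Residual Learning: each later layer's value activations
    are mixed with the value activations computed at layer 0.
    x: (T, d) token activations; v_projs: list of (d, d) value matrices."""
    v0 = x @ v_projs[0]                      # layer-0 value output, reused downstream
    outs = [v0]
    for Wv in v_projs[1:]:
        v = x @ Wv
        outs.append(lam * v + (1 - lam) * v0)  # residual from layer 0's values
    return outs
```

With `lam = 0`, every layer would reuse layer 0's values verbatim; with `lam = 1`, the residual path is disabled.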
LeakyReLU
LeakyReLU activation squared
parameters: {"negative_slope":0.5,"power":2}
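The squared-LeakyReLU activation follows directly from the listed parameters (`negative_slope=0.5`, `power=2`):

```python
def leaky_relu_squared(x, negative_slope=0.5, power=2):
    """LeakyReLU followed by raising to `power`, per the record's parameters.
    Note that squaring maps negative pre-activations to positive outputs."""
    y = x if x >= 0 else negative_slope * x
    return y ** power
```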
XSA-4
Cross-Token Self-Attention applied on the last 4 layers
parameters: {"layers":4}
tied embeddings
Input and output embeddings are tied
parameters: null
GQA
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"query_heads":8,"kv_heads":4}
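With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal numpy sketch of that head-sharing pattern (single sequence, no masking or projections, which is a simplification of any real implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=4):
    """Toy GQA: q has n_q_heads heads, k/v have n_kv_heads heads;
    each KV head is shared by n_q_heads // n_kv_heads query heads.
    Shapes: q (n_q_heads, T, d), k/v (n_kv_heads, T, d)."""
    group = n_q_heads // n_kv_heads
    outs = []
    for h in range(n_q_heads):
        kv = h // group                                  # KV head serving this query head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w = w / w.sum(-1, keepdims=True)                 # softmax over keys
        outs.append(w @ v[kv])
    return np.stack(outs)                                # (n_q_heads, T, d)
```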
BigramHash
Hashed bigram feature table used in the model
parameters: {"buckets":2048}
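A hashed bigram table maps each (previous byte, current byte) pair into one of 2048 buckets. The mixing constants below are illustrative, not taken from the PR:

```python
def bigram_bucket(prev_byte, cur_byte, buckets=2048):
    """Hash a byte bigram into one of `buckets` feature-table slots.
    Knuth-style multiplicative mixing; constants are a guess, only the
    bucket count (2048) comes from the record."""
    h = (prev_byte * 257 + cur_byte) * 2654435761 % (2 ** 32)
    return h % buckets
```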
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"learning_rate":0.0005,"test_time_training":true}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
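Polyak averaging with `decay=0.998` keeps an exponential moving average of the weights alongside the raw TTT updates:

```python
def polyak_update(avg_params, new_params, decay=0.998):
    """One Polyak (exponential moving average) step over the weights:
    avg <- decay * avg + (1 - decay) * new, applied per parameter."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, new_params)]
```

The averaged copy, not the raw weights, is typically what gets evaluated.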
Compression
lzma
level: null
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"polyak_decay":0.998,"freeze_first_blocks":9,"unfreeze_last_blocks":2,"epochs_per_chunk":3,"byte_weighted_loss":true,"adaptive_cosine_lr":true}
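The freeze/unfreeze split from the parameters (`freeze_first_blocks=9`, `unfreeze_last_blocks=2`) can be sketched as a simple partition of the block list; the block objects here are placeholders, not the PR's modules:

```python
def split_ttt_params(blocks, freeze_first=9, unfreeze_last=2):
    """Partition transformer blocks for test-time training: the first
    `freeze_first` blocks stay frozen and only the last `unfreeze_last`
    blocks receive AdamW gradient updates during TTT."""
    frozen = blocks[:freeze_first]
    trainable = blocks[-unfreeze_last:]
    return frozen, trainable
```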
LR Schedule
adaptive cosine decay
parameters: {"ramp_multiplier_start":1,"ramp_multiplier_end":3,"ramp_fraction":0.3}
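One plausible reading of these parameters: the LR multiplier ramps linearly from 1x to 3x over the first 30% of steps, then cosine-decays toward zero. This is an interpretation of the listed values, not the PR's code:

```python
import math

def adaptive_cosine_lr(step, total_steps, base_lr=5e-4,
                       ramp_start=1.0, ramp_end=3.0, ramp_fraction=0.3):
    """Adaptive cosine schedule sketch: linear ramp of the LR multiplier
    from ramp_start to ramp_end over the first ramp_fraction of steps,
    followed by cosine decay of the multiplier to zero."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step < ramp_steps:
        mult = ramp_start + (ramp_end - ramp_start) * step / ramp_steps
    else:
        progress = (step - ramp_steps) / max(1, total_steps - ramp_steps)
        mult = ramp_end * 0.5 * (1 + math.cos(math.pi * progress))
    return base_lr * mult
```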
Other
other
Hedge Mixer online ensemble combining neural predictions with unigram, bigram, trigram, and entropy experts using multiplicative weights
parameters: {"experts":5,"eta":0.1,"deferred_updates":true}
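The multiplicative-weights core of such a mixer is standard Hedge: mix the expert distributions under the current weights, score the mixture, then scale each expert's weight by `exp(-eta * log-loss)` on the observed byte. In the record's deferred score-first variant, the update is applied only after the mixed prediction has been scored. A pure-Python sketch (two experts shown; the PR uses five):

```python
import math

def hedge_mix(expert_probs, weights):
    """Weighted mixture of expert predictive distributions, renormalized."""
    mix = [sum(w * p[i] for w, p in zip(weights, expert_probs))
           for i in range(len(expert_probs[0]))]
    s = sum(mix)
    return [m / s for m in mix]

def hedge_update(weights, expert_probs, observed, eta=0.1):
    """Multiplicative-weights step: weight_i *= exp(-eta * logloss_i),
    i.e. weight_i *= p_i(observed) ** eta, then renormalize."""
    new = [w * math.exp(eta * math.log(max(p[observed], 1e-12)))
           for w, p in zip(weights, expert_probs)]
    s = sum(new)
    return [w / s for w in new]
```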
Novel Contributions
- Value Residual Learning (VRL) in the transformer
- 5-expert Hedge Mixer during evaluation
- Deferred score-first Hedge weight updates
- AdamW test-time training with Polyak EMA
- Byte-weighted loss for TTT
- Adaptive cosine learning rate during TTT
- Freeze-first-9-blocks / unfreeze-last-2-blocks TTT scheme
- Int6 mixed quantization with lzma compression
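A symmetric per-tensor int6 scheme (integers in [-31, 31] plus a float scale) is one common way to realize the int6 quantization in the last bullet; the PR's actual "mixed" scheme, e.g. which tensors stay at higher precision, is not specified here:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization sketch: round floats to integers in
    [-31, 31] with a per-tensor scale. The quantized tensor is what would
    be lzma-compressed into the artifact."""
    m = np.abs(w).max()
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the int6 codes."""
    return q.astype(np.float32) * scale
```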