PR #398

open

Non-record: 11L EMA + TTT(20ep,freeze=0) + 15-run ablation study — val_bpb=1.1213 (3-seed)

by felipe-parodi
val_bpb: 1.1213
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.53 MB

Training Techniques

Architecture
SmearGate
Adds SmearGate to the model architecture.
parameters: null
BigramHash
Uses a BigramHash embedding/component with vocabulary size 2048 and dimension 128.
parameters: {"vocab_size":2048,"dim":128}
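As a rough illustration of how a hashed bigram index could be computed, the sketch below maps each (previous token, current token) pair to one of 2048 buckets, each of which would own a 128-dimensional embedding row. The hash formula, the padding choice for position 0, and the function names are assumptions for illustration, not the PR's implementation.

```python
VOCAB_BUCKETS = 2048  # "vocab_size" from the parameters above
EMBED_DIM = 128       # "dim" from the parameters above (size of each bucket's embedding row)

def bigram_bucket(prev_token: int, token: int, buckets: int = VOCAB_BUCKETS) -> int:
    """Hash a (previous, current) token pair into a bucket index (hypothetical hash)."""
    return (prev_token * 1_000_003 + token) % buckets

def bigram_indices(tokens: list[int]) -> list[int]:
    """Return one bucket index per position; position 0 uses 0 as a stand-in predecessor."""
    prev = [0] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```

Distinct bigrams can collide in the same bucket; that is the usual trade-off of a hashed table against a full vocabulary-squared table.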
Partial RoPE
Applies rotary position embeddings to only part of the dimensions.
parameters: {"dimensions":16}
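A minimal sketch of partial rotary embeddings on a single head vector: only the first 16 entries are rotated by position-dependent angles and the remaining entries pass through unchanged. The base of 10000, the pairing of entry i with entry i + 8, and the 64-entry head used in the usage note are assumptions; the source states only that 16 dimensions are rotated.

```python
import math

ROPE_DIMS = 16  # "dimensions" from the parameters above

def partial_rope(vec, position, rope_dims=ROPE_DIMS, base=10000.0):
    """Rotate the first rope_dims entries of vec; leave the rest untouched."""
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        # Frequency decreases with i, as in standard RoPE (assumed base).
        theta = position * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

For example, `partial_rope([1.0] * 64, position=3)` changes entries 0 to 15 and returns entries 16 to 63 as they were.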
MLP3x
Uses a 3x-width MLP block.
parameters: {"hidden":1536}
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
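With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The sketch below shows one common way to assign them (contiguous groups); the grouping order and the helper name are assumptions, since the source gives only the head counts.

```python
NUM_HEADS = 8      # "heads" from the parameters above
NUM_KV_HEADS = 4   # "kv_heads" from the parameters above

def kv_head_for(query_head, num_heads=NUM_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Return the KV head index that a given query head attends with."""
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(NUM_HEADS)]
```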
weight tying
Uses tied embeddings.
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
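The scale formula can be evaluated per layer as below. The sketch assumes layers are indexed from 0 and takes the layer count of 11 from the "11L" in the PR title; where the factor is applied inside each block is not stated in the source.

```python
import math

NUM_LAYERS = 11  # from "11L" in the PR title

def ln_scale(layer: int) -> float:
    """Scale factor 1/sqrt(layer+1), assuming 0-based layer indices."""
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(i) for i in range(NUM_LAYERS)]
```

Under this indexing the first layer keeps a scale of 1 and deeper layers are progressively damped.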
Weight Averaging
EMA
parameters: {"decay":0.997}
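A minimal sketch of the EMA update with the stated decay, written over a flat list of floats rather than real model tensors. The function name and the per-step update cadence are assumptions.

```python
EMA_DECAY = 0.997  # "decay" from the parameters above

def ema_update(ema_weights, weights, decay=EMA_DECAY):
    """Blend current weights into the running average: ema = decay*ema + (1-decay)*w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With a decay of 0.997, each update moves the average 0.3% of the way toward the current weights.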
Quantization
mixed int6
bits: 6
scope: all
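To illustrate what 6-bit storage implies, the sketch below quantizes one row of weights to signed integers in [-32, 31] with a single per-row scale. Symmetric per-row scaling is an assumption, and the "mixed" part of the scheme (which tensors get which treatment) is not described in the source and is not modeled here.

```python
def quantize_int6(row):
    """Symmetric quantization of a list of floats to ints in [-32, 31]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-32, min(31, round(x / scale))) for x in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized ints."""
    return [v * scale for v in q]
```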
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
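A sketch of how stride-64 sliding-window evaluation might enumerate windows so that every token is scored exactly once, with later tokens seeing nearly a full window of context. The window length of 2048 is an assumption borrowed from the training length (the eval length is listed as null), and the function name is hypothetical.

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Yield (start, end, score_from); tokens in [score_from, end) are scored
    using the context window [start, end)."""
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end, scored_upto
        scored_upto = end   # everything up to `end` has now been scored
        start += stride     # slide the context forward by one stride
```

After the first window, each step scores at most 64 new tokens, so evaluation cost grows roughly by a factor of window/stride compared with non-overlapping windows.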
Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate":0.008,"momentum":0.9,"freeze_blocks":0}
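The listed hyperparameters can be read as a short fine-tuning loop at evaluation time. The sketch below assumes plain SGD with momentum (the source gives a learning rate and momentum but does not name the optimizer) and uses a hypothetical `grad_fn` that returns gradients for a flat list of parameters. With freeze_blocks=0, every parameter is updated.

```python
TTT_EPOCHS = 20      # "epochs" from the parameters above
TTT_LR = 0.008       # "learning_rate" from the parameters above
TTT_MOMENTUM = 0.9   # "momentum" from the parameters above

def ttt_adapt(params, grad_fn, epochs=TTT_EPOCHS, lr=TTT_LR, momentum=TTT_MOMENTUM):
    """Run `epochs` momentum-SGD steps on all parameters (nothing frozen)."""
    params = list(params)
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        grads = grad_fn(params)  # hypothetical: gradient of the loss on the eval data
        velocity = [momentum * v + g for v, g in zip(velocity, grads)]
        params = [p - lr * v for p, v in zip(params, velocity)]
    return params
```

A toy call such as `ttt_adapt([0.0], lambda p: [2 * (p[0] - 1.0)])` moves the parameter toward the minimizer of (p - 1)^2.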
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
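The momentum warmup parameters suggest a schedule that ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. The sketch below assumes a linear ramp; the source does not state the interpolation shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly interpolate momentum from `start` to `end`, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```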
AdamW
weight_decay: 0.04
momentum: null
other_params: {"scalar_lr":0.025,"tied_embed_lr":0.035}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
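A warmdown of 3000 steps can be sketched as a learning-rate multiplier that stays at 1 and then decays over the final 3000 steps. Linear decay to zero and the `total_steps` argument are assumptions; the run is wall-clock limited, so the actual step count is not given in the source.

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    """Return 1.0 until the warmdown begins, then decay linearly to 0.0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```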
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • EMA(0.997) combined with aggressive 20-epoch test-time training
  • Unfreezing all blocks during TTT (freeze_blocks=0) was critical for best performance
  • 15-run ablation study identifying negative results such as late QAT, memory tokens, warmdown=20000, and PPM-C blending
  • Removal of XSA to save step time and gain additional training steps within the wall-clock budget
  • Mixed int6 quantization with zstd-22 compression under the 16MB artifact constraint