PR #455
openRecord: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299)
by kasimte
val_bpb: 1.1299
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,711,898 bytes
Training Techniques
Architecture
XSA
Efficient partial XSA applied to the last 4 layers; the implementation is GQA-aware and allocation-free.
parameters: {"layers":4}
Partial RoPE
Uses partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":16,"base_dimensions":64}
MLP3x
3x MLP expansion with relu-squared activation.
parameters: {"expansion":3}
tied embeddings
Input and output embeddings are tied.
parameters: null
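Tying is the standard single-matrix idiom, which also shrinks the artifact since the matrix is stored once; a sketch:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab, bias=False)
        self.head.weight = self.embed.weight  # input and output share one matrix
```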
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
Shared Value Embedding
Shared value embedding table used across layers 9 and 10 with learned per-layer scales.
parameters: {"dimension":128,"layers":[9,10]}
SmearGate
Uses SmearGate combined with BigramHash features.
parameters: null
BigramHash
BigramHash with 2048 buckets and 128-dimensional embeddings.
parameters: {"buckets":2048,"dimension":128}
Quantization
STE QAT
bits: 6
scope: MLP and attention weights; int8 for embeddings
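A sketch of the straight-through fake-quantizer: 6-bit symmetric quantization on the forward pass, identity gradient on the backward pass. Per-tensor scaling is an assumption; the embedding path would use the same function with bits=8.

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w: torch.Tensor, bits: int) -> torch.Tensor:
        qmax = 2 ** (bits - 1) - 1                     # 31 for int6
        scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale (assumption)
        return (w / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # STE: pass gradients through unchanged

# usage inside a layer's forward: y = x @ FakeQuant.apply(self.weight, 6).t()
```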
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"warmup":"0.92->0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embeddings_lr":0.035,"scalars_lr":0.025}
Weight Averaging
SWA
parameters: {"checkpoints":12,"interval_steps":50,"start_condition":"scale<0.2","window_steps":600}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Test-Time Training
full TTT
parameters: {"epochs":3,"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"batch_size":32,"freezes_first_blocks":2}
Compression
zstd
level: 22
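A sketch of artifact compression with the zstandard package at the recorded level:

```python
import zstandard as zstd

def compress_artifact(src: str, dst: str) -> None:
    # Level 22 sits in zstd's "ultra" range: slow to compress,
    # decompression speed unaffected.
    with open(src, "rb") as f, open(dst, "wb") as g:
        g.write(zstd.ZstdCompressor(level=22).compress(f.read()))
```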
Initialization
Orthogonal initialization
Orthogonal init with projection scaling by 1/sqrt(2*num_layers).
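A sketch of the init rule; which matrices count as scaled "projections" is an assumption (here: any weight whose name ends in "proj.weight"):

```python
import math
import torch.nn as nn

def init_orthogonal(model: nn.Module, num_layers: int) -> None:
    for name, p in model.named_parameters():
        if p.ndim == 2:
            nn.init.orthogonal_(p)
            if name.endswith("proj.weight"):  # residual-branch projections
                p.data.mul_(1.0 / math.sqrt(2 * num_layers))
```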
Regularization
layerwise LN scale
parameters: {"scale_factor":"1/sqrt(layer_idx+1)"}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"wallclock_based":true}
Other
other
Late QAT enabled during warmdown when LR scale < 0.1.
parameters: {"trigger":"lr_scale<0.1"}
Novel Contributions
- Tight SWA restricted to late, low-LR-scale checkpoints, avoiding the quality penalty of averaging in earlier high-LR weights
- Test-time training on already-evaluated validation tokens
- Late STE int6 quantization-aware training during warmdown
- Sliding-window evaluation with stride 64 and context length 2048
- Shared value embedding and partial XSA architecture refinements