PR #2106
open
Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRA TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta
by PiyushDatta
val_bpb
1.0893
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,999,684 bytes
Training Techniques
Architecture
depth recurrence
Layers 3-5 are run one extra time, giving 14 effective layer passes (11 + 3) from 11 unique layers.
parameters: {"layers":[3,4,5],"passes":14}
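The pass count above can be checked with a small sketch of the layer schedule. It assumes 0-indexed layers and that the block 3-5 is repeated as a unit (3, 4, 5, 3, 4, 5); the source does not state either detail, and `layer_schedule` is a hypothetical helper name.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), extra_loops=1):
    """Return the order in which the unique layers are applied in one forward pass.

    Assumption: the looped block is contiguous and repeated as a whole.
    """
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                 # layers before the looped block
    for _ in range(1 + extra_loops):           # original pass plus the extra loop(s)
        order.extend(loop_layers)
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order


schedule = layer_schedule()
print(schedule)
print(len(schedule), "passes from", len(set(schedule)), "unique layers")
```

Reusing layer weights this way adds compute depth without adding parameters, which matters under a fixed artifact-size budget.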
LeakyReLU
Uses LeakyReLU(0.5)^2 as the MLP activation.
parameters: {"slope":0.5}
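A minimal scalar sketch of this activation, assuming "LeakyReLU(0.5)^2" means squaring the output of a LeakyReLU with negative slope 0.5. The function name is illustrative, not taken from the PR.

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y


for value in (-2.0, 0.0, 2.0):
    print(value, "->", leaky_relu_squared(value))
```

Under this reading the output is always non-negative, and negative inputs contribute a quarter of what a positive input of the same magnitude would.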
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
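With 8 query heads and 4 KV heads, each KV head serves two query heads. The sketch below shows the head-to-head mapping, assuming the common contiguous grouping; the PR summary does not say which grouping is used, and the helper name is hypothetical.

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it shares (contiguous groups assumed)."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)
```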
Quantization
GPTQ
bits: 6
scope: weights and embeddings
mixed int6/int8
bits: null
scope: weights and embeddings
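To illustrate what a 6-bit weight format with per-row scales looks like, here is a sketch of symmetric round-to-nearest quantization. This is not GPTQ, which additionally uses Hessian information to compensate rounding error; the function names and the max-abs scale choice are assumptions made for illustration.

```python
def quantize_row(row, bits=6):
    """Symmetric per-row quantization: integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per row
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Reconstruct approximate float weights from integers and the row scale."""
    return [scale * v for v in q]


row = [0.31, -0.62, 0.05, 0.0]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
print(q, scale)
print(restored)
```

Because the integers and the scale are stored separately, the scale can later be adjusted while the integers stay frozen.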
Weight Averaging
SWA
parameters: {"start_scale":0.12,"frequency":"every step"}
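One possible reading of these parameters is that weight snapshots are averaged at every step once the learning-rate multiplier has fallen to 0.12 or below. The sketch below implements that reading with a running mean over flat weight lists; the trigger condition, the names, and the toy data are all assumptions, not details confirmed by the PR.

```python
class RunningAverage:
    """Incremental mean of weight vectors (plain lists of floats)."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


def swa_over_run(snapshots, lr_scales, start_scale=0.12):
    """Average every snapshot whose LR multiplier is at or below start_scale."""
    avg = RunningAverage()
    for weights, lr_scale in zip(snapshots, lr_scales):
        if lr_scale <= start_scale:
            avg.update(weights)
    return avg.mean


# Toy run: only the last two snapshots fall inside the averaging window.
snapshots = [[1.0], [2.0], [3.0], [5.0]]
lr_scales = [1.0, 0.5, 0.12, 0.05]
averaged = swa_over_run(snapshots, lr_scales)
print(averaged)
```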
Compression
brotli
level: null
Test-Time Training
LoRA TTT
parameters: {"phased":true,"score_first":true}
Optimizer
Muon
weight_decay: 0.095
momentum: 0.95
other_params: {"matrix_lr":0.028,"embed_wd":0.085,"embed_optimizer":"AdamW"}
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.72}
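A warmdown fraction of 0.72 suggests that the learning rate is held for the first 28% of training and decayed over the remaining 72%. The sketch below assumes a linear decay to zero and no warmup; the decay shape and the function name are not given in the source.

```python
def lr_multiplier(step, total_steps, warmdown_fraction=0.72):
    """Return the LR multiplier: 1.0 before warmdown, then linear decay to 0."""
    warmdown_steps = round(total_steps * warmdown_fraction)
    if warmdown_steps == 0:
        return 1.0
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


for step in (0, 28, 64, 100):
    print(step, lr_multiplier(step, total_steps=100))
```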
Sequence Length
sequence_length
train_length: 393216
eval_length: null
Other
other
Uses SP8192 tokenizer with an 8x larger vocabulary than the SP1024 baseline.
parameters: {"vocab_size":8192}
other
Uses Polar Express Newton-Schulz coefficients for the Muon optimizer.
parameters: null
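Muon uses a Newton-Schulz iteration to approximately orthogonalize each update matrix. The source does not list the Polar Express coefficients, so the sketch below deliberately uses the textbook cubic iteration X <- 1.5 X - 0.5 X X^T X instead; it shows the mechanism only, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=20):
    """Drive the singular values of g toward 1 with the classic cubic iteration."""
    # Normalize by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


g = [[3.0, 1.0], [0.0, 2.0]]
x = newton_schulz(g)
gram = matmul(x, transpose(x))  # should be close to the identity matrix
print(gram)
```

Tuned coefficient schedules aim to reach a comparable result in far fewer iterations than this fixed cubic rule needs.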
Novel Contributions
- Multi-trajectory SWA with independent per-rank warmdown trajectories and cross-rank averaging
- Scale tuning post-GPTQ by freezing int weights and fine-tuning only per-row scales
- Two-pass GPTQ with Hessian recollection on the quantized model
- Selective training-time 2:4 sparsity pruning on MLP weights
- SP8192 tokenizer with GPTQ embeddings and SDClip-style quantization
- Depth recurrence in layers 3-5
- Polar Express Newton-Schulz optimizer coefficients
- Phased LoRA test-time training
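The 2:4 sparsity item in the list above refers to a pattern in which at most two of every four consecutive weights are nonzero. A minimal sketch follows, assuming magnitude-based selection within each group of four; how the PR selects which weights or layers to prune, and when during training, is not described in this summary, and `prune_2_of_4` is a hypothetical name.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

The resulting zeros are regular and predictable, which tends to help a general-purpose compressor such as brotli shrink the stored weights.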