PR #1338
closed
Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean)
by bigbag
val_bpb: 1.0955
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.49 MB
Training Techniques
Architecture
BigramHash
Adds hashed bigram embeddings as an n-gram side channel feeding into the output logits.
parameters: {"vocab":2048,"dim":128}
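A minimal sketch of how such a side channel could work: hash each trailing (prev, cur) token pair into a bucket, look up a learned embedding, and project it to a bias over the output logits. The bucket count, mixing constant, and initialization below are illustrative assumptions; only vocab=2048 and dim=128 come from the PR.

```python
import numpy as np

VOCAB = 2048      # logit dimension, matching the SP2048 vocabulary (from the PR)
DIM = 128         # bigram embedding width (from the PR parameters)
N_BUCKETS = 4096  # hash-table size: hypothetical, not stated in the PR

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(N_BUCKETS, DIM))  # hashed bigram embeddings
out_proj = rng.normal(0.0, 0.02, size=(DIM, VOCAB))          # maps side channel to logits

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Mix the (prev, cur) pair into a bucket; the mixing constant is illustrative.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

def bigram_logit_bias(tokens):
    # Bias added to the model's next-token logits at the last position,
    # looked up from the trailing bigram.
    emb = bigram_table[bigram_hash(tokens[-2], tokens[-1])]
    return emb @ out_proj  # shape (VOCAB,)

bias = bigram_logit_bias([17, 42, 99])
```

In training, `bigram_table` and `out_proj` would be learned jointly with the transformer, so frequent local n-gram statistics need not be memorized by the attention layers.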
Depth recurrence
3-layer depth recurrence applied to layers 3, 4, and 5.
parameters: {"layers":[3,4,5],"start_step":3000}
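One plausible reading, sketched below: layers 3–5 are treated as a group whose weights are reused across extra forward passes. The loop count, model depth, and the stand-in block are assumptions; the PR only fixes the layer indices and that the recurrence switches on at step 3000.

```python
import numpy as np

RECUR_BLOCK = [3, 4, 5]  # layers shared by the recurrence (from the PR parameters)
N_LOOPS = 2              # passes through the block: hypothetical count
N_LAYERS = 8             # illustrative model depth
D = 16

rng = np.random.default_rng(0)
W = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]

def layer(x, w):
    # Stand-in for a full transformer block (residual + nonlinearity).
    return x + np.tanh(x @ w)

def forward(x):
    # Layers 0..2 run once; layers 3-5 are applied as a group N_LOOPS
    # times, reusing their weights (depth recurrence). The PR enables
    # this only after training step 3000 ("start_step": 3000).
    for i in range(RECUR_BLOCK[0]):
        x = layer(x, W[i])
    for _ in range(N_LOOPS):
        for i in RECUR_BLOCK:
            x = layer(x, W[i])
    for i in range(RECUR_BLOCK[-1] + 1, N_LAYERS):
        x = layer(x, W[i])
    return x

y = forward(rng.normal(size=(1, D)))
```

Weight reuse adds effective depth at inference without growing the parameter count, which matters for the artifact-size budget.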
Parallel residuals
Uses parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
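A sketch of the parallel-residual layout versus the usual sequential one, assuming PaLM-style parallel blocks; the stand-in sublayers and depth are illustrative, and only start_layer=7 comes from the PR.

```python
import numpy as np

START_LAYER = 7  # parallel residuals begin here (from the PR parameters)
N_LAYERS = 10    # illustrative depth
D = 16

rng = np.random.default_rng(0)
Wa = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]  # attention stand-ins
Wm = [rng.normal(0.0, 0.1, size=(D, D)) for _ in range(N_LAYERS)]  # MLP stand-ins

def block(x, wa, wm, parallel):
    if parallel:
        # Parallel residual: attention and MLP both read the same input,
        # and their outputs are summed into one residual update.
        return x + np.tanh(x @ wa) + np.tanh(x @ wm)
    # Sequential (GPT-style): the MLP reads the attention output.
    h = x + np.tanh(x @ wa)
    return h + np.tanh(h @ wm)

x = rng.normal(size=(1, D))
for i in range(N_LAYERS):
    x = block(x, Wa[i], Wm[i], parallel=(i >= START_LAYER))
```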
SP2048 vocabulary
Uses a 2048-token SentencePiece BPE vocabulary.
parameters: {"vocab_size":2048}
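To illustrate what a BPE vocabulary like SP2048's encodes, here is a toy merge loop in the spirit of SentencePiece's `bpe` model type: repeatedly fuse the most frequent adjacent symbol pair. The corpus, merge count, and `</w>` end-of-word marker are illustrative; the actual vocabulary is trained with SentencePiece at vocab_size=2048, not with this sketch.

```python
from collections import Counter

def bpe_merges(corpus, n_merges):
    # Toy byte-pair-encoding trainer: each word starts as characters plus
    # an end-of-word marker; each round merges the most frequent pair.
    words = [list(w) + ["</w>"] for w in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, words = bpe_merges(["low", "low", "lower", "newest", "newest"], 4)
```

A small 2048-token vocabulary trades longer token sequences for a much smaller embedding matrix, which helps the compressed artifact size.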
Weight Averaging
SWA
parameters: {"start_frac":0.75}
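A minimal sketch of SWA with the PR's start_frac: over the final 25% of steps, a running average of the weights is accumulated and used as the final model. The step count and stand-in update rule are illustrative.

```python
import numpy as np

START_FRAC = 0.75   # SWA begins after 75% of training (from the PR parameters)
TOTAL_STEPS = 200   # illustrative

rng = np.random.default_rng(0)
w = rng.normal(size=4)
swa_w = np.zeros_like(w)
n_avg = 0

for step in range(TOTAL_STEPS):
    w = w - 0.1 * w  # stand-in for one optimizer update
    if step >= START_FRAC * TOTAL_STEPS:
        # Running average of the weights over the SWA window.
        swa_w = (swa_w * n_avg + w) / (n_avg + 1)
        n_avg += 1
```

Averaging late-training iterates tends to land the final weights nearer the center of the loss basin than any single iterate.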
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3}
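A sketch of what "score-first" plausibly means here: each evaluation chunk is scored with the current weights, and only afterwards is the model adapted on that chunk, so no chunk ever benefits from having already been seen. The quadratic stand-in loss and chunk shapes are assumptions; lr=0.002 and epochs=3 come from the PR, and the update is plain SGD as the contributions list states.

```python
import numpy as np

LR = 0.002   # from the PR parameters
EPOCHS = 3   # from the PR parameters

rng = np.random.default_rng(0)
w = rng.normal(size=(4,))

def loss_and_grad(w, chunk):
    # Stand-in quadratic loss; a real run would use the LM's NLL on the chunk.
    diff = w - chunk
    return float(diff @ diff), 2 * diff

total = 0.0
for chunk in rng.normal(size=(5, 4)):
    loss, _ = loss_and_grad(w, chunk)
    total += loss                  # score first with the current weights...
    for _ in range(EPOCHS):        # ...then adapt on the chunk with SGD
        _, g = loss_and_grad(w, chunk)
        w = w - LR * g
```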
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","qk_gain":5,"matrix_lr":0.022}
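For context, a sketch of the standard Muon update for matrix parameters: a momentum-buffered gradient is approximately orthogonalized by a quintic Newton–Schulz iteration before being applied. The "MuonEq-R" variant and qk_gain=5 named above are submission-specific and not modeled here; the momentum coefficient is an assumption since the PR lists it as null.

```python
import numpy as np

MATRIX_LR = 0.022     # from the PR parameters
WEIGHT_DECAY = 0.095  # from the PR
BETA = 0.95           # momentum coefficient: assumed, not stated in the PR

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # as in the standard Muon optimizer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf):
    buf = BETA * buf + grad                   # momentum accumulation
    update = newton_schulz(buf)
    W = W * (1.0 - MATRIX_LR * WEIGHT_DECAY)  # decoupled weight decay
    return W - MATRIX_LR * update, buf

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W2, buf = muon_step(W, rng.normal(size=(4, 4)), np.zeros((4, 4)))
```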
Quantization
GPTQ
bits: 6
scope: full model
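To show what 6-bit weight quantization buys, here is a deliberately simplified per-row round-to-nearest sketch. Real GPTQ goes further, using Hessian-based error compensation to minimize layer output error rather than per-weight error; the matrix shapes below are illustrative.

```python
import numpy as np

BITS = 6  # from the PR ("GPTQ", bits: 6)

def quantize_int6_rtn(W):
    # Per-row symmetric round-to-nearest int6: a simplified stand-in for
    # GPTQ, which additionally compensates quantization error using
    # second-order (Hessian) information.
    qmax = 2 ** (BITS - 1) - 1                      # 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
q, scale = quantize_int6_rtn(W)
W_hat = q * scale                                   # dequantized weights
max_err = np.abs(W - W_hat).max()
```

Six-bit storage cuts weight bytes by roughly 5.3x versus float32 before the Brotli pass listed below.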
Compression
Brotli
level: null
Novel Contributions
- First SP2048 submission combining SWA, BigramHash, 3-layer depth recurrence, and legal TTT
- 3-layer depth recurrence over layers 3, 4, and 5
- BigramHash embeddings with a 2048-token vocabulary
- Stochastic Weight Averaging starting from fraction 0.75
- Legal score-first test-time training with SGD
- Full GPTQ int6 plus Brotli artifact compression