PR #1339
openRecord: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean)
by bigbag
val_bpb
1.0955
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.49 MB
Training Techniques
Architecture
BigramHash
Adds a bigram hash embedding side channel to the logits.
parameters: {"vocab":2048,"dim":128}
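A minimal sketch of how a bigram hash side channel can work: hash the (previous, current) token pair into one of 2048 buckets and add that bucket's learned bias vector to the next-token logits. The bucket count follows the stated parameters; the mixing constant and lookup shape are illustrative assumptions, not taken from the submission.

```python
NUM_BUCKETS = 2048  # "vocab": 2048 from the stated parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash the (previous, current) token pair into a bucket index.
    The multiplicative mixing constant 1000003 is an illustrative choice."""
    return (prev_tok * 1000003 + cur_tok) % NUM_BUCKETS

def add_bigram_logits(logits, tokens, bias_table):
    """Add the learned bias vector for the last bigram's bucket to the
    next-token logits (the 'side channel' on top of the usual LM head)."""
    bucket = bigram_bucket(tokens[-2], tokens[-1])
    return [l + b for l, b in zip(logits, bias_table[bucket])]
```

The table lookup is O(1) per position, so the side channel adds memorization capacity for frequent bigrams at almost no compute cost.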
depth recurrence
Uses 3-layer depth recurrence across layers 3, 4, and 5.
parameters: {"layers":[3,4,5],"start_step":3000}
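One way to read this: once training passes start_step, layers 3–5 are re-applied with shared weights, adding effective depth without new parameters. The PR states only the layer span and the start step, so the loop count of 2 below is an assumption for illustration.

```python
def forward(blocks, x, step, recur_span=(3, 5), start_step=3000, loops=2):
    """Run a block stack; after start_step, layers recur_span[0]..recur_span[1]
    are applied `loops` times with shared weights (depth recurrence).
    `loops` is an assumed value; the submission does not state it."""
    i = 0
    while i < len(blocks):
        if step >= start_step and i == recur_span[0]:
            for _ in range(loops):  # re-enter the shared span
                for j in range(recur_span[0], recur_span[1] + 1):
                    x = blocks[j](x)
            i = recur_span[1] + 1
        else:
            x = blocks[i](x)
            i += 1
    return x
```

With 8 blocks and loops=2, the model executes 11 block applications at inference while storing only 8 blocks' worth of weights.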
weight tying
Weight tying between the input embedding and output projection over the SP2048 vocabulary is implied by the canonical model family; the submission does not state it explicitly.
parameters: {"vocab_size":2048}
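If weight tying is in play here (the submission does not confirm it), it typically means one matrix serves as both the token embedding and the output projection. A pure-Python sketch of that sharing, with the class name and shapes as illustrative assumptions:

```python
class TiedLM:
    """Toy model where the embedding matrix and the output projection
    share one parameter (classic weight tying). Hypothetical sketch."""
    def __init__(self, vocab_size=2048, dim=128):
        # one (vocab_size x dim) matrix used for both embed and unembed
        self.W = [[0.0] * dim for _ in range(vocab_size)]

    def embed(self, tok):
        """Look up the shared row for a token id."""
        return self.W[tok]

    def logits(self, h):
        """Dot the hidden state against every row of the shared matrix."""
        return [sum(a * b for a, b in zip(row, h)) for row in self.W]
```

For a 2048 x 128 matrix this saves roughly 260K parameters versus separate embedding and head, which matters at a ~15 MB artifact budget.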
parallel residuals
Applies parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
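A parallel residual block (GPT-J style) feeds the same input to attention and MLP and sums both into the residual stream, versus the standard sequential form used before start_layer 7. A minimal sketch of the two variants:

```python
def parallel_block(x, attn, mlp):
    """Parallel residual: attention and MLP read the same input;
    their outputs are summed into the residual stream."""
    return x + attn(x) + mlp(x)

def sequential_block(x, attn, mlp):
    """Standard sequential residual, as in layers before start_layer=7:
    the MLP reads the post-attention state."""
    x = x + attn(x)
    return x + mlp(x)
```

The parallel form lets attention and MLP run concurrently at the cost of the MLP no longer seeing the attention output.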
SP2048 vocabulary
Uses a 2048-token SentencePiece BPE vocabulary.
parameters: {"vocab_size":2048}
Weight Averaging
SWA
parameters: {"start_frac":0.75}
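With start_frac 0.75, stochastic weight averaging keeps a running mean of parameter snapshots over the final quarter of training. A sketch over a stream of flattened parameter vectors (the snapshot cadence is an assumption; SWA is often applied per epoch or per N steps):

```python
def swa_average(params_per_step, total_steps, start_frac=0.75):
    """Running average of parameter snapshots from start_frac of
    training onward (stochastic weight averaging)."""
    start = int(total_steps * start_frac)
    avg, n = None, 0
    for step, params in enumerate(params_per_step):
        if step < start:
            continue
        n += 1
        if avg is None:
            avg = list(params)
        else:
            # incremental mean: avg += (p - avg) / n, elementwise
            avg = [a + (p - a) / n for a, p in zip(avg, params)]
    return avg
```

The incremental-mean form avoids storing all snapshots, so the memory cost is one extra copy of the weights.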
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3}
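"Score-first" is what makes the TTT legal: each evaluation chunk is scored with the current weights before the model adapts on it, so no chunk is ever scored by a model that has already seen it. A sketch assuming a model object with `loss` and `sgd_step` methods (hypothetical interface, not the submission's code):

```python
def score_first_ttt(model, chunks, lr=0.002, epochs=3):
    """Legal test-time training: score each chunk FIRST, then adapt
    on it, so evaluation never leaks the chunk being scored."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += model.loss(chunk)   # score before any update
        for _ in range(epochs):
            model.sgd_step(chunk, lr)     # then train on the scored chunk
    return total_loss / len(chunks)
```

Later chunks still benefit from adaptation on earlier ones, which is where the bpb gain comes from.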
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022,"variant":"MuonEq-R","qk_gain":5}
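The core of a standard Muon update is orthogonalizing the momentum buffer with a quintic Newton-Schulz iteration; the coefficients below are from the public Muon reference implementation. The "MuonEq-R" variant and qk_gain named in this PR are not publicly documented, so this sketch covers only the common core:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def fro_norm(A):
    return sum(v * v for row in A for v in row) ** 0.5

def scale(A, s):
    return [[v * s for v in row] for row in A]

def madd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def newton_schulz(G, steps=5):
    """Approximately orthogonalize the momentum buffer G via the quintic
    Newton-Schulz iteration used by Muon. Coefficients follow the public
    Muon reference; the MuonEq-R specifics are not reproduced here."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = scale(G, 1.0 / (fro_norm(G) + 1e-7))  # normalize so iteration converges
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = madd(scale(A, b), scale(matmul(A, A), c))
        X = madd(scale(X, a), matmul(B, X))
    return X
```

The orthogonalized matrix replaces the raw momentum in the weight update, scaled by matrix_lr for 2-D parameters.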
Quantization
GPTQ
bits: 6
scope: full model
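GPTQ proper compensates quantization error weight-by-weight using second-order (Hessian) information; as a hedged stand-in, this sketch shows only the simpler symmetric round-to-nearest step and the signed 6-bit storage range that the full procedure also targets:

```python
def quantize_rtn(row, bits=6):
    """Per-row symmetric round-to-nearest quantization (NOT the full
    GPTQ error-compensation procedure), showing the int6 format."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for signed 6-bit
    scale = max(abs(w) for w in row) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from int codes and the row scale."""
    return [v * scale for v in q]
```

At 6 bits the codes span [-32, 31]; packing them plus Brotli on the serialized artifact is what gets the model under the ~15.49 MB budget.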
Compression
Brotli
level: null
Novel Contributions
- First SP2048 submission combining SWA, BigramHash, 3-layer depth recurrence, and legal TTT
- 3-layer depth recurrence over layers 3, 4, and 5
- BigramHash embeddings with a 2048-token vocabulary
- Stochastic Weight Averaging from fraction 0.75
- Legal score-first test-time training
- Full GPTQ int6 quantization with Brotli compression