PR #1339 (open)

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean)

val_bpb: 1.0955
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.49 MB

Training Techniques

Architecture
BigramHash
Adds a bigram hash embedding side channel to the logits.
parameters: {"vocab":2048,"dim":128}
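A minimal sketch of the bigram hash side channel, with assumptions: the hash function, bucket count, and the collapse of the record's 128-d embedding into a scalar per-bucket logit offset are all hypothetical simplifications, not the submission's actual implementation.

```python
import hashlib

NUM_BUCKETS = 2048           # "vocab": 2048 from the record
TABLE = [0.0] * NUM_BUCKETS  # learned per-bucket logit offsets (zero-init here)

def bigram_bucket(prev_tok: int, cand_tok: int) -> int:
    # Deterministically hash the (previous, candidate) token pair into a bucket.
    key = f"{prev_tok}:{cand_tok}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % NUM_BUCKETS

def add_bigram_channel(prev_tok: int, logits: list) -> list:
    # Add the hashed bigram offset to each candidate next-token logit.
    return [l + TABLE[bigram_bucket(prev_tok, v)] for v, l in enumerate(logits)]
```

In the record's configuration ("dim": 128), each bucket would instead hold a 128-d embedding projected into the logits rather than a scalar offset.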
depth recurrence
Uses 3-layer depth recurrence across layers 3, 4, and 5.
parameters: {"layers":[3,4,5],"start_step":3000}
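One hypothetical reading of "3-layer depth recurrence": blocks 3-5 are looped in the forward pass. The loop count and weight sharing are assumptions (the record states only the layer indices), and the sketch ignores the training schedule implied by "start_step": 3000 (recurrence switched on after that step).

```python
def forward(x, blocks, recur_span=(3, 6), n_loops=2):
    # Run the block stack; blocks recur_span[0]..recur_span[1]-1 (layers
    # 3, 4, 5 in the record) are looped n_loops times instead of once.
    i = 0
    while i < len(blocks):
        if i == recur_span[0]:
            for _ in range(n_loops):
                for j in range(recur_span[0], recur_span[1]):
                    x = blocks[j](x)
            i = recur_span[1]
        else:
            x = blocks[i](x)
            i += 1
    return x
```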
weight tying
Weight tying between the token embedding and the output head is implied by the canonical model family; the submission does not state it explicitly.
parameters: {"vocab_size":2048}
parallel residuals
Applies parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
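A toy sketch of the parallel-residual block shape, shown next to the conventional sequential block for contrast. The norm/attention/MLP callables here are stand-ins; in the submission only layers from index 7 onward would use the parallel form.

```python
def parallel_block(x, attn, mlp, norm):
    # Parallel residual: attention and MLP both read the same normed input,
    # and their outputs are summed into the residual stream in one step.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    # Conventional block for comparison: the MLP sees the post-attention state.
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```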
SP2048 vocabulary
Uses a 2048-token SentencePiece BPE vocabulary.
parameters: {"vocab_size":2048}
Weight Averaging
SWA
parameters: {"start_frac":0.75}
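A minimal sketch of what "start_frac": 0.75 means for SWA: average the checkpoints saved in the final quarter of training. Checkpoints are represented here as flat parameter lists; the cadence of checkpointing is an assumption.

```python
def swa_average(checkpoints, start_frac=0.75):
    # Average parameter vectors over checkpoints saved after start_frac
    # of training (the record's start_frac is 0.75).
    start = int(len(checkpoints) * start_frac)
    tail = checkpoints[start:]
    return [sum(ws) / len(tail) for ws in zip(*tail)]
```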
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022,"variant":"MuonEq-R","qk_gain":5}
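For context, the core of a Muon-style update is approximate orthogonalization of the (momentum-accumulated) gradient matrix via a Newton-Schulz iteration. This sketch uses the classic cubic iteration in pure Python; the actual optimizer uses a tuned polynomial, momentum, and the per-matrix learning rate (matrix_lr=0.022), and the "MuonEq-R" variant and qk_gain=5 are submission-specific details not reproduced here.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G via the cubic Newton-Schulz iteration
    # X <- 1.5 X - 0.5 (X X^T) X, after scaling G to unit Frobenius norm.
    norm = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / norm for g in row] for row in G]
    for _ in range(steps):
        B = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * b for x, b in zip(rx, rb)] for rx, rb in zip(X, B)]
    return X
```

On a diagonal matrix the iteration drives each singular value toward 1, which is the intended effect: the update direction keeps the gradient's subspaces but equalizes their scales.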
Quantization
GPTQ
bits: 6
scope: full model
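For a sense of the 6-bit grid involved, here is plain round-to-nearest symmetric quantization. This is a deliberate simplification: GPTQ proper additionally compensates rounding error column-by-column using second-order information, but it targets the same signed 6-bit range.

```python
QMAX = 2**5 - 1   # signed 6-bit integer range is [-32, 31]

def quant_int6(w, scale):
    # Round-to-nearest symmetric 6-bit quantization of a single weight.
    # (GPTQ's error-compensation step is omitted; the grid is the same.)
    return max(-QMAX - 1, min(QMAX, round(w / scale)))

def dequant(q, scale):
    # Map the stored integer back to a float weight.
    return q * scale
```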
Compression
Brotli
level: null

Novel Contributions

  • First SP2048 submission combining SWA, BigramHash, 3-layer depth recurrence, and legal TTT
  • 3-layer depth recurrence over layers 3, 4, and 5
  • BigramHash embeddings with a 2048-token vocabulary
  • Stochastic Weight Averaging from fraction 0.75
  • Legal score-first test-time training
  • Full GPTQ int6 quantization with Brotli compression