PR #1150

open

Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194)

by sahiee-dev
val_bpb: 1.1151
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95-15.96 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
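With 8 query heads and 4 KV heads, each KV head is shared by two query heads. A minimal sketch of that mapping follows. It assumes the common contiguous grouping, which the card does not confirm, and `kv_head_for` is a hypothetical helper name.

```python
HEADS = 8      # from the card: "heads": 8
KV_HEADS = 4   # from the card: "kv_heads": 4

def kv_head_for(query_head: int, heads: int = HEADS, kv_heads: int = KV_HEADS) -> int:
    """Return the index of the KV head that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    # Assumption: contiguous grouping, so heads 0-1 share KV head 0, and so on.
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the key and value projection parameters relative to full multi-head attention.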
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"rope_dims":16}
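The idea of rotating only 16 dimensions per head can be sketched as below. The card gives only `rope_dims`. The frequency base of 10000, the half-split pairing, the choice of the first 16 dimensions, and the head width of 64 are all assumptions made for illustration.

```python
import math

ROPE_DIMS = 16  # from the card: "rope_dims": 16

def partial_rope(vec, position, rope_dims=ROPE_DIMS, base=10000.0):
    """Rotate the first `rope_dims` entries of `vec`; pass the rest through unchanged."""
    if rope_dims % 2 != 0 or rope_dims > len(vec):
        raise ValueError("rope_dims must be even and no larger than len(vec)")
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        # Assumption: standard RoPE frequency schedule with base 10000.
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # assumption: half-split pairing (i, i + half)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

head = [1.0] * 64  # hypothetical head width; not stated in the card
rotated = partial_rope(head, position=3)
```

Dimensions 16 and above carry no positional signal in this sketch, so they behave like content-only features.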
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"squared":true,"mlp_mult":2.8}
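A LeakyReLU squared activation applies LeakyReLU and then squares the result. The sketch below shows the scalar form. The negative slope of 0.5 is an assumption, since the card does not state it.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring. The slope value is an assumption."""
    y = x if x >= 0 else negative_slope * x
    return y * y

print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0
```

Unlike ReLU squared, negative inputs still produce a nonzero output and gradient. The `mlp_mult` of 2.8 would set the MLP hidden width relative to the model width, which the card does not give.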
BigramHash
Bigram hash embedding module.
parameters: {"vocab":1536,"dim":128}
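A bigram hash embedding can be sketched as a lookup keyed by a hash of the previous and current token IDs. The sketch reads `"vocab":1536` as the number of hash buckets, which is an interpretation. The mixing function and the random initialisation are hypothetical, as the card does not describe the PR's actual hash.

```python
import random

BUCKETS = 1536  # from the card: "vocab": 1536 (read here as the bucket count)
DIM = 128       # from the card: "dim": 128

def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    """Map a (previous, current) token pair to a bucket index."""
    # Hypothetical mixing function, chosen only for illustration.
    return (prev_token * 1000003 + token) % buckets

# Stand-in for a learned embedding table: one DIM-wide row per bucket.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_embedding(prev_token: int, token: int):
    """Look up the embedding row for a token pair."""
    return table[bigram_bucket(prev_token, token)]

vector = bigram_embedding(17, 42)
```

Distinct pairs can collide in the same bucket, which is the trade-off that keeps the table small.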
VE128
VE128 used at layers 9-10.
parameters: {"layers":[9,10]}
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
LN Scale
LayerNorm scale modification.
parameters: null
SmearGate
SmearGate component included in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections in the transformer backbone.
parameters: null
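One common way to wire U-Net style skips into a layer stack is to save the outputs of the first half and add them back in reverse order during the second half. The sketch below shows that pattern with scalar stand-ins for layers. It is an assumption about the wiring, and it omits any learned skip weights.

```python
def unet_forward(x, layers):
    """Run `layers`, adding first-half outputs back in the second half in reverse order."""
    n_enc = len(layers) // 2
    skips = []
    for layer in layers[:n_enc]:
        x = layer(x)
        skips.append(x)          # remember the output of each early layer
    for layer in layers[n_enc:]:
        if skips:
            x = x + skips.pop()  # deepest early output is reused first
        x = layer(x)
    return x

# Toy layers that each add 1.0, to make the data flow easy to trace.
layers = [lambda v: v + 1.0] * 4
print(unet_forward(0.0, layers))  # 7.0
```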
Weight Averaging
EMA + SWA
parameters: {"decay":0.997}
Quantization
GPTQ-lite
bits: 6
scope: all
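To make the 6-bit setting concrete, the sketch below uses plain symmetric round-to-nearest quantization with one scale per row. This is a simpler stand-in and not GPTQ-lite itself, whose details the card does not give. The function names are hypothetical.

```python
BITS = 6  # from the card: "bits": 6

def quantize_row(row, bits=BITS):
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
```

Each restored weight differs from its original by at most half a quantization step, `scale / 2`.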
Compression
lzma
level: 6
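Compression at this level can be reproduced with the standard-library `lzma` module, where level 6 corresponds to `preset=6`. The payload below is a hypothetical stand-in for the serialized weights, and `compress_artifact` is an illustrative name.

```python
import lzma

def compress_artifact(payload: bytes, level: int = 6) -> bytes:
    """Compress a serialized artifact with LZMA at the given preset level."""
    return lzma.compress(payload, preset=level)

payload = bytes(range(64)) * 1024  # hypothetical stand-in for quantized weights
packed = compress_artifact(payload)
print(len(payload), len(packed))   # original vs. compressed size in bytes
```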
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
SGD
weight_decay: null
momentum: null
other_params: {"test_time_training":true,"epochs":3}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"chunk_tokens":32768,"batch_seqs":32,"epochs":3}
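The ordering that "score-first" implies can be sketched as: record the loss on a chunk with the current parameters, and only then train on that chunk. The sketch below uses a toy one-parameter model with squared error in place of the transformer, so only the control flow is meant to carry over. It reuses `chunk_tokens`, `epochs`, and `learning_rate` from the card and omits `batch_seqs`.

```python
def score_first_ttt(tokens, mu=0.0, chunk_tokens=32768, epochs=3, lr=0.002):
    """Toy score-first loop: each chunk is scored before the model adapts to it."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score first: the recorded loss uses parameters that have not seen this chunk.
        total_loss += sum((x - mu) ** 2 for x in chunk)
        # 2) Then adapt: a few full-batch SGD epochs on the chunk just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
            mu -= lr * grad
    return total_loss / len(tokens), mu

data = [1.0] * 8
one_chunk_loss, _ = score_first_ttt(data, chunk_tokens=8, lr=0.1)
two_chunk_loss, adapted_mu = score_first_ttt(data, chunk_tokens=4, lr=0.1)
```

With a single chunk the recorded loss is untouched by adaptation. With two chunks, the second is scored by a model that has already trained on the first, so the mean loss drops.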
Other
SLOT
SLOT test-time adaptation using a per-batch residual delta optimized on top of frozen hidden states before the final logits projection.
parameters: {"lr":0.003,"steps":5}
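The description above can be sketched as a small optimisation over a single vector `delta` that is added to every frozen hidden state before the output projection. The sketch assumes plain gradient descent, zero initialisation, a cross-entropy objective, and one delta shared across the batch. The card states none of these beyond the `lr` and `steps` values, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def batch_loss_and_grad(hidden, targets, W, delta):
    """Mean cross-entropy of W @ (h + delta) and its gradient w.r.t. delta."""
    dim = len(delta)
    grad = [0.0] * dim
    loss = 0.0
    for h, t in zip(hidden, targets):
        z = [hi + di for hi, di in zip(h, delta)]
        probs = softmax([sum(w * zi for w, zi in zip(row, z)) for row in W])
        loss -= math.log(probs[t])
        for v, row in enumerate(W):
            err = probs[v] - (1.0 if v == t else 0.0)
            for j in range(dim):
                grad[j] += err * row[j]  # accumulates W^T (probs - onehot)
    n = len(hidden)
    return loss / n, [g / n for g in grad]

def slot_adapt(hidden, targets, W, lr=0.003, steps=5):
    """Optimise only delta; hidden states and W stay frozen."""
    delta = [0.0] * len(hidden[0])  # assumption: zero initialisation
    for _ in range(steps):
        _, grad = batch_loss_and_grad(hidden, targets, W, delta)
        # Assumption: plain gradient descent; the card does not name the optimizer.
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

hidden = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen hidden states
targets = [0, 0]                   # toy target token ids
W = [[1.0, 0.0], [0.0, 1.0]]       # toy output projection
before, _ = batch_loss_and_grad(hidden, targets, W, [0.0, 0.0])
delta = slot_adapt(hidden, targets, W)
after, _ = batch_loss_and_grad(hidden, targets, W, delta)
```

Because only `delta` changes, the adaptation costs one small vector per batch and leaves the model weights untouched.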
Regularization
LN Scale
parameters: null

Novel Contributions

  • Adds SLOT test-time adaptation on top of legal score-first TTT.
  • Uses a per-batch residual delta in hidden space to adapt logits without updating model weights.
  • Combines legal TTT with SLOT while preserving score-first, left-to-right evaluation constraints.
  • Achieves a 3-seed mean val_bpb of 1.11512, beating the merged SOTA of 1.1194.
  • Keeps artifact size under 16MB and evaluation time under 600s across all seeds.
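For reference, bits per byte is conventionally the summed negative log-likelihood converted from nats to bits and divided by the byte count of the evaluated text. The sketch below shows that conversion and the margin over the merged SOTA quoted above. It assumes the PR follows this standard definition, and `bits_per_byte` is a hypothetical helper name.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Margin between the merged SOTA and the reported 3-seed mean.
margin = 1.1194 - 1.11512
print(round(margin, 5))  # 0.00428
```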