PR #573
closedRecord: 11L XSA4 + Multi-Pass Streaming Score-First Legal TTT (3-seed mean val_bpb=1.0523)
by SarimsaljookView on GitHub
val_bpb
1.0523
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.92 MB
Training Techniques
Test-Time Training
score-first multi-pass legal TTT
parameters: {"passes":3,"learning_rate":0.0005,"batch_size":"16 sequences x 2048 tokens","optimizer":"AdamW"}
Architecture
XSA
Exclusive Self-Attention in the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to a subset of head dimensions
parameters: {"dimensions":16,"total_head_dims":64}
MLP3x
3x MLP expansion
parameters: {"hidden_dim":1536}
tied embeddings
Input and output embeddings are tied
parameters: null
U-Net skip connections
Encoder-decoder skip connections in the transformer stack
parameters: {"encoder_layers":5,"decoder_layers":6}
SmearGate
Custom gating mechanism combined with BigramHash
parameters: null
BigramHash
Bigram hashing module with bucketed representation
parameters: {"buckets":2048,"dim":128}
Quantization
int6 QAT
bits: 6
scope: MLP + attention
STE QAT
bits: 6
scope: late QAT when LR scale < 0.15
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: 0
momentum: null
other_params: {"used_for":"TTT and embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency_steps":50,"start_condition_lr_scale":0.2}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"score_first":true,"multi_pass":true}
LR Schedule
cosine decay
parameters: {"to_zero":true,"warmdown_steps":3500}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Other
other
Rotary cache backprop fix by cloning cached cos/sin tensors to avoid inference tensor autograd errors
parameters: null
Novel Contributions
- Multi-pass streaming score-first legal test-time training with shifted data orderings
- Per-token final score chosen as the minimum NLL across three adaptation trajectories
- Rotary cache backprop fix using cloned cached cos/sin tensors to enable TTT through partial RoPE
- Legal TTT evaluation that scores tokens before any training on those tokens