PR #714
Status: open
Add 11L RotaryFix + LegalTTT + BIGRAM3072 — val_bpb 1.11869 (3-seed m…
by UpsallaView on GitHub
val_bpb: 1.1187
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~16.06 MB
Training Techniques
Architecture
RoPE
Fixes a Rotary NTK-scaling bug: train_seq_len=2048 is now correctly propagated to both base_model and eval_model instead of being hardcoded to 1024.
parameters: {"train_seq_len":2048}
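A minimal sketch of the fixed behavior (my reconstruction, not the PR's code), assuming standard NTK-aware rotary base scaling; the point is that `train_seq_len` must carry the real training length (2048), not a hardcoded 1024:

```python
import torch

def rope_angles(head_dim: int, seq_len: int, train_seq_len: int = 2048,
                base: float = 10000.0) -> torch.Tensor:
    # NTK-aware scaling: when the eval context exceeds the training
    # context, stretch the rotary base so low frequencies interpolate.
    # Passing the wrong train_seq_len (e.g. a hardcoded 1024) silently
    # changes these angles -- the class of bug this PR fixes.
    if seq_len > train_seq_len:
        scale = seq_len / train_seq_len
        base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(torch.arange(seq_len).float(), inv_freq)
```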
BigramHash
Expanded bigram vocabulary size for the hash-based bigram component.
parameters: {"vocab_size":3072}
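A hedged sketch of a hash-based bigram component with the expanded bucket count; the mixing constant, embedding dimension, and class layout are illustrative assumptions, only the 3072 bucket count comes from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    # Hash each (previous token, current token) pair into a small bucket
    # table; 3072 is the expanded bucket count from this PR. The mixing
    # constant and embedding dim are illustrative assumptions.
    def __init__(self, n_buckets: int = 3072, dim: int = 64):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # first position has no predecessor
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.emb(h)
```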
MLP3x
Uses a 3x-expansion MLP with a squared LeakyReLU activation (negative slope 0.5).
parameters: null
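A minimal sketch of such a block, assuming bias-free linears and the usual up/down layer naming (assumptions, not the PR's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP3x(nn.Module):
    # 3x-expansion MLP whose activation is LeakyReLU(negative_slope=0.5)
    # squared; layer names and the bias-free choice are assumptions.
    def __init__(self, dim: int):
        super().__init__()
        self.up = nn.Linear(dim, 3 * dim, bias=False)
        self.down = nn.Linear(3 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.up(x), negative_slope=0.5)
        return self.down(h * h)  # squared activation, cf. ReLU^2 MLPs
```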
XSA
XSA applied to the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Partial rotary positional embeddings using 16 of 64 dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
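A sketch of partial rotary application with the stated 16-of-64 split, assuming the rotated dims are the leading ones and an interleaved pair layout (both assumptions):

```python
import torch

def apply_partial_rope(x: torch.Tensor, angles: torch.Tensor,
                       rot_dims: int = 16) -> torch.Tensor:
    # Rotate only the first 16 of the 64 head dims; the remaining dims
    # pass through position-independent. `angles`: (seq, rot_dims/2).
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```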
VE128
VE enabled in layers 9-10.
parameters: {"dim":128,"layers":[9,10]}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
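For orientation, a single Muon-style update step with the listed hyperparameters. This is a readable stand-in, not the PR's optimizer: real (Parallel) Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration, sharded across devices; here SVD is used because it has the same fixed point (UVᵀ):

```python
import torch

def muon_step(param: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.025, momentum: float = 0.99,
              weight_decay: float = 0.04) -> torch.Tensor:
    # Momentum-accumulate the gradient, orthogonalize the 2D update,
    # then apply it with decoupled weight decay. SVD stands in for the
    # Newton-Schulz iteration used by the real optimizer.
    buf.mul_(momentum).add_(grad)
    U, _, Vh = torch.linalg.svd(buf, full_matrices=False)
    update = U @ Vh
    param.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    return param
```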
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
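The EMA half can be sketched as a per-step lerp of a shadow model toward the live weights (decay=0.997 from this PR); SWA would instead snapshot every 50 steps into an equal-weight running average:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module,
               decay: float = 0.997) -> None:
    # Exponential moving average of weights: shadow <- decay * shadow
    # + (1 - decay) * live, applied after each optimizer step.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```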
Quantization
GPTQ-lite
bits: 6
scope: all
QAT
bits: 6
scope: all
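The QAT half can be sketched as 6-bit fake quantization with a straight-through estimator; symmetric per-tensor scaling is my assumption, and GPTQ-lite would additionally correct quantization error at export time:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator: the forward pass sees 6-bit weights, the backward pass
    # passes gradients through unchanged.
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()  # STE: identity gradient
```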
Evaluation
sliding window eval
parameters: {"chunk_size":32768}
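A sketch of chunked evaluation with the PR's chunk_size, assuming `model(chunk)` returns mean next-token cross-entropy in nats (an assumption about the interface); converting bits/token to bits/byte would additionally need the tokens-to-bytes ratio:

```python
import math
import torch

@torch.no_grad()
def sliding_eval(model, tokens: torch.Tensor,
                 chunk_size: int = 32768) -> float:
    # Walk a long token stream in 32768-token chunks and aggregate a
    # length-weighted mean loss, converted from nats to bits per token.
    total_nats, total_tokens = 0.0, 0
    for start in range(0, tokens.numel() - 1, chunk_size):
        chunk = tokens[start:start + chunk_size + 1]  # +1 target overlap
        n = chunk.numel() - 1
        total_nats += model(chunk).item() * n
        total_tokens += n
    return total_nats / total_tokens / math.log(2)
```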
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs_per_chunk":3,"optimizer":"SGD","momentum":0.9,"freeze_blocks":0}
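The score-first loop can be sketched as follows; helper signatures are assumptions, only the hyperparameters come from the PR. The key invariant is that a chunk is scored before the model ever trains on it, so no chunk's score benefits from its own tokens:

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr: float = 2e-3,
                    epochs_per_chunk: int = 3):
    # "Legal" TTT: score each chunk with the current weights FIRST,
    # then train on that already-scored chunk before moving on.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scores = []
    for chunk in chunks:
        with torch.no_grad():  # per the PR: no_grad, not inference_mode
            scores.append(loss_fn(model, chunk).item())
        for _ in range(epochs_per_chunk):
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            opt.step()
    return scores
```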
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
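One plausible reading of this schedule is a constant LR followed by a cosine warmdown over the final 3500 steps; the exact shape is an interpretation of the listed parameter, not confirmed by the PR:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 0.025,
          warmdown_steps: int = 3500) -> float:
    # Constant LR until the warmdown window, then cosine decay to zero
    # over the final warmdown_steps steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```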
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
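The formula can be wrapped as a depth-indexed norm, so layer 0 is unscaled and deeper layers contribute progressively less; the module wrapper is my construction, the PR only gives the formula:

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    # LayerNorm whose output is damped by 1/sqrt(layer + 1).
    def __init__(self, dim: int, layer: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = 1.0 / (layer + 1) ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.ln(x)
```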
Other
other
Late QAT threshold tuning to extend quantization-aware adaptation time.
parameters: {"threshold":0.57}
Novel Contributions
- Fixed a Rotary NTK-scaling bug by correctly propagating train_seq_len=2048 to both training and evaluation models.
- Applied a previously unreported eval_model Rotary fix affecting the causal TTT scoring window.
- Increased BigramHash vocabulary size to 3072.
- Raised the late QAT threshold to 0.57 to allow substantially more QAT steps.
- Used torch.no_grad() instead of torch.inference_mode() during TTT scoring to avoid Autograd graph corruption across RoPE cache boundaries.
- Introduced Legal TTT: score-first, backward-looking test-time training on already-scored chunks.
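The no_grad-versus-inference_mode point above can be demonstrated in isolation (a standalone toy, not the PR's code): a cache built under `torch.inference_mode()` yields inference tensors that cannot later enter an autograd graph, whereas a `torch.no_grad()` cache can:

```python
import torch

x = torch.randn(4)

# A cache built under inference_mode produces inference tensors that
# cannot participate in autograd later -- e.g. a RoPE cache reused
# while TTT needs gradients.
with torch.inference_mode():
    cache_inf = x * 2.0
failed = False
try:
    y = cache_inf * torch.randn(4, requires_grad=True)
    y.sum().backward()
except RuntimeError:
    failed = True  # inference tensors cannot be saved for backward

# Under no_grad the cache is an ordinary tensor and reuse is safe.
with torch.no_grad():
    cache_ok = x * 2.0
z = cache_ok * torch.randn(4, requires_grad=True)
z.sum().backward()
```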