PR #964

open

Record: Doc-Isolated TTT + Eval Optimizations

by vivekvar-dl
val_bpb
1.3900
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size

Training Techniques

Test-Time Training
full TTT
parameters: {"document_isolated":true,"reset_at_bos":true}
Evaluation
temperature scaling
parameters: {"temperature_range":[0.9,1]}
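A minimal sketch of temperature scaling at evaluation time, assuming the logits are divided by a scalar temperature before the softmax and that the temperature is picked from a small grid inside the listed range. The function names, the grid step, and the toy logits are hypothetical; the card does not say how the PR selects the temperature.

```python
import math

def nll_bits(logits, target, temperature=1.0):
    """Loss of `target` in bits after dividing the logits by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return (log_z - scaled[target]) / math.log(2)

def best_temperature(examples, temperatures):
    """Return the (temperature, mean_bits) pair with the lowest mean loss."""
    scored = []
    for t in temperatures:
        mean_bits = sum(nll_bits(lg, tg, t) for lg, tg in examples) / len(examples)
        scored.append((t, mean_bits))
    return min(scored, key=lambda pair: pair[1])

# Hypothetical toy data: (logits, target index) pairs.
examples = [([2.0, 0.5, -1.0], 0), ([0.1, 1.5, 0.3], 1), ([1.0, 1.2, 0.0], 0)]
# Grid over the range listed above; the 0.02 step is an assumption.
grid = [round(0.90 + 0.02 * i, 2) for i in range(6)]
temperature, mean_bits = best_temperature(examples, grid)
```

Because 1.0 is in the grid, the selected temperature can never score worse than the unscaled model on the data used for selection. Whether that data may overlap with the scored validation tokens is a rules question the card does not answer.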
Architecture
LeakyReLU
Uses LeakyReLU squared MLP activation
parameters: {"squared":true,"negative_slope":0.5}
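As an illustration of the activation described above, the sketch below assumes "LeakyReLU squared" means applying LeakyReLU with the listed negative slope and then squaring the result. The function name is hypothetical, and the PR may handle the sign of negative inputs differently.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU, then square the result (assumed ordering)."""
    y = x if x >= 0 else negative_slope * x
    return y * y

# Positive inputs behave like ReLU squared; negative inputs are scaled
# by the slope before squaring, so they still contribute a nonzero value.
values = [leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)]
```

Under this reading the output is always non-negative, and the slope only controls how strongly negative pre-activations contribute.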
XSA
Uses XSA attention variant
parameters: {"version":4}
Partial RoPE
Applies partial rotary positional embeddings
parameters: {"train":"16/64"}
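One way to read "16/64" is that 16 of the 64 dimensions in each attention head receive the rotary rotation while the remaining 48 pass through unchanged. The sketch below assumes that reading, along with adjacent-pair rotation and a frequency base of 10000; none of these details are confirmed by the card.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec` pairwise; leave the rest."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Each pair (i, i+1) gets its own rotation frequency.
        theta = position * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

head = [1.0] * 64              # hypothetical 64-dimensional query or key vector
rotated = partial_rope(head, position=3)
```

Since each pair is rotated, the vector norm is preserved, and the untouched 48 dimensions carry no positional signal.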
VE128
Uses VE128 component
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA + SWA
parameters: null
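The card lists no parameters for weight averaging, so the following is only a generic sketch of the two update rules by their usual definitions: an exponential moving average (EMA) and the running mean used by stochastic weight averaging (SWA). The decay value and function names are assumptions, and how the PR combines the two is not stated.

```python
def ema_update(avg, params, decay=0.999):
    """Exponential moving average: new_avg = decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_update(avg, params, n_averaged):
    """Running mean over checkpoints; `n_averaged` counts those already in `avg`."""
    return [a + (p - a) / (n_averaged + 1) for a, p in zip(avg, params)]

# Hypothetical usage with a two-parameter "model".
ema = ema_update([0.0, 0.0], [1.0, 2.0], decay=0.5)
swa = swa_update([1.0, 1.0], [3.0, 5.0], n_averaged=1)
```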
Quantization
GPTQ-lite
bits: 6
scope: all
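To make the 6-bit setting concrete, the sketch below shows plain symmetric round-to-nearest quantization to 6-bit signed integers. This is not GPTQ or the PR's GPTQ-lite procedure, which the card does not describe; it only illustrates what storing a weight in 6 bits implies. The per-tensor scale and the function names are assumptions.

```python
def quantize_6bit(weights):
    """Map floats to signed 6-bit codes in [-32, 31] with one shared scale."""
    qmax = 2 ** (6 - 1) - 1  # 31
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.0, 0.25, 0.0]   # hypothetical weight values
codes, scale = quantize_6bit(weights)
restored = dequantize(codes, scale)
```

With round-to-nearest, each restored weight differs from the original by at most half a quantization step.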
Compression
lzma
level: null
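A short sketch of measuring a compressed artifact with Python's standard `lzma` module. The compression level is listed as null above, so `preset=9` here is an assumption, and the payload is a hypothetical stand-in for serialized quantized weights.

```python
import lzma

def compressed_size(payload, preset=9):
    """Return the number of bytes `payload` occupies after LZMA compression."""
    return len(lzma.compress(payload, preset=preset))

# Hypothetical payload: a repetitive byte string standing in for packed weights.
payload = bytes([1, 2, 3, 4]) * 4096
blob = lzma.compress(payload, preset=9)
restored = lzma.decompress(blob)   # round-trip check: compression is lossless
size = compressed_size(payload)
```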
Regularization
LN scale
parameters: null

Novel Contributions

  • Document-isolated TTT by resetting optimizer state at BOS document boundaries
  • Temperature scaling during evaluation on the quantized model
  • Evaluation of document isolation as a test-time adaptation improvement on the frontier architecture
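The first contribution can be sketched with a toy example: a single scalar parameter adapted by momentum SGD, where each prediction is scored before the update and the optimizer state is cleared at every BOS token. The BOS id, the `MomentumSGD` class, the squared-error loss, and the choice to also restart the parameter from its base value are all assumptions for illustration; the card only states that optimizer state is reset at BOS boundaries.

```python
BOS = 0  # hypothetical BOS token id

def split_at_bos(tokens, bos=BOS):
    """Split a flat token stream into documents, each starting at a BOS token."""
    docs = []
    for tok in tokens:
        if tok == bos or not docs:
            docs.append([])
        docs[-1].append(tok)
    return docs

class MomentumSGD:
    """Toy optimizer for one scalar parameter with a momentum buffer."""
    def __init__(self, lr=0.1, momentum=0.9):
        self.lr, self.momentum = lr, momentum
        self.velocity = 0.0

    def reset(self):
        self.velocity = 0.0

    def step(self, param, grad):
        self.velocity = self.momentum * self.velocity + grad
        return param - self.lr * self.velocity

def doc_isolated_ttt(tokens, base_param=0.0):
    """Score each token, then adapt, with no state shared across documents."""
    opt = MomentumSGD()
    losses = []
    for doc in split_at_bos(tokens):
        opt.reset()            # drop optimizer state from the previous document
        param = base_param     # assumption: weights also restart from the base
        for tok in doc:
            if tok == BOS:
                continue       # the BOS marker itself is not scored in this toy
            losses.append((param - tok) ** 2)   # score first
            grad = 2.0 * (param - tok)
            param = opt.step(param, grad)       # then adapt on the scored token
    return losses

losses = doc_isolated_ttt([0, 5, 5, 5, 0, 5, 5, 5])
```

In this toy stream the two documents are identical, so isolation makes their loss sequences identical as well: nothing learned on the first document leaks into the second.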