PR #1366

open

Non-record: EMA+SWA Tight Averaging with Fused TTT LoRA + Sliding Window (1.1371 BPB)

by yunoshev
val_bpb
1.1371
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.88 MB

Training Techniques

Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"tight_averaging":true,"collect_from":"EMA state","qgrid_lambda":false}
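A minimal sketch of what this averaging scheme could look like, on toy scalar weights: an EMA (decay 0.997) is updated every step, and the SWA average is collected from the EMA state rather than from the raw weights. The collection interval `swa_every` is an assumption, not stated in the submission.

```python
def ema_swa(weights, decay=0.997, swa_every=10):
    """EMA + SWA 'tight averaging' sketch on toy scalar weights.

    weights: per-step raw parameter values.
    The SWA average is collected from the EMA state (per the
    submission), not from the raw weights.
    """
    ema = weights[0]
    swa_sum, swa_n = 0.0, 0
    for step, w in enumerate(weights, start=1):
        ema = decay * ema + (1.0 - decay) * w   # EMA update each step
        if step % swa_every == 0:               # collect SWA from EMA state
            swa_sum += ema
            swa_n += 1
    return swa_sum / max(swa_n, 1)
```

Note that `qgrid_lambda` is disabled here, so the EMA state is averaged as-is rather than being snapped to the quantization grid before SWA collection.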
Test-Time Training
LoRA TTT
parameters: {"rank":8,"fused":true}
Evaluation
sliding window eval
parameters: {"stride":256}
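A sketch of the sliding-window index computation, assuming a window size of 1024 (the window length is not stated in the submission; only the stride of 256 is). Each window scores only its last `stride` tokens, so every token is scored exactly once with maximal left context; in the fused setup, the rank-8 TTT LoRA update would run on each window in the same pass before scoring.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end) but only scores tokens
    [score_from, end), i.e. its last `stride` tokens, so the scored
    spans tile the sequence without overlap.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```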
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92,"warmup_steps":1500,"warmdown_iters":3500}
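A sketch of the momentum warmup implied by these parameters: momentum ramps from 0.92 to 0.99 over the first 1500 steps. The linear ramp shape is an assumption; the submission only gives the endpoints and the step count.

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, final=0.99):
    """Momentum warmup for Muon: linear ramp from `start` to `final`
    over the first `warmup_steps` steps, then constant (ramp shape
    is an assumption)."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```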
Quantization
QAT
bits: null
scope: MLP int5, attention int6
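A sketch of the fake-quantization step used in QAT, shown on a single scalar weight: values are rounded to a signed integer grid and dequantized, so the forward pass sees quantized weights. Per the scope above, MLP weights would use `bits=5` and attention weights `bits=6`; the symmetric scheme and the `scale` values are assumptions.

```python
def fake_quant(x, bits, scale):
    """Symmetric fake quantization (QAT sketch): round to a signed
    int grid with `bits` bits, clip, and dequantize. MLP weights
    would use bits=5, attention weights bits=6 per the submission."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale
```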
GPTQ
bits: null
scope: full Hessians for all tensors
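A toy sketch of the GPTQ Hessian accumulation this setting refers to: the layer Hessian is approximated as H ≈ 2 Σ x xᵀ over calibration inputs x. "Full Hessians for all tensors" means this accumulation runs for every weight matrix (including attention projections and the MLP down-projection) rather than using a diagonal approximation; the list-of-lists representation here is purely illustrative.

```python
def accumulate_hessian(H, x):
    """Accumulate one calibration input into the GPTQ layer Hessian
    H += 2 * x x^T, with H as a d x d list-of-lists (toy sketch)."""
    d = len(x)
    for i in range(d):
        for j in range(d):
            H[i][j] += 2.0 * x[i] * x[j]
    return H
```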
Architecture
BigramHash
Bigram hash embedding component
parameters: {"size":4096}
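A sketch of how a bigram hash embedding lookup of this size could work: each (previous token, current token) pair is hashed into one of 4096 embedding rows. The mixing constants below are illustrative assumptions, not the submission's actual hash.

```python
def bigram_hash(prev_tok, tok, table_size=4096):
    """Map a (previous token, current token) bigram to one of
    `table_size` embedding rows via a cheap hash (constants are
    illustrative, not the submission's actual hash)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 13
    return h % table_size
```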
VE128
Value embedding / value expansion setting
parameters: {"dimensions":128}
XSA
Uses XSA in the last layers
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"percent":25}
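A sketch of partial RoPE at 25%: only the first quarter of each head's dimensions receives the rotary transform, and the rest pass through unrotated. Which dims are rotated and the frequency base are assumptions.

```python
import math

def partial_rope(vec, pos, percent=25, base=10000.0):
    """Apply rotary position embeddings to only the first `percent`
    of the head dimensions; remaining dims pass through unchanged."""
    d = len(vec)
    rot = int(d * percent / 100) // 2 * 2   # rotated dims, even count
    out = list(vec)
    for i in range(0, rot, 2):
        theta = pos / (base ** (i / rot))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s               # 2D rotation of each pair
        out[i + 1] = x * s + y * c
    return out
```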
MLP3x
3x MLP expansion
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
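A sketch of this schedule: the learning rate stays constant and then decays to zero over the final 3500 iterations. The linear decay shape is an assumption; only `warmdown_iters` is given.

```python
def lr_warmdown(step, total_steps, warmdown_iters=3500, base_lr=1.0):
    """Constant LR, then a linear warmdown to zero over the final
    `warmdown_iters` steps (linear shape is an assumption)."""
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```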

Novel Contributions

  • EMA + SWA tight averaging with SWA collected from EMA state
  • Disabling qgrid_lambda to avoid snapping EMA weights to the quantization grid
  • Fused TTT LoRA with sliding window evaluation in a single pass
  • Muon optimizer momentum tuning with warmup and warmdown
  • Full GPTQ Hessians for all tensors including attention projection and MLP down-projection