PR #445

closed

Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval)

by newjordan
val_bpb: 1.1236
Architecture: 11L Transformer
Optimizer: Muon
Artifact Size: 15.59 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
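"GPTQ-lite" is not a standard library name, so as a rough illustration of what 6-bit weight quantization over all layers involves, here is a minimal symmetric round-to-nearest sketch; the PR's actual method (e.g. any GPTQ-style error compensation) may differ:

```python
def quantize_6bit(row):
    """Symmetric 6-bit quantization of one weight row (a sketch; the
    [-31, 31] signed grid is an illustrative choice, not the PR's spec)."""
    qmax = 31
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # map the integer grid back to floats
    return [v * scale for v in q]
```

With per-row absmax scaling, the round-trip error per weight is bounded by half a quantization step (`scale / 2`).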
Architecture
MLP3x
3x MLP with relu^2 activation
parameters: null
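Reading "3x MLP" as an MLP block with a 3x hidden expansion, the relu^2 activation and forward pass can be sketched as follows; the dimensions and weight layout are illustrative, not the PR's actual implementation:

```python
def relu2(x):
    # relu^2 activation: max(x, 0) squared
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """One MLP block with relu^2, reading '3x' as hidden = 3 * d_model.
    w_in: 3*d columns of length d; w_out: d columns of length 3*d."""
    hidden = [relu2(sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in w_out]
```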
XSA
Uses XSA4 attention/sequence component
parameters: {"variant":"XSA4"}
SmearGate
Includes SmearGate gating mechanism
parameters: null
BigramHash
Uses BigramHash feature with hashed vocabulary
parameters: {"size":2048}
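A hashed-vocabulary bigram feature with 2048 buckets can be sketched as below; the mixing constants and the pairing of the first position with token id 0 are assumptions for illustration, not the PR's actual hash:

```python
BIGRAM_BUCKETS = 2048  # matches the listed size parameter

def bigram_bucket(prev_tok, tok, buckets=BIGRAM_BUCKETS):
    """Hash a (previous, current) token-id pair into a fixed bucket
    (illustrative multiplicative hash, not the PR's actual function)."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % buckets

def bigram_bucket_seq(ids):
    # one hashed-bigram feature per position; position 0 pairs with token 0
    return [bigram_bucket(p, t) for p, t in zip([0] + ids[:-1], ids)]
```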
Partial RoPE
Applies rotary position embeddings only partially
parameters: {"16/64":true}
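Reading the "16/64" parameter as rotary embeddings applied to the first 16 of 64 head dimensions, a minimal sketch:

```python
import math

def partial_rope(x, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of a head vector by
    position-dependent angles; the remaining dims pass through unchanged.
    (The 16-of-64 reading of '16/64' is an assumption.)"""
    out = list(x)
    for i in range(0, rotary_dims, 2):
        theta = pos / base ** (i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and each rotated pair keeps its norm, which is the usual RoPE invariant.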
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"description":"Tight SWA"}
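The EMA decay of 0.997 and a "tight" SWA window both reduce to simple weight-averaging updates; a sketch over flat parameter lists (the SWA window size is not given in the PR):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step with the listed decay of 0.997."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def swa_average(checkpoints):
    """Plain average over a small window of late checkpoints ('tight' SWA;
    the exact window size is an assumption left open here)."""
    n = len(checkpoints)
    return [sum(c[i] for c in checkpoints) / n for i in range(len(checkpoints[0]))]
```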
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
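Sliding-window evaluation with stride 64 scores each token exactly once while giving it the longest context the window allows; a sketch of the window spans (the window length of 1024 is an assumption, only the stride is listed):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_scored) spans: windows advance by `stride`,
    and each window scores only the tokens not covered by the previous
    window, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```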
Test-Time Training
test_time_training
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
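A warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps; the linear-to-zero shape below is an assumption, only the step count is listed:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until the last `warmdown_steps`, then linear decay
    to zero (the linear shape is an illustrative assumption)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```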
Regularization
layerwise LN scale
parameters: null
Other
other
Late-training replay: the last 100 training batches are replayed for 2 epochs at 10% of the base learning rate before EMA finalization
parameters: {"epochs":2,"batches":100,"lr_fraction":0.1}
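The replay step above can be sketched as a plain loop over the retained batches; `train_step` stands in for the caller's update function and is not a name from the PR:

```python
def late_replay(train_step, replay_batches, base_lr,
                epochs=2, lr_fraction=0.1):
    """Replay the retained final batches (100 in the PR) for 2 epochs at
    10% of the base LR, before the EMA weights are finalized."""
    replay_lr = base_lr * lr_fraction
    for _ in range(epochs):
        for batch in replay_batches:
            train_step(batch, lr=replay_lr)
```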

Novel Contributions

  • Late training replay of the last 100 training batches before EMA finalization
  • No test-time training on validation data
  • EMA combined with GPTQ-lite and late-stage replay
  • Sliding-window evaluation with stride 64
  • 2-seed mean reporting for validation BPB