PR #2008 (open)

[Non-Record] 4h Long-Train Scaling: Quantized BPB 1.0449

by Christopher-Lee-McClendon
val_bpb: 1.0449
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,932,638 bytes

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: all); see the quantization sketch below
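
GPTQ here targets a 6-bit grid across all weight matrices. Full GPTQ redistributes rounding error column-by-column using second-order (Hessian) information; the sketch below shows only the simpler round-to-nearest 6-bit grid it quantizes onto, as an illustration (per-channel symmetric scaling is an assumption, not this PR's configuration):

```python
import torch

def quantize_6bit_symmetric(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Per-output-channel symmetric round-to-nearest onto a 6-bit grid.
    # GPTQ uses the same grid but compensates rounding error across the
    # remaining columns via Hessian information; omitted here for brevity.
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)  # integer codes
    return q * scale                                # dequantized weights
```
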
Weight Averaging
  • EMA (parameters: null); see the EMA sketch below
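
EMA maintains a shadow copy of the weights that trails the live weights after each optimizer step. Since the PR reports parameters: null, the decay below is a placeholder, not the run's value:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):  # decay is an assumption
    # Shadow weights drift toward the live weights: s = decay*s + (1-decay)*w.
    for s, w in zip(ema_model.parameters(), model.parameters()):
        s.lerp_(w, 1.0 - decay)

# Usage: ema_model = copy.deepcopy(model); call ema_update(ema_model, model)
# after each training step, then evaluate with ema_model's weights.
```
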
Test-Time Training
  • score-first TTT (parameters: {"phases":3,"prefix_docs":2000})
Architecture
  • U-Net skip connections: U-Net style skip connections in the model architecture (parameters: null)
  • GQA: grouped query attention with fewer KV heads than attention heads (parameters: {"attention_heads":8,"kv_heads":4}); see the GQA sketch after this list
  • Partial RoPE: partial rotary positional embeddings applied to a subset of dimensions (parameters: {"dimensions":16}); see the RoPE sketch after this list
  • depth recurrence: looped recurrence over selected layers (parameters: {"loop_layers":[3,4,5],"num_loops":2}); see the recurrence sketch after this list
  • SmearGate: SmearGate with sparse attention gating (parameters: {"window":12})
  • CaseOps: bijective case transform over the SP8192 vocabulary (parameters: {"vocab":"SP8192"})
  • MLP3x: 4x MLP expansion (parameters: {"expansion":4})
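
For the GQA entry: with 8 attention heads and 4 KV heads, each pair of query heads shares one KV head, halving the KV cache. A minimal sketch of that sharing (shapes and names are illustrative, not this PR's code):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # x: (batch, seq, dim). Each group of n_heads // n_kv_heads query heads
    # shares one KV head, shrinking K/V projections and cache by that factor.
    B, T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)     # (B, 8, T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)  # (B, 4, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Repeat each KV head for the query heads in its group.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                      # (B, 8, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```
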
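For the Partial RoPE entry: only 16 dimensions receive the rotary transform; the rest pass through position-independent. A sketch assuming the rotated dimensions are the leading ones of each head (that split point is an assumption):

```python
import torch

def partial_rope(q: torch.Tensor, rot_dims: int = 16, base: float = 10000.0):
    # q: (batch, heads, seq, head_dim). Rotate only the first rot_dims
    # dimensions; leave the remaining dimensions unchanged.
    B, H, T, D = q.shape
    q_rot, q_pass = q[..., :rot_dims], q[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (T, half), broadcast over B, H
    x1, x2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, q_pass], dim=-1)
```
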
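For the depth recurrence entry: layers 3-5 are applied twice in sequence with shared weights, buying extra effective depth at no parameter cost. A minimal sketch:

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Runs a layer stack, looping over a contiguous span of layers
    num_loops times with shared weights (loop_layers [3,4,5], num_loops 2)."""

    def __init__(self, layers, loop_layers=(3, 4, 5), num_loops=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.start, self.end = min(loop_layers), max(loop_layers)
        self.num_loops = num_loops

    def forward(self, h):
        i = 0
        while i < len(self.layers):
            if i == self.start:
                for _ in range(self.num_loops):      # reuse the same weights
                    for j in range(self.start, self.end + 1):
                        h = self.layers[j](h)
                i = self.end + 1
            else:
                h = self.layers[i](h)
                i += 1
        return h
```
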
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"adam_for_scalars":true}); see the grouping sketch below
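
adam_for_scalars suggests the usual Muon split: Muon orthogonalizes updates for the 2-D weight matrices while Adam handles scalars and vectors. A sketch of the split plus the Newton-Schulz step at Muon's core (coefficients from the public Muon reference implementation; the grouping rule used here is an assumption):

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize the momentum matrix, the core of a Muon
    # update (quintic iteration, coefficients from the reference code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

# Route matrices to Muon, everything else (scalars, gains, biases) to Adam.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
muon_params = [p for p in model.parameters() if p.ndim >= 2]  # weight matrices
adam_params = [p for p in model.parameters() if p.ndim < 2]   # biases, scalars
adam = torch.optim.Adam(adam_params, lr=3e-4)
```
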
Compression
  • per-group lrzip (level: null)
Sequence Length
  • sequence_length (train_length: null, eval_length: null)
LR Schedule
  • warmdown (parameters: null); see the schedule sketch below
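
A warmdown schedule conventionally holds the learning rate flat and then decays it linearly to zero over the final stretch of training; with parameters: null in this PR, the 30% fraction below is a placeholder:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.3) -> float:
    # Constant LR, then linear decay to 0 over the last warmdown_frac of steps.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```
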
Regularization
  • weight decay (parameters: {"embed_wd":0.06}); see the sketch below
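
embed_wd: 0.06 points at weight decay applied specifically to the embedding parameters via optimizer parameter groups; a hypothetical split (whether the other parameters get zero decay is an assumption):

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({"embed": nn.Embedding(8192, 256),
                       "proj": nn.Linear(256, 256)})  # stand-in model
embed, rest = [], []
for name, p in model.named_parameters():
    (embed if name.startswith("embed") else rest).append(p)
opt = torch.optim.AdamW([
    {"params": embed, "weight_decay": 0.06},  # embed_wd from this PR
    {"params": rest,  "weight_decay": 0.0},   # assumption: no decay elsewhere
])
```
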

Novel Contributions

  • 4-hour long-train scaling study showing monotonic BPB improvement over time
  • Quantized 4h model reaches BPB 1.0449, close to the 1h post-TTT result
  • Resumable checkpoint infrastructure with manifest-driven resume (see the sketch after this list)
  • Long-train periodic export and JSON metrics at configurable milestones
  • TTT sweep orchestration framework for controlled variant evaluation
  • Extended launcher supporting duration-hours mode and budget-aware timeouts
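
A hypothetical shape for the manifest-driven resume described above (file names and keys are illustrative, not this PR's actual schema):

```python
import json
import os
import torch

def resume_from_manifest(run_dir: str, model, optimizer) -> int:
    # Hypothetical manifest: run_dir/manifest.json names the newest
    # checkpoint file and records the step to resume from.
    path = os.path.join(run_dir, "manifest.json")
    if not os.path.exists(path):
        return 0                                   # fresh run
    with open(path) as f:
        manifest = json.load(f)
    ckpt = torch.load(os.path.join(run_dir, manifest["latest_checkpoint"]),
                      map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return manifest["step"]                        # resume training here
```
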