PR #1979
[Non-record] Long-Train Artifact Scaling: post-TTT BPB = 1.0399, artifact size constant across 10–60 min
by Christopher-Lee-McClendon
val_bpb
1.0399
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,944,203 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
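
Only bits=6 and scope=weights are listed; no group size or calibration details are given. Below is a minimal sketch of per-group 6-bit weight quantization, assuming a hypothetical group size of 128 and plain round-to-nearest (full GPTQ additionally applies Hessian-weighted error compensation during calibration):

```python
import torch

def quantize_weights_6bit(w: torch.Tensor, group_size: int = 128):
    """Per-group symmetric 6-bit round-to-nearest quantization of a 2-D weight.

    This shows only the quantize/dequantize arithmetic; full GPTQ additionally
    applies Hessian-weighted error compensation column by column.
    group_size=128 is a hypothetical choice (the PR does not list one).
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    qmax = 2 ** (6 - 1) - 1                       # symmetric signed range [-32, 31]
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    w_hat = (q * scale).reshape(out_features, in_features)
    return q.to(torch.int8), scale, w_hat

# Example: quantize one tensor and check the reconstruction error.
w = torch.randn(256, 512)
q, scale, w_hat = quantize_weights_6bit(w)
print((w - w_hat).abs().mean())
```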
Compression
lrzip
level: null
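
The level field is null, so a minimal sketch of the compression step just calls the lrzip CLI with its defaults and reports the compressed size in bytes (the quantity tracked across checkpoints); the path handling is an assumption:

```python
import os
import subprocess

def compress_and_measure(path: str) -> int:
    """Compress a serialized checkpoint with lrzip defaults and return the
    compressed artifact size in bytes."""
    subprocess.run(["lrzip", path], check=True)   # writes <path>.lrz next to the input
    return os.path.getsize(path + ".lrz")
```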
Architecture
U-Net skip connections
U-Net-style skip connections in the model architecture
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
depth recurrence
Looped layers for recurrent depth processing
parameters: {"layers":[3,5],"num_loops":2}
SmearGate
SmearGate with sparse attention gating
parameters: {"window":12}
CaseOps
Bijective case transform
parameters: {"alphabet_size":8192}
LQER
Asymmetric INT2/INT4 low-rank correction on top tensors
parameters: {"rank":4,"top_k":3,"group":64}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalars":true}
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"warm_start_a":true}
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
weight decay
parameters: {"embed_wd":0.06}
Novel Contributions
- Measured artifact size at 10-, 20-, 30-, 45-, and ~60-minute training checkpoints
- Showed that compressed artifact size stays essentially constant across longer training
- Demonstrated that BPB keeps improving with longer training while the compressed artifact size reaches its entropy floor early
- Extended the PR #1950 recipe with a non-record long-train scaling experiment
- Used synchronized checkpoint export to avoid distributed rank desynchronization during serialization pauses
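
A minimal sketch of the synchronized export: every rank hits a barrier before and after rank 0 serializes, so no rank runs ahead into the next training step during the save pause. It assumes torch.distributed is already initialized and the model state is replicated (DDP-style), so only rank 0 needs to write:

```python
import torch
import torch.distributed as dist

def synchronized_checkpoint_export(model, path: str) -> None:
    """Export a checkpoint without letting ranks drift out of sync."""
    dist.barrier()                      # all ranks reach the export point together
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()                      # nobody resumes until the file is written
```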