PR #1979
[Non-record] Long-Train Artifact Scaling: post-TTT BPB = 1.0399, artifact size constant across 10–60 min
by Christopher-Lee-McClendon
val_bpb
1.0399
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,944,203 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: weights
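
Only bits=6 and scope=weights are listed; no group size or calibration details are given. Below is a minimal sketch of per-group 6-bit weight quantization, assuming a hypothetical group size of 128 and plain round-to-nearest (full GPTQ additionally applies Hessian-weighted error compensation during calibration):

```python
import torch

def quantize_weights_6bit(w: torch.Tensor, group_size: int = 128):
    """Per-group symmetric 6-bit round-to-nearest quantization of a 2-D weight.

    This shows only the quantize/dequantize arithmetic; full GPTQ additionally
    applies Hessian-weighted error compensation column by column.
    group_size=128 is a hypothetical choice (the PR does not list one).
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    qmax = 2 ** (6 - 1) - 1                       # symmetric signed range [-32, 31]
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    w_hat = (q * scale).reshape(out_features, in_features)
    return q.to(torch.int8), scale, w_hat

# Example: quantize one tensor and check the reconstruction error.
w = torch.randn(256, 512)
q, scale, w_hat = quantize_weights_6bit(w)
print((w - w_hat).abs().mean())
```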
Compression
lrzip
level: null
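
The level field is null, so a minimal sketch of the compression step just calls the lrzip CLI with its defaults and reports the compressed size in bytes (the quantity tracked across checkpoints); the path handling is an assumption:

```python
import os
import subprocess

def compress_and_measure(path: str) -> int:
    """Compress a serialized checkpoint with lrzip defaults and return the
    compressed artifact size in bytes."""
    subprocess.run(["lrzip", path], check=True)   # writes <path>.lrz next to the input
    return os.path.getsize(path + ".lrz")
```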
Architecture
U-Net skip connections
U-Net-style skip connections in the model architecture
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
depth recurrence
Looped layers for recurrent depth processing
parameters: {"layers":[3,5],"num_loops":2}
SmearGate
SmearGate with sparse attention gating
parameters: {"window":12}
CaseOps
Bijective case transform
parameters: {"alphabet_size":8192}
LQER
Asymmetric INT2/INT4 low-rank correction on top tensors
parameters: {"rank":4,"top_k":3,"group":64}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalars":true}
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000,"warm_start_a":true}
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
weight decay
parameters: {"embed_wd":0.06}
Novel Contributions
- Measured artifact size at 10-, 20-, 30-, 45-, and ~60-minute training checkpoints
- Showed that compressed artifact size stays essentially constant across longer training
- Demonstrated that BPB keeps improving with longer training while the compressed artifact size reaches its entropy floor early
- Extended the PR #1950 recipe with a non-record long-train scaling experiment
- Used synchronized checkpoint export to avoid distributed rank desynchronization during serialization pauses
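
A minimal sketch of the synchronized export: every rank hits a barrier before and after rank 0 serializes, so no rank runs ahead into the next training step during the save pause. It assumes torch.distributed is already initialized and the model state is replicated (DDP-style), so only rank 0 needs to write:

```python
import torch
import torch.distributed as dist

def synchronized_checkpoint_export(model, path: str) -> None:
    """Export a checkpoint without letting ranks drift out of sync."""
    dist.barrier()                      # all ranks reach the export point together
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()                      # nobody resumes until the file is written
```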