PR #2008 (open)

[Non-Record] 4h Long-Train Scaling: Quantized BPB 1.0449

by Christopher-Lee-McClendon
val_bpb: 1.0449
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,932,638 bytes

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: all); see the quantization sketch below
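
GPTQ here targets a 6-bit grid across all weight matrices. Full GPTQ redistributes rounding error column-by-column using second-order (Hessian) information; the sketch below shows only the simpler round-to-nearest 6-bit grid it quantizes onto, as an illustration (per-channel symmetric scaling is an assumption, not this PR's configuration):

```python
import torch

def quantize_6bit_symmetric(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Per-output-channel symmetric round-to-nearest onto a 6-bit grid.
    # GPTQ uses the same grid but compensates rounding error across the
    # remaining columns via Hessian information; omitted here for brevity.
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)  # integer codes
    return q * scale                                # dequantized weights
```
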
Weight Averaging
  • EMA (parameters: null); see the EMA sketch below
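
EMA maintains a shadow copy of the weights that trails the live weights after each optimizer step. Since the PR reports parameters: null, the decay below is a placeholder, not the run's value:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):  # decay is an assumption
    # Shadow weights drift toward the live weights: s = decay*s + (1-decay)*w.
    for s, w in zip(ema_model.parameters(), model.parameters()):
        s.lerp_(w, 1.0 - decay)

# Usage: ema_model = copy.deepcopy(model); call ema_update(ema_model, model)
# after each training step, then evaluate with ema_model's weights.
```
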
Test-Time Training
  • score-first TTT (parameters: {"phases":3,"prefix_docs":2000})
Architecture
  • U-Net skip connections: U-Net style skip connections in the model architecture (parameters: null)
  • GQA: grouped query attention with fewer KV heads than attention heads (parameters: {"attention_heads":8,"kv_heads":4}); see the GQA sketch after this list
  • Partial RoPE: partial rotary positional embeddings applied to a subset of dimensions (parameters: {"dimensions":16}); see the RoPE sketch after this list
  • depth recurrence: looped recurrence over selected layers (parameters: {"loop_layers":[3,4,5],"num_loops":2}); see the recurrence sketch after this list
  • SmearGate: SmearGate with sparse attention gating (parameters: {"window":12})
  • CaseOps: bijective case transform over the SP8192 vocabulary (parameters: {"vocab":"SP8192"})
  • MLP3x: 4x MLP expansion (parameters: {"expansion":4})
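
For the GQA entry: with 8 attention heads and 4 KV heads, each pair of query heads shares one KV head, halving the KV cache. A minimal sketch of that sharing (shapes and names are illustrative, not this PR's code):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # x: (batch, seq, dim). Each group of n_heads // n_kv_heads query heads
    # shares one KV head, shrinking K/V projections and cache by that factor.
    B, T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)     # (B, 8, T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)  # (B, 4, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Repeat each KV head for the query heads in its group.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                      # (B, 8, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```
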
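For the Partial RoPE entry: only 16 dimensions receive the rotary transform; the rest pass through position-independent. A sketch assuming the rotated dimensions are the leading ones of each head (that split point is an assumption):

```python
import torch

def partial_rope(q: torch.Tensor, rot_dims: int = 16, base: float = 10000.0):
    # q: (batch, heads, seq, head_dim). Rotate only the first rot_dims
    # dimensions; leave the remaining dimensions unchanged.
    B, H, T, D = q.shape
    q_rot, q_pass = q[..., :rot_dims], q[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (T, half), broadcast over B, H
    x1, x2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, q_pass], dim=-1)
```
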
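For the depth recurrence entry: layers 3-5 are applied twice in sequence with shared weights, buying extra effective depth at no parameter cost. A minimal sketch:

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Runs a layer stack, looping over a contiguous span of layers
    num_loops times with shared weights (loop_layers [3,4,5], num_loops 2)."""

    def __init__(self, layers, loop_layers=(3, 4, 5), num_loops=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.start, self.end = min(loop_layers), max(loop_layers)
        self.num_loops = num_loops

    def forward(self, h):
        i = 0
        while i < len(self.layers):
            if i == self.start:
                for _ in range(self.num_loops):      # reuse the same weights
                    for j in range(self.start, self.end + 1):
                        h = self.layers[j](h)
                i = self.end + 1
            else:
                h = self.layers[i](h)
                i += 1
        return h
```
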
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"adam_for_scalars":true}); see the grouping sketch below
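
adam_for_scalars suggests the usual Muon split: Muon orthogonalizes updates for the 2-D weight matrices while Adam handles scalars and vectors. A sketch of the split plus the Newton-Schulz step at Muon's core (coefficients from the public Muon reference implementation; the grouping rule used here is an assumption):

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize the momentum matrix, the core of a Muon
    # update (quintic iteration, coefficients from the reference code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

# Route matrices to Muon, everything else (scalars, gains, biases) to Adam.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
muon_params = [p for p in model.parameters() if p.ndim >= 2]  # weight matrices
adam_params = [p for p in model.parameters() if p.ndim < 2]   # biases, scalars
adam = torch.optim.Adam(adam_params, lr=3e-4)
```
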
Compression
  • per-group lrzip (level: null)
Sequence Length
  • sequence_length (train_length: null, eval_length: null)
LR Schedule
  • warmdown (parameters: null); see the schedule sketch below
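
A warmdown schedule conventionally holds the learning rate flat and then decays it linearly to zero over the final stretch of training; with parameters: null in this PR, the 30% fraction below is a placeholder:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.3) -> float:
    # Constant LR, then linear decay to 0 over the last warmdown_frac of steps.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```
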
Regularization
  • weight decay (parameters: {"embed_wd":0.06}); see the sketch below
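
embed_wd: 0.06 points at weight decay applied specifically to the embedding parameters via optimizer parameter groups; a hypothetical split (whether the other parameters get zero decay is an assumption):

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({"embed": nn.Embedding(8192, 256),
                       "proj": nn.Linear(256, 256)})  # stand-in model
embed, rest = [], []
for name, p in model.named_parameters():
    (embed if name.startswith("embed") else rest).append(p)
opt = torch.optim.AdamW([
    {"params": embed, "weight_decay": 0.06},  # embed_wd from this PR
    {"params": rest,  "weight_decay": 0.0},   # assumption: no decay elsewhere
])
```
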

Novel Contributions

  • 4-hour long-train scaling study showing monotonic BPB improvement over time
  • Quantized 4h model reaches BPB 1.0449, close to the 1h post-TTT result
  • Resumable checkpoint infrastructure with manifest-driven resume (see the sketch after this list)
  • Long-train periodic export and JSON metrics at configurable milestones
  • TTT sweep orchestration framework for controlled variant evaluation
  • Extended launcher supporting duration-hours mode and budget-aware timeouts
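
A hypothetical shape for the manifest-driven resume described above (file names and keys are illustrative, not this PR's actual schema):

```python
import json
import os
import torch

def resume_from_manifest(run_dir: str, model, optimizer) -> int:
    # Hypothetical manifest: run_dir/manifest.json names the newest
    # checkpoint file and records the step to resume from.
    path = os.path.join(run_dir, "manifest.json")
    if not os.path.exists(path):
        return 0                                   # fresh run
    with open(path) as f:
        manifest = json.load(f)
    ckpt = torch.load(os.path.join(run_dir, manifest["latest_checkpoint"]),
                      map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return manifest["step"]                        # resume training here
```
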