PR #1956

open

Record: PR #1908 reproduction with compliant 600s wallclock — val_bpb 1.06044 (3-seed mean)

by AayushBaniya2006
val_bpb
1.0604
Architecture
Transformer
Optimizer
Artifact Size
15,950,342 bytes

Training Techniques

Quantization
GPTQ-lite
bits: null
scope: model weights
mixed int4/int8
bits: null
scope: model weights
Architecture
SmearGate
SmearGate is part of the inherited architecture stack used in the reproduced submission.
parameters: null
Gated Attention
Uses gated attention as part of the model stack.
parameters: null
weight tying
Tied embeddings / weight tying are part of the inherited model setup.
parameters: null
Test-Time Training
full TTT
parameters: {"phased":true,"num_phases":3}
Sequence Length
sequence_length
train_length: 8192
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
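A minimal sketch of how a warmdown fraction of 0.85 could shape the learning rate, assuming a linear decay to zero over the final 85% of training and a progress value in [0, 1] (for a wallclock-controlled run this would plausibly be elapsed time divided by the time budget). The function name, the linear shape, and the definition of progress are assumptions for illustration; the PR summary states only the warmdown_frac value.

```python
def lr_multiplier(progress: float, warmdown_frac: float = 0.85) -> float:
    """Return a learning-rate multiplier in [0, 1] for a given training progress.

    Assumed shape (not confirmed by the PR): constant at 1.0, then a linear
    decay to 0.0 over the last `warmdown_frac` of training.
    """
    # Clamp progress so callers can pass slightly out-of-range values safely.
    progress = min(max(progress, 0.0), 1.0)
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown configured
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0  # still in the constant phase
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at progress == 1.0.
    return (1.0 - progress) / warmdown_frac


if __name__ == "__main__":
    for p in (0.0, 0.15, 0.5, 1.0):
        print(p, lr_multiplier(p))
```

Under these assumptions the rate stays at its peak for only the first 15% of the run and decays for the rest.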
Regularization
logit softcap
parameters: null
weight decay
parameters: {"value":0.5}
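For the logit softcap entry above, a common formulation squashes each logit through a scaled tanh so that its magnitude never exceeds a chosen cap. The sketch below assumes that tanh form and uses a hypothetical cap of 30.0; the PR summary lists no softcap parameters, so neither the form nor the value is confirmed by the source.

```python
import math


def softcap(logit: float, cap: float) -> float:
    """Softly bound a logit to [-cap, cap] using cap * tanh(logit / cap).

    Assumed formulation; near zero it is close to the identity, and for
    large magnitudes it saturates at +/- cap.
    """
    if cap <= 0.0:
        raise ValueError("cap must be positive")
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    hypothetical_cap = 30.0  # illustrative value only, not taken from the PR
    for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
        print(x, softcap(x, hypothetical_cap))
```

Small logits pass through almost unchanged, while extreme values are bounded, which is why the technique is usually grouped with regularization.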
Other
other
Training stops organically under wallclock control, within the 600s cap, rather than at a forced fixed stop step.
parameters: {"max_wallclock_seconds":600,"force_stop_step":null}
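The wallclock-controlled stop can be sketched as a loop that checks elapsed time before each step instead of comparing against a fixed step count. This is an illustrative sketch only: `train_until_wallclock`, `step_fn`, `clock`, and `safety_margin_seconds` are hypothetical names, not identifiers from the submission.

```python
import time
from typing import Callable


def train_until_wallclock(
    step_fn: Callable[[int], None],
    max_wallclock_seconds: float = 600.0,
    clock: Callable[[], float] = time.perf_counter,
    safety_margin_seconds: float = 0.0,
) -> int:
    """Run step_fn(step) repeatedly until the time budget is used up.

    Returns the number of completed steps. No fixed stop step is involved;
    the step count depends on how fast each step runs.
    """
    start = clock()
    steps = 0
    while True:
        elapsed = clock() - start
        # Stop before starting a step once the budget (minus margin) is spent.
        if elapsed >= max_wallclock_seconds - safety_margin_seconds:
            break
        step_fn(steps)
        steps += 1
    return steps


if __name__ == "__main__":
    # Tiny demo with a 0.05 s budget and a step that sleeps for 10 ms.
    done = train_until_wallclock(lambda s: time.sleep(0.01), max_wallclock_seconds=0.05)
    print("completed steps:", done)
```

Because the check happens before each step, a step that begins just under the budget can finish slightly over it; a nonzero safety margin of roughly one step's duration is one way to stay under a hard cap.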

Novel Contributions

  • Reproduces PR #1908's stack under compliant 600-second wallclock control
  • Demonstrates the same recipe can achieve the record while staying under the training cap
  • Removes FORCE_STOP_STEP and relies on organic wallclock stopping
  • Reports a 3-seed mean val_bpb of 1.06043952 with all seeds under the artifact and time limits