PR #901 (open)

record: 10L d496 WarmDown3500 SWA — val_bpb 1.1590 (1xH100 proxy)

by Hilo-HiloView on GitHub
val_bpb: 1.1590
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15.94 MB

Training Techniques

Weight Averaging: SWA (parameters: {"start_frac": 0.4, "every": 50})
LR Schedule: warmdown (parameters: {"warmdown_steps": 3500})
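A "warmdown" schedule with warmdown_steps=3500 holds the base learning rate and then decays it over the final 3500 steps. A sketch assuming linear decay to zero (the exact shape in train_gpt.py, e.g. any warmup phase, is not specified here):

```python
# Hedged sketch of a linear warmdown LR schedule (warmdown_steps=3500):
# constant LR until the last 3500 steps, then linear decay to zero.

def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr                       # flat phase
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 over the warmdown
    return base_lr * frac

print(warmdown_lr(0, 10000, 0.02))     # 0.02 (before warmdown)
print(warmdown_lr(8250, 10000, 0.02))  # 0.01 (halfway through warmdown)
```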
Evaluation: stride-based eval (parameters: {"stride": 64})
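Stride-based eval with stride 64 typically means sliding an evaluation window forward 64 tokens at a time, scoring only the tokens not yet covered so each gets as much left context as the window allows. A sketch of the window bookkeeping, with an assumed window length of 256 (the actual context length is not stated):

```python
# Hedged sketch of strided evaluation ({"stride": 64}): windows advance by
# `stride` tokens; in each window only the tokens past the previous window's
# end are scored, the rest serve as context. Window size 256 is illustrative.

def eval_windows(n_tokens, window=256, stride=64):
    """Return (start, end, n_scored) spans covering all n_tokens tokens."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - max(prev_end, begin)  # tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans

print(eval_windows(300))  # [(0, 256, 256), (64, 300, 44)]
```

Every token is scored exactly once: the per-window `n_scored` counts sum to `n_tokens`.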
Test-Time Training: TTT (parameters: null)
Quantization: int6 (bits: 6, scope: model)
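With bits=6 and scope=model, a natural reading is symmetric 6-bit quantization with one scale shared across the whole model. A sketch under that assumption (the submission's actual rounding, scale granularity, and bit-packing are not specified):

```python
# Hedged sketch of symmetric int6 quantization (bits=6, scope=model):
# one global scale, values rounded and clamped to the signed range [-31, 31].

def quantize_int6(values):
    qmax = 2 ** (6 - 1) - 1  # 31 for signed 6-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # global scale
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

q, s = quantize_int6([-3.1, 0.0, 1.0, 3.1])
print(q)  # [-31, 0, 10, 31]
```

Dequantizing recovers the extremes exactly and the rest to within half a quantization step.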
Compression: zlib (level: null)
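The zlib stage presumably determines the 15.94 MB artifact size against the 16 MB limit. A sketch of that size check; since level is reported as null, zlib's default level (-1, equivalent to level 6) is assumed, and the payload here is a toy stand-in for the serialized weights:

```python
# Hedged sketch of the artifact size check: deflate the serialized weights
# with zlib (default level assumed, since level is null) and compare the
# compressed byte count against the 16 MB limit.
import zlib

LIMIT_BYTES = 16 * 1024 * 1024  # 16 MB artifact limit

def compressed_size(raw: bytes, level: int = -1) -> int:
    return len(zlib.compress(raw, level))

payload = bytes(1000) + b"weights" * 1000  # stand-in for serialized weights
print(compressed_size(payload) <= LIMIT_BYTES)  # True for this toy payload
```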
Sequence Length: sequence_length (train_length: null, eval_length: null)

Novel Contributions

  • Environment-only tuning of the stock train_gpt.py, with no code changes
  • Reduced model dimension to 496 to fit under the 16MB artifact limit
  • Extended warmdown schedule to 3500 iterations
  • Used SWA with a 0.4 start fraction and 50-step averaging interval
  • Disabled TTT to keep evaluation fast
  • Reported a 1xH100 proxy result for an unverified 8xH100 configuration