PR #1763

open

Add non-record SP8192 proxy-stack submission (3-seed)

by gmn0105
val_bpb
1.2459
Architecture
Transformer
Optimizer
Artifact Size
14,111,424

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Weight Averaging
EMA
parameters: null
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Evaluation
sliding window eval
parameters: null
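
The card lists sliding-window evaluation without parameters, so the following is only a sketch of the general idea, not this submission's implementation. It plans overlapping windows so that every token is scored exactly once while seeing as much left context as the window allows. The function name `sliding_windows` and the `window` and `stride` arguments are assumptions made for illustration.

```python
def sliding_windows(n_tokens, window, stride):
    """Plan evaluation windows over a token stream.

    Returns (start, end, score_from) triples. The model would be run on
    tokens[start:end], but only the losses for positions in
    [score_from, end) are counted, so no token is scored twice.
    """
    if window <= 0 or not (0 < stride <= window):
        raise ValueError("need window > 0 and 0 < stride <= window")
    windows = []
    scored_upto = 0  # first position that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride
    return windows


# Hypothetical small example: 10 tokens, window of 4, stride of 2.
plan = sliding_windows(10, 4, 2)
# plan == [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

A smaller stride gives each scored token more context at the cost of more forward passes; with `stride == window` this reduces to plain non-overlapping chunked evaluation.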

Novel Contributions

  • Switching the Track-P proxy stack from SP4096 to SP8192 under the same 10-minute wall-clock budget on 1×H100 improves (lowers) validation BPB.
  • Uses a 5-shard FineWeb screening setup with three seeds to demonstrate reproducible gains.
  • Reports both pre-quantization post-EMA validation BPB and final int6 sliding-window BPB.
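
For readers unfamiliar with the metric, the sketch below shows how a bits-per-byte figure is commonly derived from a summed token-level cross-entropy. It assumes the loss is measured in nats and that the byte count is that of the raw validation text; the function name `bits_per_byte` and both arguments are illustrative and not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count of
    the evaluated text normalizes away the tokenizer's vocabulary size.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the normalization is per byte rather than per token, the metric stays comparable across tokenizers with different vocabulary sizes, which is what makes a comparison such as SP4096 versus SP8192 meaningful.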