val_bpb: 1.2459
Architecture: Transformer
Optimizer: —
Artifact Size: 14,111,424
Training Techniques
Quantization: GPTQ (bits: 6, scope: all)
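GPTQ quantizes each layer to minimize output reconstruction error using second-order (Hessian) information. A minimal sketch of just the symmetric 6-bit grid it quantizes onto — plain round-to-nearest, not the full GPTQ weight-update procedure — might look like:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Symmetric per-row round-to-nearest quantization.

    This is only the quantization grid; real GPTQ additionally updates
    the not-yet-quantized weights in each row to compensate for the
    rounding error, using an approximate Hessian of the layer output.
    """
    qmax = 2 ** (bits - 1) - 1                    # 31 for signed 6-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights

w = np.random.randn(4, 16).astype(np.float32)
w_q = quantize_rtn(w, bits=6)
```

With `scope: all`, every weight matrix would pass through a quantizer of this shape; the per-row scale keeps the rounding error below half a quantization step in each row.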
Weight Averaging: EMA (parameters: null)
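EMA weight averaging keeps a shadow copy of the parameters, updated each step toward the live weights, and evaluates the shadow copy. A framework-free sketch (the `decay` value is an assumption — "parameters: null" implies the run used defaults):

```python
def ema_update(ema: dict, params: dict, decay: float = 0.999) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}

# Plain floats stand in for weight tensors in this sketch.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.9)
```

The shadow weights lag the training weights, smoothing over step-to-step noise; the card's "pre-quantization post-EMA validation BPB" is measured on this averaged copy.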
Sequence Length: train_length: 8192, eval_length: 8192
Evaluation: sliding window eval (parameters: null)
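Sliding-window evaluation scores a long token stream with a fixed-length context window advanced by a stride, so every token is scored exactly once with as much left context as the stride allows. A sketch of the index bookkeeping, assuming a stride of half the 8192-token window ("parameters: null" leaves the actual stride unspecified):

```python
def sliding_windows(n_tokens: int, window: int = 8192, stride: int = 4096):
    """Yield (start, end, first_scored) spans over a token stream.

    Each window sees tokens [start, end) as context, but only tokens
    from first_scored onward contribute to the loss, so the scored
    spans partition the stream with no double counting.
    """
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        # First window scores everything; later windows score only
        # the tokens past the overlap with the previous window.
        first_scored = start if start == 0 else start + (window - stride)
        yield start, end, first_scored
        if end == n_tokens:
            break
        start += stride
```

Summing the per-token losses over the scored spans and dividing by the byte count of the evaluated text gives the sliding-window BPB reported above.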
Novel Contributions
- Switches the Track-P proxy stack from SP4096 to SP8192 under the same 10-minute wall-clock budget on 1×H100, improving validation BPB.
- Uses a 5-shard FineWeb screening setup with three seeds to show the gain is reproducible.
- Reports both the pre-quantization, post-EMA validation BPB and the final int6 sliding-window BPB.
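Both reported numbers are bits per byte, the standard byte-normalized form of validation loss: summed negative log-likelihood in nats, divided by ln 2 to convert to bits and by the byte length of the evaluated text. A one-line sketch of the conversion:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed token-level NLL (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens makes the metric comparable across tokenizers, which is why BPB rather than per-token perplexity is tracked here.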