PR #1662
closedRecord: SP8192 + 3-layer recurrence + hard onset — val_bpb 1.0862 (3-seed mean)
by pablinga19
val_bpb
1.0862
Architecture
Transformer
Optimizer
—
Artifact Size
15.94 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence stack applied to layers 3-5, activated by a hard onset at step 3000
parameters: {"layers":[3,4,5],"onset_step":3000}
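A minimal sketch of what "depth recurrence with hard onset" could look like, assuming the recurrence re-applies layers 3-5 once per forward pass after the onset step (the PR does not specify the number of recurrence iterations; names like `RECUR_LAYERS` and the toy layers are illustrative only):

```python
# Hypothetical sketch: layers 3-5 are applied a second time once
# training reaches onset_step; before that, the forward pass is the
# plain stack. "Hard onset" means a 0/1 switch with no ramp-in.

RECUR_LAYERS = [3, 4, 5]   # from the PR's parameters
ONSET_STEP = 3000

def forward(x, layers, step):
    """Apply the full layer stack; repeat the recurrence block after onset."""
    for layer in layers:
        x = layer(x)
    if step >= ONSET_STEP:          # hard onset: active from step 3000 on
        for i in RECUR_LAYERS:
            x = layers[i](x)
    return x

# Toy "layers": each adds its index, so the recurrence effect is visible.
layers = [lambda x, i=i: x + i for i in range(6)]
print(forward(0, layers, step=0))      # 15 (sum of layer indices 0..5)
print(forward(0, layers, step=3000))   # 27 (15, plus layers 3, 4, 5 again)
```

In a real model the recurrence block would share weights across iterations, which is what keeps the artifact size small while adding effective depth.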
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
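The storage side of int6 quantization can be sketched independently of GPTQ itself (the Hessian-aware rounding) and of the brotli pass: four 6-bit values pack into three bytes before entropy coding. This is an illustrative packing scheme, not the PR's actual serialization format:

```python
# Sketch of 6-bit weight storage: 4 quantized values (0-63) per 3 bytes,
# i.e. 0.75 bytes/weight before the brotli entropy-coding pass.

def pack_int6(vals):
    """Pack 6-bit ints into bytes; len(vals) must be a multiple of 4."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(data):
    """Invert pack_int6."""
    vals = []
    for i in range(0, len(data), 3):
        a, b, c = data[i:i + 3]
        vals += [a >> 2, ((a & 0x3) << 4) | (b >> 4),
                 ((b & 0xF) << 2) | (c >> 6), c & 0x3F]
    return vals

vals = [0, 1, 31, 63, 42, 7, 63, 0]
assert unpack_int6(pack_int6(vals)) == vals
print(len(pack_int6(vals)))  # 6 bytes for 8 values
```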
Weight Averaging
EMA
parameters: {"decay":0.9965}
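The EMA update with the PR's decay of 0.9965 is the standard shadow-weight recurrence; a scalar sketch (real runs keep one shadow tensor per parameter):

```python
# EMA weight averaging: shadow <- decay * shadow + (1 - decay) * params.
# decay = 0.9965 gives an effective averaging window on the order of
# 1 / (1 - decay) ~ 286 steps.

DECAY = 0.9965

def ema_update(shadow, params, decay=DECAY):
    """One EMA step, elementwise over parameter lists."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0]
for step in range(1000):
    shadow = ema_update(shadow, [1.0])   # params held constant at 1.0
print(round(shadow[0], 3))   # approaches 1.0 as 1 - 0.9965**steps
```

Evaluation then uses the shadow weights rather than the raw (noisier) final-step weights.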
Regularization
logit softcap
parameters: {"sdclip":12.85}
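A common form of logit softcapping is the tanh squash used in models like Gemma 2. The entry does not say whether its `sdclip` value of 12.85 is this cap or a separate clipping rule, so 12.85 is used here purely for illustration:

```python
import math

# tanh softcap: smoothly bounds logits to (-cap, cap) while leaving
# small logits nearly unchanged. 12.85 is the PR's "sdclip" value,
# assumed here to play the role of the cap.

CAP = 12.85

def softcap(logit, cap=CAP):
    """Return cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

print(round(softcap(5.0), 3))    # small logits pass nearly unchanged
print(round(softcap(100.0), 3))  # large logits saturate just below the cap
```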
Other
other
SP8192 tokenization / sequence packing setup
parameters: {"sp":8192}
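The entry's "SP8192" is not fully specified (it may refer to a SentencePiece vocabulary size and/or the packed sequence length). As an illustration of the sequence-packing half, here is a sketch that concatenates tokenized documents with an EOS separator and cuts the stream into fixed-length training sequences, using 8192 as the assumed sequence length:

```python
# Hypothetical packing sketch: documents are joined into one token
# stream with an EOS separator, then sliced into seq_len chunks;
# the final partial chunk is dropped.

SEQ_LEN = 8192
EOS = 0   # placeholder EOS token id

def pack_documents(docs, seq_len=SEQ_LEN, eos=EOS):
    """docs: lists of token ids -> list of seq_len-token training chunks."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [[1, 2, 3], [4, 5], [6] * 10]
chunks = pack_documents(docs, seq_len=4)
print(chunks)   # four full 4-token chunks; the 2-token remainder is dropped
```

Packing avoids padding waste, so every position in every batch contributes to the loss.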
other
Removed mid-training validation passes to increase realized training steps within the fixed wall-clock budget
parameters: {"val_loss_every":99999}
Novel Contributions
- Hard onset at step 3000 for the 3-layer recurrence stack
- 3-layer recurrence on layers 3-5
- SP8192 setup
- GPTQ int6 with brotli artifact compression
- EMA 0.9965 with SDClip 12.85
- Skipping mid-training validation passes to maximize training steps under the time budget