PR #1662

closed

Record: SP8192 + 3-layer recurrence + hard onset — val_bpb 1.0862 (3-seed mean)

by pablinga19
val_bpb: 1.0862
Architecture: Transformer
Optimizer:
Artifact Size: 15.94 MB

Training Techniques

Architecture: depth recurrence
3-layer recurrence stack applied on layers 3-5 with hard onset activation
parameters: {"layers":[3,4,5],"onset_step":3000}
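The recurrence-with-hard-onset idea can be sketched in a few lines. This is a hypothetical illustration, not the PR's code: `layers` is any list of callables, the block at indices 3-5 is run one extra time per forward pass, and that extra pass switches on abruptly at `onset_step` (the "hard onset").

```python
def recurrent_forward(x, layers, recur_indices=(3, 4, 5), onset_step=3000, step=0):
    """Run a layer stack; after the hard onset step, apply the
    recurrence block (here layers 3-5) one additional time.
    Illustrative sketch, not the PR's implementation."""
    for i, layer in enumerate(layers):
        x = layer(x)
        # Hard onset: the extra recurrence pass turns on all at once
        # at onset_step instead of ramping in gradually.
        if i == recur_indices[-1] and step >= onset_step:
            for j in recur_indices:
                x = layers[j](x)
    return x
```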
Quantization: GPTQ
bits: 6
scope: all
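Full GPTQ propagates rounding error across weight columns using second-order information; as a rough sketch of what int6 quantization involves, here is plain round-to-nearest symmetric quantization to the signed 6-bit range [-32, 31] (an illustration of the scale-and-round step only, not GPTQ proper):

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest int6 quantization of one weight group.
    GPTQ additionally compensates rounding error column-by-column;
    this sketch shows only the scale-and-round step."""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    return [max(-32, min(31, round(w / scale))) for w in weights], scale

def dequantize_int6(q, scale):
    """Map int6 codes back to approximate float weights."""
    return [v * scale for v in q]
```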
Compression: brotli
level: null
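brotli is not in the Python standard library, so this sketch uses zlib for the compress step; swapping in `brotli.compress` from the `brotli` package gives the PR's setup. It shows the pack-then-compress flow for an int6 artifact, storing one value per byte for clarity (no bit-packing):

```python
import zlib  # stand-in for brotli, which the PR uses for the artifact

def pack_and_compress(q_values):
    """Shift signed int6 values [-32, 31] into [0, 63], store one per
    byte, and compress the resulting artifact bytes."""
    raw = bytes((v + 32) & 0x3F for v in q_values)
    return zlib.compress(raw, level=9)

def decompress_and_unpack(blob):
    """Invert pack_and_compress: decompress, then shift back to signed."""
    return [b - 32 for b in zlib.decompress(blob)]
```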
Weight Averaging: EMA
parameters: {"decay":0.9965}
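EMA with decay 0.9965 keeps a slow-moving shadow copy of the weights, updated after each optimizer step. A minimal sketch with plain floats (real code applies this per tensor):

```python
def ema_update(shadow, params, decay=0.9965):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    With decay 0.9965 the shadow averages over roughly the last
    1 / (1 - 0.9965) ~ 286 steps."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```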
Regularization: logit softcap
parameters: {"sdclip":12.85}
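The usual form of logit soft-capping squashes logits smoothly into (-cap, cap) via cap * tanh(x / cap); interpreting the PR's sdclip parameter as that cap value is an assumption here:

```python
import math

def softcap(logits, cap=12.85):
    """Smoothly bound logits to (-cap, cap): near-identity for small
    values, saturating at +/-cap for large ones. Treats the PR's
    "sdclip" parameter as the cap (an interpretation, not confirmed)."""
    return [cap * math.tanh(x / cap) for x in logits]
```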
Other: SP8192 tokenization / sequence packing setup
parameters: {"sp":8192}
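Sequence packing at context length 8192 concatenates tokenized documents into one stream and slices it into fixed-length rows, so no training step wastes compute on padding. The PR does not spell out the SP8192 details; this is a generic sketch with a hypothetical end-of-text token id:

```python
def pack_sequences(docs, seq_len=8192, eot=0):
    """Concatenate token lists separated by an end-of-text token and
    slice into rows of exactly seq_len tokens (ragged tail dropped).
    Generic illustration; eot=0 is a placeholder id."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot)
    n_rows = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]
```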
Other: Removed mid-training validation passes to increase realized training steps within the fixed wallclock budget
parameters: {"val_loss_every":99999}
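Setting val_loss_every to 99999 effectively disables mid-training validation, since the gate never fires within the run's step budget. A hypothetical helper mirroring that config knob:

```python
def should_validate(step, val_loss_every=99999):
    """Gate for mid-training validation passes; with val_loss_every set
    far beyond the total step count, the gate never fires and all
    wallclock goes to optimizer steps."""
    return step > 0 and step % val_loss_every == 0
```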

Novel Contributions

  • Hard onset at step 3000 for the 3-layer recurrence stack
  • 3-layer recurrence on layers 3-5
  • SP8192 tokenization / sequence packing setup
  • GPTQ int6 with brotli artifact compression
  • EMA 0.9965 with SDClip 12.85
  • Skipping mid-training validation passes to maximize training steps under the time budget