PR #1663

open

Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.08625 (3-seed mean)

by pablinga19 · View on GitHub
val_bpb
1.0862
Architecture
Transformer
Optimizer
Artifact Size
15,998,501 bytes

Training Techniques

Architecture
depth recurrence
Uses a 3-layer recurrence stack in the model.
parameters: {"layers":3}
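A minimal sketch of what a depth-recurrence stack with `layers=3` does: one weight-tied block is applied three times, so effective depth triples while the parameter count stays fixed. The residual MLP block and all names below are illustrative stand-ins, not the record's actual model code.

```python
import numpy as np

def recurrent_stack(x, w1, b1, w2, b2, n_recur=3):
    """Run one weight-tied residual MLP block n_recur times (depth recurrence).

    Reusing the same weights on every pass adds effective depth without
    adding parameters. Illustrative stand-in for a transformer block.
    """
    for _ in range(n_recur):
        h = np.maximum(x @ w1 + b1, 0.0)  # shared hidden projection, ReLU
        x = x + h @ w2 + b2               # residual add, then recur
    return x
```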
Other
other
Hard recurrence onset at step 3000 rather than a smooth recurrence homotopy, delaying recurrence activation so that more of the fixed time budget goes to non-recurrent training.
parameters: {"recur_start_step":3000,"recur_homotopy":0}
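The hard onset can be sketched as a step-function gate on the recurrent branch, with the smooth homotopy as the baseline it replaces. Parameter names mirror the config above; the linear shape of the homotopy ramp is an assumption.

```python
def recurrence_gate(step, recur_start_step=3000, recur_homotopy=0):
    """Gate on the recurrent branch at a given training step.

    recur_homotopy == 0 is the hard onset used here: recurrence is fully
    off before recur_start_step and fully on from it, so every earlier
    step trains the non-recurrent path. recur_homotopy > 0 instead ramps
    the gate linearly over that many steps (the smooth baseline); the
    linear ramp shape is an assumption, not taken from the record.
    """
    if recur_homotopy == 0:
        return 1.0 if step >= recur_start_step else 0.0
    return min(1.0, max(0.0, (step - recur_start_step) / recur_homotopy))
```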
other
Suppresses mid-training validation passes to increase realized training steps within the fixed 600-second budget.
parameters: {"val_loss_every":99999}
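The effect of the oversized interval is easy to see in a sketch: with `val_loss_every=99999` and only a few thousand total steps, the modulo trigger never fires, so no training-time wall clock goes to validation. The total step count below is illustrative.

```python
def count_validation_passes(total_steps, val_loss_every):
    """Count mid-training validation passes a run would perform.

    Setting val_loss_every far above total_steps drives this to zero,
    freeing that wall-clock time for extra optimizer steps inside the
    fixed 600-second budget; final evaluation still runs afterwards.
    """
    return sum(1 for step in range(1, total_steps + 1)
               if step % val_loss_every == 0)
```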
Evaluation
sliding window eval
parameters: null
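The record does not spell out the evaluation, but a common reading of sliding-window eval is overlapping context windows where each token is scored exactly once with substantial preceding context. A sketch of the index plan under that reading; window and stride sizes are illustrative.

```python
def sliding_windows(n_tokens, window, stride):
    """Plan (window_start, window_end, score_from) spans for sliding eval.

    Each context window of length `window` advances by `stride` tokens,
    and only tokens not yet scored by an earlier window are scored, so
    every token is predicted exactly once, most with at least
    window - stride tokens of context. An assumed scheme, not the
    record's actual evaluation code.
    """
    assert 1 <= stride <= window
    spans = []
    start, scored_upto = 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```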
Weight Averaging
EMA
parameters: null
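A minimal sketch of the usual exponential-moving-average weight update; the averaged copy, not the raw training weights, would be the evaluated artifact. The decay constant is illustrative, as the record does not state it.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights.

    Called after each optimizer step; the decay value here is an
    assumption, not taken from the record.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```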
Quantization
int6
bits: 6
scope: sliding eval artifact
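A sketch of symmetric round-to-nearest 6-bit quantization of a weight tensor into the signed range [-32, 31]. Per-tensor symmetric scaling is an assumption; the record only states that the sliding-eval artifact is int6.

```python
import numpy as np

def quantize_int6(w):
    """Map weights to the 2**6 signed levels [-32, 31] via one scale.

    The scale is stored next to the codes so the artifact can be
    dequantized at load time. Per-tensor scaling is assumed here.
    """
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return codes, scale

def dequantize_int6(codes, scale):
    """Reconstruct approximate float weights from int6 codes."""
    return codes.astype(np.float32) * scale
```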

Novel Contributions

  • Hard activation of recurrence at step 3000 instead of smooth onset
  • 3-layer recurrence stack while keeping the training stack from PR #1394 fixed
  • Use of VAL_LOSS_EVERY=99999 to avoid mid-training validation and gain more training steps under the time budget
  • 3-seed mean sliding val_bpb of 1.08625