PR #1660 (closed)

Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.0858

by pablinga19
val_bpb: 1.0858
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.86 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence: layers 3, 4, and 5 are repeated as virtual layers, activated via a hard gate
parameters: {"layers":[3,4,5],"onset_step":3000}
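
A minimal sketch of how the recorded scheme could look, assuming layers are plain callables and a hard 0/1 gate at `onset_step` (the record gives only the layer indices and the onset step; the loop structure here is an illustration, not the submission's code):

```python
# Hedged sketch of depth recurrence with a hard onset gate (not the author's code):
# layers 3, 4, 5 are run a second time as weight-tied "virtual" layers, but only
# once the training step reaches `onset_step` (hard 0/1 gate, no soft blending assumed).

def forward(x, layers, step, recur=(3, 4, 5), onset_step=3000):
    """Apply `layers` in order; replay the `recur` indices once if step >= onset_step."""
    for layer in layers:
        x = layer(x)
    if step >= onset_step:      # hard onset: gate flips from 0 to 1 at step 3000
        for i in recur:         # weight-tied second pass over layers 3..5
            x = layers[i](x)
    return x

# toy usage: "layers" that add their own index, to make the replay visible
layers = [lambda x, i=i: x + i for i in range(6)]
print(forward(0, layers, step=0))     # 0+1+2+3+4+5 = 15, recurrence inactive
print(forward(0, layers, step=3000))  # 15 + (3+4+5) = 27, recurrence active
```
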
Activation
LeakyReLU squared activation
parameters: {"slope":0.5}
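
One plausible reading of "LeakyReLU squared" with the recorded slope (0.5) is the square of a standard LeakyReLU; the record does not specify how the negative branch's sign is handled, so treat this as an illustration rather than the submission's exact definition:

```python
# Hedged sketch of a LeakyReLU-squared activation, slope 0.5 (assumed form:
# square of the LeakyReLU output; sign handling on the negative branch is a guess).

def leaky_relu_sq(x, slope=0.5):
    l = x if x > 0.0 else slope * x   # standard LeakyReLU
    return l * l                      # then square

print(leaky_relu_sq(2.0))   # 4.0
print(leaky_relu_sq(-2.0))  # (0.5 * -2)^2 = 1.0
```
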
Weight Averaging
EMA
parameters: {"decay":0.9965}
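
The EMA update with the recorded decay can be sketched as follows (standard formulation; the record gives only the decay value, so initialization and export details are assumptions):

```python
# Hedged sketch of EMA weight averaging with the recorded decay (0.9965).
# After each optimizer step: ema = decay * ema + (1 - decay) * weights;
# typically the EMA copy, not the raw weights, is what gets evaluated/exported.

def ema_update(ema, weights, decay=0.9965):
    for k in ema:
        ema[k] = decay * ema[k] + (1.0 - decay) * weights[k]

weights = {"w": 1.0}
ema = dict(weights)        # initialize EMA from the current weights (assumed)
weights["w"] = 2.0         # pretend an optimizer step changed the weight
ema_update(ema, weights)
print(ema["w"])            # 0.9965*1.0 + 0.0035*2.0 = 1.0035
```
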
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
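
The storage side of the int6-plus-compression pipeline can be sketched as below. GPTQ's Hessian-aware, error-compensating rounding is NOT reproduced here; this only shows naive symmetric 6-bit quantization, packing four 6-bit codes into three bytes, and byte-level compression. The record compresses with brotli; zlib is used only as a stdlib stand-in:

```python
import zlib

# Hedged sketch of 6-bit weight storage (GPTQ's rounding procedure omitted).

def quantize_int6(values):
    # symmetric per-tensor scale mapping the max magnitude to 31 (assumed range -31..31)
    scale = max(abs(v) for v in values) / 31.0
    codes = [max(-31, min(31, round(v / scale))) for v in values]
    return scale, codes

def pack6(codes):
    # offset codes to 0..63 so each fits in 6 bits, then pack 4 codes -> 3 bytes
    bits, nbits = 0, 0
    out = bytearray()
    for c in codes:
        bits = (bits << 6) | (c + 32)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                                  # pad the final partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

vals = [0.5, -1.0, 0.25, 1.0]
scale, codes = quantize_int6(vals)
packed = pack6(codes)                  # 4 codes * 6 bits = 3 bytes
blob = zlib.compress(packed)           # brotli.compress(packed) in the record's setup
print(len(packed), [round(c * scale, 3) for c in codes])
```
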
Evaluation
sliding window eval
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: null
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.72}
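
The schedule above can be sketched as a per-step LR multiplier. The record gives only `warmup_steps=20` and `warmdown_frac=0.72`; the linear shapes of the warmup and warmdown ramps are an assumption:

```python
# Hedged sketch of a warmup/plateau/warmdown LR schedule: linear warmup for
# 20 steps, constant LR, then linear decay to zero over the final 72% of
# training ("warmdown"). Ramp shapes are assumed linear.

def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.72):
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmup_steps:
        return (step + 1) / warmup_steps                            # linear warmup
    if step < warmdown_start:
        return 1.0                                                  # constant plateau
    return (total_steps - step) / (total_steps - warmdown_start)    # linear warmdown

total = 1000
print(lr_scale(0, total), lr_scale(100, total), lr_scale(500, total))
```
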

Novel Contributions

  • 3-layer recurrence over layers 3, 4, and 5 with hard onset at step 3000
  • Elimination of validation pauses to reclaim wallclock for additional training steps
  • GPTQ int6 quantization with brotli compression under the 16MB artifact limit
  • No test-time training
  • EMA with decay 0.9965 and Muon optimizer configuration