PR #1660 (closed)

Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.0858

by pablinga19
val_bpb: 1.0858
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.86 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence: layers 3, 4, and 5 are repeated as virtual layers, activated via a hard gate
parameters: {"layers":[3,4,5],"onset_step":3000}
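
A minimal sketch of how the recorded scheme could look, assuming layers are plain callables and a hard 0/1 gate at `onset_step` (the record gives only the layer indices and the onset step; the loop structure here is an illustration, not the submission's code):

```python
# Hedged sketch of depth recurrence with a hard onset gate (not the author's code):
# layers 3, 4, 5 are run a second time as weight-tied "virtual" layers, but only
# once the training step reaches `onset_step` (hard 0/1 gate, no soft blending assumed).

def forward(x, layers, step, recur=(3, 4, 5), onset_step=3000):
    """Apply `layers` in order; replay the `recur` indices once if step >= onset_step."""
    for layer in layers:
        x = layer(x)
    if step >= onset_step:      # hard onset: gate flips from 0 to 1 at step 3000
        for i in recur:         # weight-tied second pass over layers 3..5
            x = layers[i](x)
    return x

# toy usage: "layers" that add their own index, to make the replay visible
layers = [lambda x, i=i: x + i for i in range(6)]
print(forward(0, layers, step=0))     # 0+1+2+3+4+5 = 15, recurrence inactive
print(forward(0, layers, step=3000))  # 15 + (3+4+5) = 27, recurrence active
```
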
Activation
LeakyReLU squared activation
parameters: {"slope":0.5}
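
One plausible reading of "LeakyReLU squared" with the recorded slope (0.5) is the square of a standard LeakyReLU; the record does not specify how the negative branch's sign is handled, so treat this as an illustration rather than the submission's exact definition:

```python
# Hedged sketch of a LeakyReLU-squared activation, slope 0.5 (assumed form:
# square of the LeakyReLU output; sign handling on the negative branch is a guess).

def leaky_relu_sq(x, slope=0.5):
    l = x if x > 0.0 else slope * x   # standard LeakyReLU
    return l * l                      # then square

print(leaky_relu_sq(2.0))   # 4.0
print(leaky_relu_sq(-2.0))  # (0.5 * -2)^2 = 1.0
```
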
Weight Averaging
EMA
parameters: {"decay":0.9965}
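
The EMA update with the recorded decay can be sketched as follows (standard formulation; the record gives only the decay value, so initialization and export details are assumptions):

```python
# Hedged sketch of EMA weight averaging with the recorded decay (0.9965).
# After each optimizer step: ema = decay * ema + (1 - decay) * weights;
# typically the EMA copy, not the raw weights, is what gets evaluated/exported.

def ema_update(ema, weights, decay=0.9965):
    for k in ema:
        ema[k] = decay * ema[k] + (1.0 - decay) * weights[k]

weights = {"w": 1.0}
ema = dict(weights)        # initialize EMA from the current weights (assumed)
weights["w"] = 2.0         # pretend an optimizer step changed the weight
ema_update(ema, weights)
print(ema["w"])            # 0.9965*1.0 + 0.0035*2.0 = 1.0035
```
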
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
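
The storage side of the int6-plus-compression pipeline can be sketched as below. GPTQ's Hessian-aware, error-compensating rounding is NOT reproduced here; this only shows naive symmetric 6-bit quantization, packing four 6-bit codes into three bytes, and byte-level compression. The record compresses with brotli; zlib is used only as a stdlib stand-in:

```python
import zlib

# Hedged sketch of 6-bit weight storage (GPTQ's rounding procedure omitted).

def quantize_int6(values):
    # symmetric per-tensor scale mapping the max magnitude to 31 (assumed range -31..31)
    scale = max(abs(v) for v in values) / 31.0
    codes = [max(-31, min(31, round(v / scale))) for v in values]
    return scale, codes

def pack6(codes):
    # offset codes to 0..63 so each fits in 6 bits, then pack 4 codes -> 3 bytes
    bits, nbits = 0, 0
    out = bytearray()
    for c in codes:
        bits = (bits << 6) | (c + 32)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                                  # pad the final partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

vals = [0.5, -1.0, 0.25, 1.0]
scale, codes = quantize_int6(vals)
packed = pack6(codes)                  # 4 codes * 6 bits = 3 bytes
blob = zlib.compress(packed)           # brotli.compress(packed) in the record's setup
print(len(packed), [round(c * scale, 3) for c in codes])
```
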
Evaluation
sliding window eval
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: null
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.72}
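
The schedule above can be sketched as a per-step LR multiplier. The record gives only `warmup_steps=20` and `warmdown_frac=0.72`; the linear shapes of the warmup and warmdown ramps are an assumption:

```python
# Hedged sketch of a warmup/plateau/warmdown LR schedule: linear warmup for
# 20 steps, constant LR, then linear decay to zero over the final 72% of
# training ("warmdown"). Ramp shapes are assumed linear.

def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.72):
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmup_steps:
        return (step + 1) / warmup_steps                            # linear warmup
    if step < warmdown_start:
        return 1.0                                                  # constant plateau
    return (total_steps - step) / (total_steps - warmdown_start)    # linear warmdown

total = 1000
print(lr_scale(0, total), lr_scale(100, total), lr_scale(500, total))
```
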

Novel Contributions

  • 3-layer recurrence over layers 3, 4, and 5 with hard onset at step 3000
  • Elimination of validation pauses to reclaim wallclock for additional training steps
  • GPTQ int6 quantization with brotli compression under the 16MB artifact limit
  • No test-time training
  • EMA with decay 0.9965 and Muon optimizer configuration