PR #1662

closed

Record: SP8192 + 3-layer recurrence + hard onset — val_bpb 1.0862 (3-seed mean)

by pablinga19
val_bpb: 1.0862
Architecture: Transformer
Optimizer:
Artifact Size: 15.94 MB

Training Techniques

Architecture: depth recurrence
3-layer recurrence stack applied on layers 3-5 with hard onset activation
parameters: {"layers":[3,4,5],"onset_step":3000}
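The recurrence-with-hard-onset idea can be sketched in a few lines. This is a hypothetical illustration, not the PR's code: `layers` is any list of callables, the block at indices 3-5 is run one extra time per forward pass, and that extra pass switches on abruptly at `onset_step` (the "hard onset").

```python
def recurrent_forward(x, layers, recur_indices=(3, 4, 5), onset_step=3000, step=0):
    """Run a layer stack; after the hard onset step, apply the
    recurrence block (here layers 3-5) one additional time.
    Illustrative sketch, not the PR's implementation."""
    for i, layer in enumerate(layers):
        x = layer(x)
        # Hard onset: the extra recurrence pass turns on all at once
        # at onset_step instead of ramping in gradually.
        if i == recur_indices[-1] and step >= onset_step:
            for j in recur_indices:
                x = layers[j](x)
    return x
```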
Quantization: GPTQ
bits: 6
scope: all
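Full GPTQ propagates rounding error across weight columns using second-order information; as a rough sketch of what int6 quantization involves, here is plain round-to-nearest symmetric quantization to the signed 6-bit range [-32, 31] (an illustration of the scale-and-round step only, not GPTQ proper):

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest int6 quantization of one weight group.
    GPTQ additionally compensates rounding error column-by-column;
    this sketch shows only the scale-and-round step."""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    return [max(-32, min(31, round(w / scale))) for w in weights], scale

def dequantize_int6(q, scale):
    """Map int6 codes back to approximate float weights."""
    return [v * scale for v in q]
```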
Compression: brotli
level: null
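brotli is not in the Python standard library, so this sketch uses zlib for the compress step; swapping in `brotli.compress` from the `brotli` package gives the PR's setup. It shows the pack-then-compress flow for an int6 artifact, storing one value per byte for clarity (no bit-packing):

```python
import zlib  # stand-in for brotli, which the PR uses for the artifact

def pack_and_compress(q_values):
    """Shift signed int6 values [-32, 31] into [0, 63], store one per
    byte, and compress the resulting artifact bytes."""
    raw = bytes((v + 32) & 0x3F for v in q_values)
    return zlib.compress(raw, level=9)

def decompress_and_unpack(blob):
    """Invert pack_and_compress: decompress, then shift back to signed."""
    return [b - 32 for b in zlib.decompress(blob)]
```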
Weight Averaging: EMA
parameters: {"decay":0.9965}
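EMA with decay 0.9965 keeps a slow-moving shadow copy of the weights, updated after each optimizer step. A minimal sketch with plain floats (real code applies this per tensor):

```python
def ema_update(shadow, params, decay=0.9965):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    With decay 0.9965 the shadow averages over roughly the last
    1 / (1 - 0.9965) ~ 286 steps."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```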
Regularization: logit softcap
parameters: {"sdclip":12.85}
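The usual form of logit soft-capping squashes logits smoothly into (-cap, cap) via cap * tanh(x / cap); interpreting the PR's sdclip parameter as that cap value is an assumption here:

```python
import math

def softcap(logits, cap=12.85):
    """Smoothly bound logits to (-cap, cap): near-identity for small
    values, saturating at +/-cap for large ones. Treats the PR's
    "sdclip" parameter as the cap (an interpretation, not confirmed)."""
    return [cap * math.tanh(x / cap) for x in logits]
```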
Other: SP8192 tokenization / sequence packing setup
parameters: {"sp":8192}
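Sequence packing at context length 8192 concatenates tokenized documents into one stream and slices it into fixed-length rows, so no training step wastes compute on padding. The PR does not spell out the SP8192 details; this is a generic sketch with a hypothetical end-of-text token id:

```python
def pack_sequences(docs, seq_len=8192, eot=0):
    """Concatenate token lists separated by an end-of-text token and
    slice into rows of exactly seq_len tokens (ragged tail dropped).
    Generic illustration; eot=0 is a placeholder id."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot)
    n_rows = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]
```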
Other: Removed mid-training validation passes to increase realized training steps within the fixed wallclock budget
parameters: {"val_loss_every":99999}
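Setting val_loss_every to 99999 effectively disables mid-training validation, since the gate never fires within the run's step budget. A hypothetical helper mirroring that config knob:

```python
def should_validate(step, val_loss_every=99999):
    """Gate for mid-training validation passes; with val_loss_every set
    far beyond the total step count, the gate never fires and all
    wallclock goes to optimizer steps."""
    return step > 0 and step % val_loss_every == 0
```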

Novel Contributions

  • Hard onset at step 3000 for the 3-layer recurrence stack
  • 3-layer recurrence on layers 3-5
  • SP8192 tokenization / sequence packing setup
  • GPTQ int6 with brotli artifact compression
  • EMA 0.9965 with SDClip 12.85
  • Skipping mid-training validation passes to maximize training steps under the time budget