PR #1663

open

Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.08625 (3-seed mean)

by pablinga19 · View on GitHub
val_bpb
1.0862
Architecture
Transformer
Optimizer
Artifact Size
15,998,501 bytes

Training Techniques

Architecture
depth recurrence
Uses a 3-layer recurrence stack in the model.
parameters: {"layers":3}
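A minimal sketch of what a depth-recurrence stack with `layers=3` does: one weight-tied block is applied three times, so effective depth triples while the parameter count stays fixed. The residual MLP block and all names below are illustrative stand-ins, not the record's actual model code.

```python
import numpy as np

def recurrent_stack(x, w1, b1, w2, b2, n_recur=3):
    """Run one weight-tied residual MLP block n_recur times (depth recurrence).

    Reusing the same weights on every pass adds effective depth without
    adding parameters. Illustrative stand-in for a transformer block.
    """
    for _ in range(n_recur):
        h = np.maximum(x @ w1 + b1, 0.0)  # shared hidden projection, ReLU
        x = x + h @ w2 + b2               # residual add, then recur
    return x
```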
Other
other
Hard recurrence onset at step 3000 rather than a smooth recurrence homotopy, delaying recurrence activation so that more of the fixed time budget goes to non-recurrent training.
parameters: {"recur_start_step":3000,"recur_homotopy":0}
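The hard onset can be sketched as a step-function gate on the recurrent branch, with the smooth homotopy as the baseline it replaces. Parameter names mirror the config above; the linear shape of the homotopy ramp is an assumption.

```python
def recurrence_gate(step, recur_start_step=3000, recur_homotopy=0):
    """Gate on the recurrent branch at a given training step.

    recur_homotopy == 0 is the hard onset used here: recurrence is fully
    off before recur_start_step and fully on from it, so every earlier
    step trains the non-recurrent path. recur_homotopy > 0 instead ramps
    the gate linearly over that many steps (the smooth baseline); the
    linear ramp shape is an assumption, not taken from the record.
    """
    if recur_homotopy == 0:
        return 1.0 if step >= recur_start_step else 0.0
    return min(1.0, max(0.0, (step - recur_start_step) / recur_homotopy))
```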
other
Suppresses mid-training validation passes to increase realized training steps within the fixed 600-second budget.
parameters: {"val_loss_every":99999}
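The effect of the oversized interval is easy to see in a sketch: with `val_loss_every=99999` and only a few thousand total steps, the modulo trigger never fires, so no training-time wall clock goes to validation. The total step count below is illustrative.

```python
def count_validation_passes(total_steps, val_loss_every):
    """Count mid-training validation passes a run would perform.

    Setting val_loss_every far above total_steps drives this to zero,
    freeing that wall-clock time for extra optimizer steps inside the
    fixed 600-second budget; final evaluation still runs afterwards.
    """
    return sum(1 for step in range(1, total_steps + 1)
               if step % val_loss_every == 0)
```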
Evaluation
sliding window eval
parameters: null
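The record does not spell out the evaluation, but a common reading of sliding-window eval is overlapping context windows where each token is scored exactly once with substantial preceding context. A sketch of the index plan under that reading; window and stride sizes are illustrative.

```python
def sliding_windows(n_tokens, window, stride):
    """Plan (window_start, window_end, score_from) spans for sliding eval.

    Each context window of length `window` advances by `stride` tokens,
    and only tokens not yet scored by an earlier window are scored, so
    every token is predicted exactly once, most with at least
    window - stride tokens of context. An assumed scheme, not the
    record's actual evaluation code.
    """
    assert 1 <= stride <= window
    spans = []
    start, scored_upto = 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```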
Weight Averaging
EMA
parameters: null
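A minimal sketch of the usual exponential-moving-average weight update; the averaged copy, not the raw training weights, would be the evaluated artifact. The decay constant is illustrative, as the record does not state it.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights.

    Called after each optimizer step; the decay value here is an
    assumption, not taken from the record.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```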
Quantization
int6
bits: 6
scope: sliding eval artifact
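A sketch of symmetric round-to-nearest 6-bit quantization of a weight tensor into the signed range [-32, 31]. Per-tensor symmetric scaling is an assumption; the record only states that the sliding-eval artifact is int6.

```python
import numpy as np

def quantize_int6(w):
    """Map weights to the 2**6 signed levels [-32, 31] via one scale.

    The scale is stored next to the codes so the artifact can be
    dequantized at load time. Per-tensor scaling is assumed here.
    """
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return codes, scale

def dequantize_int6(codes, scale):
    """Reconstruct approximate float weights from int6 codes."""
    return codes.astype(np.float32) * scale
```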

Novel Contributions

  • Hard activation of recurrence at step 3000 instead of smooth onset
  • 3-layer recurrence stack while keeping the training stack from PR #1394 fixed
  • Use of VAL_LOSS_EVERY=99999 to avoid mid-training validation and gain more training steps under the time budget
  • 3-seed mean sliding val_bpb of 1.08625