PR #1926

Status: open

Record: SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)

val_bpb: 1.0684
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,950,458 bytes

Training Techniques

Quantization
  • GPTQ (bits: null, scope: all)

Architecture
  • SmearGate: per-layer smoothing and attention output gating. parameters: {"width": 24}
  • Gated Attention: attention output gating with increased gate width. parameters: {"width": 24}
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"Newton-Schulz": true})
Test-Time Training
  • LoRA TTT (rank: 128, phased: true, score_first: true)
LR Schedule
  • warmdown (min_lr: 0.1)
Regularization
  • logit softcap (parameters: null)
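
Logit soft-capping squashes the output logits smoothly into a bounded range before the loss. The record lists no parameters, so the cap value below is purely illustrative:

    import torch

    def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
        # Smoothly bound logits to (-cap, cap); cap=30.0 is an illustrative value.
        return cap * torch.tanh(logits / cap)
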
Other
  • Per-head query-key attention scaling (qk_gain_init: 5.25); see the sketch after this list.
  • Reserve less time for GPTQ to maximize training steps (gptq_reserve_seconds: 0.5).
  • Disable mid-training validation loss evaluation to save time and increase training steps (val_loss_every: 0).

Novel Contributions

  • Activated PR #1874's intended settings via environment variables, without code changes (see the configuration sketch after this list).
  • Used MIN_LR=0.10 to prevent learning-rate collapse during warmdown.
  • Used QK_GAIN_INIT=5.25 to scale query-key attention.
  • Used GATE_ATTN_WIDTH=24 to increase attention gate capacity.
  • Used GPTQ_RESERVE_SECONDS=0.5 to maximize training steps.
  • Used VAL_LOSS_EVERY=0 to eliminate mid-training evaluation overhead.
  • Reported 3-seed mean val_bpb of 1.06844 under SP8192 rules.
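
For reference, the overrides above can be supplied purely through the environment. The snippet below is a sketch of that pattern, assuming PR #1874's training script reads these exact variable names (the names come from this record; the fallback defaults are illustrative):

    import os

    # Set the record's overrides without touching the training code.
    os.environ.update({
        "MIN_LR": "0.10",
        "QK_GAIN_INIT": "5.25",
        "GATE_ATTN_WIDTH": "24",
        "GPTQ_RESERVE_SECONDS": "0.5",
        "VAL_LOSS_EVERY": "0",
    })

    # Typical pattern inside a training script for picking them up:
    min_lr = float(os.environ.get("MIN_LR", "0.0"))               # illustrative default
    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", "0"))   # 0 disables mid-training eval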