PR #1926

Status: open

Record: SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)

val_bpb: 1.0684
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,950,458 bytes

Training Techniques

Quantization
  • GPTQ (bits: null, scope: all)

Architecture
  • SmearGate: per-layer smoothing and attention output gating. parameters: {"width": 24}
  • Gated Attention: attention output gating with increased gate width. parameters: {"width": 24}
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"Newton-Schulz": true})
Test-Time Training
  • LoRA TTT (rank: 128, phased: true, score_first: true)
LR Schedule
  • warmdown (min_lr: 0.1)
Regularization
  • logit softcap (parameters: null)
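
Logit soft-capping squashes the output logits smoothly into a bounded range before the loss. The record lists no parameters, so the cap value below is purely illustrative:

    import torch

    def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
        # Smoothly bound logits to (-cap, cap); cap=30.0 is an illustrative value.
        return cap * torch.tanh(logits / cap)
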
Other
  • Per-head query-key attention scaling (qk_gain_init: 5.25); see the sketch after this list.
  • Reserve less time for GPTQ to maximize training steps (gptq_reserve_seconds: 0.5).
  • Disable mid-training validation loss evaluation to save time and increase training steps (val_loss_every: 0).

Novel Contributions

  • Activated PR #1874's intended settings via environment variables, without code changes (see the configuration sketch after this list).
  • Used MIN_LR=0.10 to prevent learning-rate collapse during warmdown.
  • Used QK_GAIN_INIT=5.25 to scale query-key attention.
  • Used GATE_ATTN_WIDTH=24 to increase attention gate capacity.
  • Used GPTQ_RESERVE_SECONDS=0.5 to maximize training steps.
  • Used VAL_LOSS_EVERY=0 to eliminate mid-training evaluation overhead.
  • Reported 3-seed mean val_bpb of 1.06844 under SP8192 rules.
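
For reference, the overrides above can be supplied purely through the environment. The snippet below is a sketch of that pattern, assuming PR #1874's training script reads these exact variable names (the names come from this record; the fallback defaults are illustrative):

    import os

    # Set the record's overrides without touching the training code.
    os.environ.update({
        "MIN_LR": "0.10",
        "QK_GAIN_INIT": "5.25",
        "GATE_ATTN_WIDTH": "24",
        "GPTQ_RESERVE_SECONDS": "0.5",
        "VAL_LOSS_EVERY": "0",
    })

    # Typical pattern inside a training script for picking them up:
    min_lr = float(os.environ.get("MIN_LR", "0.0"))               # illustrative default
    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", "0"))   # 0 disables mid-training eval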