PR #1867

open

Train gpt 0427 - 1.708bpb

by lijuncheng16
val_bpb
1.7080
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
XSA
XSA applied across all layers (XSA_LAST_N=11).
parameters: {"layers":"all"}
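
The PR does not define XSA's internals, but the layers:"all" setting together with XSA_LAST_N=11 implies a count-from-the-top switch that here happens to cover the whole stack. A minimal sketch of that selection logic, assuming an 11-layer model; the function name is hypothetical:

    def xsa_layers(n_layers: int, xsa_last_n: int) -> list[int]:
        # XSA is applied to the last xsa_last_n blocks; when
        # xsa_last_n >= n_layers this covers every layer ("all").
        start = max(0, n_layers - xsa_last_n)
        return list(range(start, n_layers))

    xsa_layers(11, 11)  # -> [0, 1, ..., 10], i.e. all layers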
Quantization
GPTQ
parameters: {"bits":null,"scope":"grouped"}
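
GPTQ proper adds Hessian-aware rounding and error compensation; the sketch below shows only the grouped-scaling part that "scope: grouped" refers to. bits is recorded as null, so the 4-bit setting is an assumption; group_size=64 comes from the contributions list below.

    import torch

    def quantize_grouped(w: torch.Tensor, bits: int = 4, group_size: int = 64):
        # Symmetric per-group quantization along the input dimension.
        out_f, in_f = w.shape
        assert in_f % group_size == 0
        g = w.reshape(out_f, in_f // group_size, group_size)
        qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
        scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax).to(torch.int8)
        return q, scale  # dequantize: (q.float() * scale).reshape(w.shape)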
Other
other
Sparse Matrix Tuning enabled by default with block_size=64, keep_frac=0.25, skip_embed=1.
parameters: {"block_size":64,"keep_frac":0.25,"skip_embed":1}
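
The SMT implementation itself is not in this summary; one plausible reading of block_size=64 and keep_frac=0.25 is a block-wise update mask that keeps the top quarter of 64x64 weight blocks and freezes the rest, skipping embeddings per skip_embed=1. A hypothetical sketch of that selection step, scoring blocks by mean gradient magnitude:

    import torch

    def smt_block_mask(grad: torch.Tensor, block_size: int = 64,
                       keep_frac: float = 0.25) -> torch.Tensor:
        # Score each (block_size x block_size) block by mean |grad| and
        # keep the top keep_frac; masked-out blocks receive no updates.
        r, c = grad.shape
        assert r % block_size == 0 and c % block_size == 0
        gb = grad.abs().reshape(r // block_size, block_size,
                                c // block_size, block_size)
        scores = gb.mean(dim=(1, 3))
        k = max(1, int(keep_frac * scores.numel()))
        thresh = scores.flatten().topk(k).values.min()
        keep = scores >= thresh
        return keep[:, None, :, None].expand_as(gb).reshape(r, c)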
other
Eval-time logit bias hook infrastructure added (ETLB_ENABLED=0 by default).
parameters: {"enabled":false,"lr":0.05,"steps":5,"clip":3}
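
ETLB ships disabled (ETLB_ENABLED=0) and the hook body is not shown here; the lr/steps/clip parameters suggest a short optimization of an additive vocabulary bias at eval time. A speculative sketch under that reading, with all names hypothetical:

    import torch
    import torch.nn.functional as F

    def eval_time_logit_bias(logits: torch.Tensor, targets: torch.Tensor,
                             lr: float = 0.05, steps: int = 5,
                             clip: float = 3.0) -> torch.Tensor:
        # Fit one additive bias over the vocab with a few SGD steps,
        # clamping it to [-clip, clip] after each update.
        logits = logits.detach()
        bias = torch.zeros(logits.size(-1), device=logits.device,
                           requires_grad=True)
        opt = torch.optim.SGD([bias], lr=lr)
        for _ in range(steps):
            loss = F.cross_entropy((logits + bias).view(-1, logits.size(-1)),
                                   targets.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                bias.clamp_(-clip, clip)
        return logits + bias.detach()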
other
QK_GAIN_INIT retuned to 5.25.
parameters: {"qk_gain_init":5.25}
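
QK_GAIN_INIT=5.25 reads as the initial value of a gain applied to query-key attention scores. A minimal sketch, assuming a single learnable scalar (the PR does not say whether the gain is per-head or trainable):

    import torch
    import torch.nn as nn

    class QKGain(nn.Module):
        # Scales raw query-key scores; initialized from QK_GAIN_INIT.
        def __init__(self, qk_gain_init: float = 5.25):
            super().__init__()
            self.gain = nn.Parameter(torch.tensor(qk_gain_init))

        def forward(self, qk_scores: torch.Tensor) -> torch.Tensor:
            return qk_scores * self.gain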
Test-Time Training
TTT
parameters: {"enabled":false}
Weight Averaging
EMA
parameters: null
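
No EMA parameters are recorded here, so the decay below is an assumption; the update itself is the standard exponential moving average over model weights:

    import torch

    @torch.no_grad()
    def ema_update(ema_params, model_params, decay: float = 0.999):
        # ema <- decay * ema + (1 - decay) * w, applied parameter-wise.
        # decay=0.999 is assumed; the PR records parameters: null.
        for e, w in zip(ema_params, model_params):
            e.mul_(decay).add_(w, alpha=1 - decay)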
LR Schedule
warmdown
parameters: null
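
In speedrun-style training, "warmdown" usually means holding the LR constant and then decaying it linearly to zero over the final stretch. No parameters are recorded, so the warmdown fraction below is an assumption:

    def warmdown_lr(step: int, total_steps: int, base_lr: float,
                    warmdown_frac: float = 0.2) -> float:
        # Constant LR, then a linear ramp to zero over the last
        # warmdown_frac of training (fraction assumed, not from the PR).
        start = int((1 - warmdown_frac) * total_steps)
        if step < start:
            return base_lr
        return base_lr * (total_steps - step) / max(1, total_steps - start)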

Novel Contributions

  • SMT enabled by default with block_size=64, keep_frac=0.25, skip_embed=1
  • XSA extended to all layers
  • GPTQ grouped quantization with group size 64
  • ETLB infrastructure added for eval-time logit biasing
  • QK_GAIN_INIT retuned to 5.25
  • TTT made opt-in (disabled by default)