PR #339 (open)

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)

by sheeki03
val_bpb: 1.1364
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 16.17 MB

Training Techniques

Quantization
mixed int6
bits: 6
scope: model weights
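As a rough illustration of what a 6-bit weight scheme looks like (symmetric per-tensor scaling is an assumption here; the PR's "mixed" scoping may quantize different tensors differently):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to 6 bits: integers in [-31, 31]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.5, -0.25, 0.1], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)  # round-trip error is at most 0.5 * scale
```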
Architecture
Backout
Learned residual subtraction from a mid-network hidden state; subtracts lambda * h_mid from the final representation.
parameters: {"layer":5,"lambda_init":0.2}
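The Backout connection described above can be sketched in a few lines (function name and tensor shapes are illustrative, not the PR's actual code; in training, lam would be a learned parameter initialized to lambda_init):

```python
import numpy as np

def backout(h_final: np.ndarray, h_mid: np.ndarray, lam: float = 0.2) -> np.ndarray:
    """Subtract a scaled copy of the mid-network (layer-5) hidden state
    from the final representation: h_final - lam * h_mid."""
    return h_final - lam * h_mid

h_final = np.ones((4, 8))  # (seq, d_model), toy sizes
h_mid = np.ones((4, 8))    # hidden state captured after layer 5
out = backout(h_final, h_mid)  # 1.0 - 0.2 * 1.0 = 0.8 everywhere
```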
SmearGate
Custom gating component used in the model architecture.
parameters: null
BigramHash
Bigram hashing component for vocabulary/features.
parameters: {"vocab_size":4096}
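One plausible reading of the BigramHash component, hashing adjacent token pairs into a fixed number of buckets (the mixing constant and hash form are assumptions, not the PR's implementation):

```python
def bigram_hash(tokens, vocab_size=4096):
    # Map each adjacent token pair (a, b) to one of vocab_size buckets.
    # The multiply-xor mix is a common cheap pair hash; any stable mix works.
    return [((a * 1_000_003) ^ b) % vocab_size for a, b in zip(tokens, tokens[1:])]
```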
MLP3x
Expanded MLP width to 3x.
parameters: null
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
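With 8 query heads over 4 KV heads (grouped-query attention), each KV head is shared by 2 query heads; a toy shape check (sizes other than the head counts are illustrative):

```python
import numpy as np

heads, kv_heads = 8, 4
group = heads // kv_heads  # query heads per shared KV head (here 2)

k = np.zeros((kv_heads, 16, 64))  # (kv_heads, seq_len, head_dim), toy sizes
# Broadcast each KV head to its group of query heads before attention.
k_expanded = np.repeat(k, group, axis=0)  # (heads, seq_len, head_dim)
```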
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04}
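The momentum warmup parameters suggest a schedule along these lines (linear interpolation from 0.92 to the final 0.99 over 1500 steps is an assumption):

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon momentum linearly from `start` to `end`, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```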
Weight Averaging
SWA
parameters: {"checkpoints_averaged":6}
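Averaging the 6 saved checkpoints can be sketched as follows (the dict-of-arrays state format is illustrative; which checkpoints are selected is not specified here):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts: name -> np.ndarray."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}

# Toy usage: six checkpoints whose single weight is 0.0, 1.0, ..., 5.0.
ckpts = [{"w": np.full(3, float(i))} for i in range(6)]
avg = average_checkpoints(ckpts)  # "w" averages to 2.5
```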
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
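Sliding-window evaluation with a short stride typically scores only the newest stride tokens of each window, so every position keeps long left context. A sketch of the span layout (the window size equaling the 2048 train length is an assumption):

```python
def sliding_eval_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    """Return (start, end, n_scored) spans: each span scores the final
    `n_scored` tokens using up to `window` tokens of left context."""
    spans = []
    scored = 0
    while scored < n_tokens:
        n_new = min(stride, n_tokens - scored)
        start = max(0, scored + n_new - window)
        spans.append((start, scored + n_new, n_new))
        scored += n_new
    return spans

spans = sliding_eval_spans(200, window=128, stride=64)
```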
Initialization
OrthoInit
Orthogonal initialization used for the model.
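A standard orthogonal initialization via QR decomposition (numpy sketch; the PR's actual init may differ in gain or shape handling):

```python
import numpy as np

def orthogonal_init(shape, gain: float = 1.0, rng=None):
    """Draw a Gaussian matrix and orthogonalize its columns via QR."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    # Multiply columns by the sign of R's diagonal so the result is
    # uniformly distributed over orthogonal matrices.
    q = q * np.sign(np.diag(r))
    return gain * q

W = orthogonal_init((8, 4))  # columns are orthonormal
```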
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
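Together these parameters imply a trapezoidal schedule: linear warmup for 1500 steps, a constant plateau, then a linear warmdown over the last 3000 iterations. A sketch (the total step count is illustrative):

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 1500, warmdown_iters: int = 3000) -> float:
    """Multiplier on the base learning rate: warmup -> plateau -> warmdown."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```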
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}

Novel Contributions

  • Backout Connection: learned residual subtraction from a mid-network hidden state
  • Improved validation bpb relative to the PR #198 baseline on the same hardware/run setup
  • SWA with int6 mixed quantization and zstd compression
  • Potential future artifact-size reduction via INT5_MLP=1