PR #295
open
[Record Submission] QAT Int5/Int6 + Backout + U-Net Skips + BigramHash(10240) + SWA50 — val_bpb=1.1477
by gowtham0992
val_bpb
1.1477
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.94 MB
Training Techniques
Quantization
STE QAT
bits: 5
scope: MLP
STE QAT
bits: 6
scope: attention
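The two QAT entries above can be illustrated with a minimal sketch of fake quantization plus a straight-through estimator. This is not the PR's code: the symmetric per-tensor scaling, the function names, and the toy weights are all assumptions made for illustration.

```python
def fake_quantize(weights, bits):
    """Symmetric per-tensor fake quantization (an assumed scheme):
    snap each weight to a signed integer grid, then rescale to float."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)                   # nearest grid point
        q = max(-qmax, min(qmax, q))           # clamp to the integer range
        out.append(q * scale)                  # dequantize for the forward pass
    return out


def ste_backward(grad_output):
    """Straight-through estimator: rounding is treated as the identity
    in the backward pass, so the gradient passes through unchanged."""
    return list(grad_output)


toy_weights = [0.50, -0.26, 0.031, 0.0]        # hypothetical values
mlp_q = fake_quantize(toy_weights, bits=5)     # MLP scope in the summary
attn_q = fake_quantize(toy_weights, bits=6)    # attention scope in the summary
```

With fewer bits the grid is coarser, so the int5 path accepts a larger rounding error in exchange for a smaller artifact.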
Architecture
Backout
Learned residual subtraction from the final output using a midpoint activation.
parameters: {"lambda_init":0.2}
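One possible reading of Backout, sketched under assumptions: a single learned scalar, initialized to `lambda_init`, scales a midpoint activation that is subtracted from the final hidden state. The summary does not say which layer counts as the midpoint or whether the coefficient is a scalar, so both are guesses here.

```python
def backout(final_hidden, mid_hidden, lam):
    """Subtract a scaled copy of a midpoint activation from the final output.
    `lam` would be a learned parameter; a scalar is assumed for this sketch."""
    return [f - lam * m for f, m in zip(final_hidden, mid_hidden)]


lam = 0.2                                   # lambda_init from the summary
final_hidden = [1.0, 2.0, 3.0]              # hypothetical activations
mid_hidden = [0.5, 1.0, -1.0]               # hypothetical midpoint activations
backed_out = backout(final_hidden, mid_hidden, lam)
```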
U-Net skip connections
Encoder-decoder skip connections with learned per-dimension skip weights.
parameters: {"encoder_layers":5,"decoder_layers":5}
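A minimal sketch of how 5 encoder and 5 decoder layers could be wired with per-dimension skip weights. The last-in, first-out pairing and adding the skip before each decoder layer are assumptions, and the layers are toy stand-ins rather than transformer blocks.

```python
def run_unet_stack(x, encoder_layers, decoder_layers, skip_weights):
    """Run encoder layers, saving each output, then run decoder layers,
    adding the matching encoder output scaled per dimension."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer, w in zip(decoder_layers, skip_weights):
        skip = skips.pop()                  # deepest encoder output first (assumed)
        x = [xi + wi * si for xi, wi, si in zip(x, w, skip)]
        x = layer(x)
    return x


def add_one(v):
    """Toy layer: a stand-in for a transformer block."""
    return [a + 1.0 for a in v]


dim = 2
encoders = [add_one] * 5
decoders = [add_one] * 5
no_skips = run_unet_stack([0.0] * dim, encoders, decoders, [[0.0] * dim] * 5)
full_skips = run_unet_stack([0.0] * dim, encoders, decoders, [[1.0] * dim] * 5)
```

With all skip weights at zero the stack reduces to a plain 10-layer chain, which is why learned weights let the model decide how much of each encoder output to reuse.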
BigramHash
Hashes consecutive token pairs into a bucketed embedding table.
parameters: {"dimensions":10240}
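The bucketing step can be sketched as follows, assuming the `dimensions` value of 10240 is the number of hash buckets. The mixing function and its multiplier are hypothetical; the PR's actual hash is not given in the summary.

```python
NUM_BUCKETS = 10240                         # assumed meaning of "dimensions"


def bigram_bucket(prev_token, token, num_buckets=NUM_BUCKETS):
    """Map a (previous, current) token pair to a bucket index.
    The multiplier is an arbitrary odd constant chosen for illustration."""
    return (prev_token * 1000003 + token) % num_buckets


def bigram_ids(tokens, num_buckets=NUM_BUCKETS):
    """One bucket id per position; position 0 has no predecessor and is
    assigned bucket 0 here (an assumption)."""
    if not tokens:
        return []
    ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, num_buckets))
    return ids


ids = bigram_ids([5, 7, 5, 7])              # hypothetical token ids
```

Each id would then index a row of the bucketed embedding table; distinct pairs that collide simply share a row.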
SmearGate
Blends each token with the previous token's embedding.
parameters: null
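A sketch of the blending described above, assuming a single sigmoid gate shared across dimensions and a convex mix of the current and previous embeddings. The gate shape and the exact mixing form are assumptions, since the summary lists no parameters.

```python
import math


def smear(embeddings, gate_logit):
    """Blend each token embedding with the previous token's embedding.
    g = sigmoid(gate_logit); position 0 has no predecessor and is kept as is."""
    if not embeddings:
        return []
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        cur, prev = embeddings[t], embeddings[t - 1]
        out.append([(1.0 - g) * c + g * p for c, p in zip(cur, prev)])
    return out


smeared = smear([[1.0, 0.0], [0.0, 1.0]], gate_logit=0.0)   # g = 0.5
```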
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
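With 8 query heads and 4 KV heads, each KV head serves 2 query heads. The sketch below shows one common mapping, assuming contiguous grouping; the function name is hypothetical.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Return the KV head index that a query head reads from,
    assuming query heads are grouped contiguously."""
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head
    return query_head // group_size


head_map = [kv_head_for(h) for h in range(8)]
```

Halving the KV head count halves the key and value projection parameters, which matters under a fixed artifact-size budget.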
MLP3x
MLP uses 3x expansion.
parameters: {"hidden_size":1536}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"scalar_lr":0.02}
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.4}
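A sketch of the averaging schedule, assuming `start_frac` is the fraction of total training steps after which snapshots begin. The summary does not define it, and another plausible reading is a threshold on the learning-rate scale during warmdown. The step count and weights are toy values.

```python
def should_snapshot(step, total_steps, every_steps=50, start_frac=0.4):
    """True when this step's weights join the average (assumed semantics)."""
    return step >= start_frac * total_steps and step % every_steps == 0


class RunningAverage:
    """Incremental mean of weight snapshots."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]


total_steps = 1000                          # hypothetical run length
swa = RunningAverage()
for step in range(1, total_steps + 1):
    weights = [float(step)]                 # stand-in for real model weights
    if should_snapshot(step, total_steps):
        swa.update(weights)
```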
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
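Sliding-window evaluation with a stride of 64 can be sketched as below: each window after the first advances by the stride and only its newest tokens are scored, so every token is scored once with as much left context as the window allows. The window length of 128 is a hypothetical value; the summary gives only the stride.

```python
def sliding_windows(num_tokens, window, stride=64):
    """Return (start, end, score_from) triples. The model sees tokens
    [start, end) and the loss is counted only on [score_from, end)."""
    spans = []
    scored_upto = 0
    while scored_upto < num_tokens:
        if scored_upto == 0:
            end = min(num_tokens, window)   # first window scores everything it sees
        else:
            end = min(num_tokens, scored_upto + stride)
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
    return spans


spans = sliding_windows(num_tokens=300, window=128, stride=64)
```

A smaller stride gives each scored token more context at the cost of more forward passes.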
Initialization
SVD spectral init
Tied embeddings initialized with spectral decay following a 1/sqrt(k) profile.
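To make the 1/sqrt(k) profile concrete, the sketch below builds a small matrix whose k-th singular value is 1/sqrt(k), using random orthonormal singular vectors from Gram-Schmidt. The 1-based indexing, the overall scale, the random bases, and the toy shape are assumptions; the PR may construct its initialization differently.

```python
import math
import random


def orthonormal_rows(count, dim, rng):
    """Gram-Schmidt on Gaussian vectors: `count` orthonormal rows of length `dim`."""
    rows = []
    while len(rows) < count:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:                     # retry on a degenerate draw
            rows.append([a / norm for a in v])
    return rows


def spectral_init(vocab, dim, seed=0):
    """Return (weight, sigmas) with weight = sum_k sigma_k * u_k * v_k^T
    and sigma_k = 1 / sqrt(k) for k = 1..min(vocab, dim)."""
    rng = random.Random(seed)
    rank = min(vocab, dim)
    sigmas = [1.0 / math.sqrt(k) for k in range(1, rank + 1)]
    u = orthonormal_rows(rank, vocab, rng)  # left singular vectors
    v = orthonormal_rows(rank, dim, rng)    # right singular vectors
    weight = [[sum(s * u[k][i] * v[k][j] for k, s in enumerate(sigmas))
               for j in range(dim)]
              for i in range(vocab)]
    return weight, sigmas


weight, sigmas = spectral_init(vocab=6, dim=4)   # toy sizes
```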
OrthoInit
Orthogonal initialization with muP-scaled output projections.
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
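A sketch of a warmdown schedule, assuming a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The linear shape and the total step count are assumptions; the summary gives only the warmdown length.

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0                          # constant phase
    return max(0.0, (total_steps - step) / warmdown_iters)


total_steps = 10000                         # hypothetical run length
scales = [lr_scale(s, total_steps) for s in (0, 7000, 8500, 10000)]
```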
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
magnitude pruning
parameters: {"prune_frac":0.08}
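Magnitude pruning with `prune_frac` 0.08 can be sketched as zeroing the 8% of weights with the smallest absolute value. Pruning per tensor and rounding the count down are assumptions, as is when the step runs relative to quantization and compression.

```python
def magnitude_prune(weights, prune_frac=0.08):
    """Zero out the smallest-magnitude fraction of a flat weight list."""
    k = int(len(weights) * prune_frac)      # number of weights to drop
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical weights with alternating signs and growing magnitude.
toy = [(-1) ** i * (i + 1) / 10 for i in range(25)]
pruned = magnitude_prune(toy)
```

Exact zeros form repeated byte patterns, which tends to help a general-purpose compressor such as zstd.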
Novel Contributions
- Quantization-aware training (QAT) with a straight-through estimator (STE), using int5 for the MLP and int6 for attention
- Backout: learned residual subtraction from the final output
- U-Net skip connections with learned per-dimension skip weights
- SVD embedding initialization with 1/sqrt(k) spectral decay