PR #249

open

non-record: int6 3xMLP + cosine warmdown (1.1704 bpb)

by kvmukilan
val_bpb: 1.1704
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB

Training Techniques

Quantization
  int6
    bits: 6
    scope: all
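A minimal sketch of symmetric per-tensor int6 quantization (illustrative only; the PR's actual scaling granularity and storage layout are not specified here):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to the int6 range [-32, 31]."""
    scale = float(np.abs(w).max()) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)  # int6 values in int8 storage
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

At 6 bits per weight the quantized tensor needs 25% fewer bytes than int8, which is the "freed artifact budget" referenced below.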
Architecture
  MLP3x
    Widened the MLP from 2x to 3x using the artifact budget freed by int6 quantization.
    parameters: {"hidden": 1536}
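The budget trade can be sketched with rough arithmetic (hypothetical: assumes d_model = 512 so that 3x width gives the stated hidden = 1536; since quantization scope is "all", the non-MLP weights shrink too, which covers the remainder):

```python
# Assumption: d_model = 512, so 3 * 512 = 1536 matches the PR's hidden size.
d_model = 512
mlp_params_2x = 2 * d_model * (2 * d_model)  # up + down projection weights at 2x
mlp_params_3x = 2 * d_model * (3 * d_model)  # up + down projection weights at 3x

bytes_2x_int8 = mlp_params_2x * 8 / 8        # 1.0 byte per weight
bytes_3x_int6 = mlp_params_3x * 6 / 8        # 0.75 bytes per weight
growth = bytes_3x_int6 / bytes_2x_int8       # int6 recovers most, not all, of the widening
```

The MLP alone still grows (1.5x params at 0.75x bytes/weight is a net 1.125x), so the rest of the freed budget comes from int6 applied to all other tensors.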
  RoPE
    Set the RoPE base to 50,000 via an environment variable.
    parameters: {"base": 50000}
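For reference, the base enters RoPE through the per-pair inverse frequencies roughly as follows (a sketch; the head dimension of 64 is illustrative, not taken from the PR):

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    """Per-pair inverse rotary frequencies; a larger base slows the
    slowest-rotating pairs, stretching the longest positional wavelength."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

inv_50k = rope_inv_freq(64, base=50_000.0)
inv_10k = rope_inv_freq(64, base=10_000.0)
```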
  KV head count
    Manually repeats KV heads for GQA compatibility instead of using enable_gqa.
    parameters: null
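Manual KV repetition for grouped-query attention can be sketched like this (NumPy stand-in for the actual tensor code; shapes and head counts are assumptions):

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Repeat each KV head n_rep times along the head axis so grouped-query
    attention can run through a plain multi-head attention kernel."""
    return np.repeat(kv, n_rep, axis=1)

k = np.zeros((2, 4, 16, 64))   # (batch, n_kv_heads=4, seq, head_dim)
k_rep = repeat_kv(k, n_rep=3)  # now matches 12 query heads
```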
Initialization
  OrthoInit
    Orthogonal weight initialization for all linear layers except zero-init layers.
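A NumPy sketch of orthogonal initialization via QR decomposition (the PR presumably uses a framework initializer; this only illustrates the idea, and the shapes are made up):

```python
import numpy as np

def orthogonal_init(rows: int, cols: int, seed: int = 0) -> np.ndarray:
    """Orthogonal matrix via QR of a Gaussian draw; the sign correction
    on R's diagonal makes the factorization unique."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diagonal(r))
    return q if rows >= cols else q.T

w = orthogonal_init(512, 128)  # tall matrix: orthonormal columns
```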
Compression
  zstd
    level: 22
Evaluation
  sliding window eval
    parameters: {"stride": 64}
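Sliding-window evaluation typically slides a fixed context over the sequence and scores only the last `stride` tokens of each window, so every token is scored once with extra left context (a sketch; the window size of 128 is an assumption, only the stride of 64 comes from the PR):

```python
def sliding_windows(n_tokens: int, window: int, stride: int):
    """Plan eval spans: each span scores only its trailing chunk of tokens,
    with up to `window - stride` preceding tokens kept as context."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, end - pos))  # (context start, span end, tokens scored)
        pos = end
    return spans

spans = sliding_windows(n_tokens=300, window=128, stride=64)
```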
LR Schedule
  cosine warmdown
    parameters: {"formula": "0.5 * (1 + cos(pi * progress))"}
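The stated formula decays the learning rate from its base value to zero; a direct translation (the step/total-steps framing is illustrative):

```python
import math

def cosine_warmdown(step: int, total_steps: int, base_lr: float) -> float:
    """LR multiplier follows 0.5 * (1 + cos(pi * progress)), decaying to 0."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

lrs = [cosine_warmdown(s, 1000, 1e-3) for s in (0, 500, 1000)]
```

Unlike linear decay, the cosine shape spends more steps near the base LR early on and flattens out near zero at the end.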
Regularization
  weight decay
    parameters: {"optimizer": "Muon WD"}

Novel Contributions

  • Switched from int8 to int6 quantization to free artifact budget
  • Used the freed budget to widen the MLP from 2x to 3x
  • Replaced linear LR decay with cosine warmdown
  • Applied orthogonal initialization to linear layers
  • Used zstd level 22 for artifact compression
  • Implemented manual KV head repeat for GQA compatibility