PR #243 (closed)

Record: Int6 3xMLP + Cosine Warmdown (val_bpb=1.1704)

by kvmukilan
val_bpb: 1.1704
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB

Training Techniques

Quantization: int6
parameters: {"bits": 6, "scope": "all"}
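
A minimal sketch of int6 quantization with a straight-through estimator (STE), assuming symmetric per-tensor scaling; the PR's exact scheme may differ:

```python
import torch

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize a weight tensor to signed int6 (range [-32, 31]).
    The forward pass sees the quantized values; the straight-through
    estimator passes gradients to w as if quantization were identity."""
    scale = w.abs().max().clamp(min=1e-8) / 31.0  # per-tensor scale (assumption)
    w_q = (w / scale).round().clamp(-32, 31) * scale
    return w + (w_q - w).detach()  # forward = w_q, backward = identity
```

At export time the 6-bit integers and their scales are what get stored, which is where the artifact-size savings come from.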
Architecture: MLP3x
Expanded the MLP hidden width from 2x to 3x the model dimension (hidden=1536) to increase parameter capacity.
parameters: {"hidden": 1536}
Architecture: RoPE
Increased the RoPE base from 10000 to 50000, stretching the rotation wavelengths for finer long-range position resolution.
parameters: {"base": 50000}
Initialization: OrthoInit
Orthogonal initialization for all linear layers that are not zero-initialized, to improve gradient flow.
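
Roughly what this might look like in PyTorch; detecting zero-init layers by their current values is an assumption about how the PR identifies them:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_orthogonal_init(model: nn.Module) -> None:
    """Orthogonally initialize every Linear weight that is not
    deliberately zero-initialized (those are left at zero)."""
    for m in model.modules():
        if isinstance(m, nn.Linear) and m.weight.abs().sum() > 0:
            nn.init.orthogonal_(m.weight)
```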
Compression: zstd
parameters: {"level": 22}
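
A sketch of artifact packaging using the zstandard Python binding (an assumption; any zstd implementation supporting level 22, the maximum standard level, would do):

```python
import io
import torch
import zstandard as zstd  # pip install zstandard

def save_artifact(state_dict: dict, path: str) -> None:
    """Serialize the checkpoint, then compress with zstd at level 22."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=22).compress(buf.getvalue()))
```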
Evaluation: sliding window eval
parameters: {"stride": 64}
LR Schedule: cosine warmdown
parameters: null
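
The record lists no parameters for the schedule, so the warmdown fraction below is an assumed placeholder; the shape is the standard one (hold the base LR, then cosine-decay to zero):

```python
import math

def lr_cosine_warmdown(step: int, total_steps: int, base_lr: float,
                       warmdown_frac: float = 0.4) -> float:
    """Constant LR, then cosine decay to zero over the final
    warmdown_frac of training (fraction is an assumption)."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```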
Regularization: weight decay
parameters: {"weight_decay": 0.02}

Novel Contributions

  • Int6 STE quantization to reduce artifact size and enable a wider model within the 16 MB budget
  • 3x MLP width expansion (hidden=1536)
  • Cosine warmdown learning rate schedule
  • Orthogonal initialization
  • RoPE base increased to 50000