PR #246

closed

non-record: int6 3xMLP + cosine warmdown (1.1704 bpb)

by kvmukilan
val_bpb: 1.1704
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
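A minimal sketch of what symmetric per-tensor int6 quantization could look like. The PR only states bits=6 and scope=all; the symmetric scheme, the single per-tensor scale, and the clamp range are assumptions here.

```python
# Hypothetical symmetric int6 quantization: 6 signed bits span [-32, 31];
# a symmetric scheme clamps to [-31, 31] around a single per-tensor scale.
Q_MAX = 31

def quantize_int6(weights):
    """Map floats to 6-bit ints plus one per-tensor scale (assumed scheme)."""
    scale = max(abs(w) for w in weights) / Q_MAX or 1.0
    q = [max(-Q_MAX, min(Q_MAX, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 0.9]
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

Under this scheme the round-trip error is bounded by scale / 2 (about 0.019 for the toy weights above).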
Architecture
MLP3x
Widened the MLP from 2x to 3x using artifact space freed by int6 quantization.
parameters: {"hidden":1536}
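Back-of-envelope accounting for why int6 frees room for a 3x MLP. The numbers are assumptions: d_model = 512 is inferred from hidden = 1536 = 3 * 512, and a standard Q/K/V/O attention layout with a gateless two-matrix MLP and no biases is assumed (the PR states neither).

```python
# Per-layer byte budget for the int8/2x -> int6/3x swap (assumed layout:
# 4*d^2 attention params, 2*mult*d^2 MLP params, no biases).
d = 512  # assumed model dim, inferred from hidden = 1536 = 3 * 512

def layer_bytes(mlp_mult, bits):
    attn = 4 * d * d              # Q, K, V, O projections
    mlp = 2 * mlp_mult * d * d    # up- and down-projection
    return (attn + mlp) * bits // 8

before = layer_bytes(2, 8)  # int8 weights, 2x MLP
after = layer_bytes(3, 6)   # int6 weights, 3x MLP
```

Even with 1.5x more MLP parameters, the 6/8 byte ratio leaves the per-layer artifact smaller than before; GQA would shrink the K/V terms further.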
LR Schedule
cosine warmdown
parameters: {"formula":"0.5 * (1 + cos(pi * progress))"}
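The schedule formula from the parameters above, translated directly into code. Scaling by a base LR is an assumption; the PR gives only the multiplier.

```python
import math

def cosine_warmdown(base_lr, progress):
    """LR from the PR's formula, 0.5 * (1 + cos(pi * progress)), where
    progress runs from 0 (start of training) to 1 (end). The multiplier
    decays smoothly from 1 to 0, unlike a linear ramp."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At progress 0 the LR is the full base rate, at 0.5 it is half, and at 1 it reaches zero.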
Initialization
OrthoInit
Orthogonal weight initialization for all linear layers except zero-init ones.
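A sketch of orthogonal initialization via QR of a Gaussian matrix. This is the usual construction (e.g. what `torch.nn.init.orthogonal_` does), but the PR does not state the exact method, so treat it as an assumption.

```python
import numpy as np

def ortho_init(rows, cols, seed=0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q
    (assumes rows >= cols so the columns are orthonormal)."""
    a = np.random.default_rng(seed).standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    # fix column signs so the result is uniform over orthogonal matrices
    return q * np.sign(np.diag(r))

W = ortho_init(64, 64)  # zero-init layers would be skipped, per the PR
```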
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
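One common way to do sliding-window evaluation with a stride, sketched below. Only stride=64 comes from the PR; the window size and the score-each-token-once convention are assumptions.

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    """(ctx_start, ctx_end, score_start) spans: the model conditions on
    tokens [ctx_start, ctx_end) but only [score_start, ctx_end) is scored,
    so every token contributes to bpb exactly once with up to
    window - stride tokens of context. Assumes stride <= window.
    stride=64 matches the PR; window=256 is an illustrative assumption."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(1000)
```

A smaller stride gives each scored token more context at the cost of more forward passes.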
Other
other
RoPE base set to 50k via environment variable.
parameters: {"rope_base":50000}
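The effect of raising the RoPE base, sketched with the standard inverse-frequency formula. base=50000 comes from the PR; head_dim=64 is an illustrative assumption.

```python
# Standard RoPE inverse frequencies: inv_freq[i] = base ** (-2*i / head_dim).
head_dim = 64  # illustrative assumption; the PR does not state head_dim

def rope_inv_freq(base, head_dim):
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

inv_freq = rope_inv_freq(50_000, head_dim)      # PR setting
inv_freq_10k = rope_inv_freq(10_000, head_dim)  # common default
# a larger base lowers the high-index frequencies, stretching the longest
# rotation wavelengths and slowing positional aliasing at long range
```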
other
Manual KV head repeat used for GQA compatibility instead of the enable_gqa flag.
parameters: null
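A numpy sketch of what a manual KV head repeat typically looks like: grouped-query attention stores fewer K/V heads than query heads, and repeating them lets a plain (non-GQA) attention kernel run unchanged, without the `enable_gqa` flag of PyTorch's `scaled_dot_product_attention`. The shapes are illustrative assumptions; the PR states only the technique.

```python
import numpy as np

def repeat_kv(x, n_rep):
    """Expand (batch, n_kv_heads, seq, head_dim) to n_kv_heads * n_rep heads
    by repeating each KV head consecutively, so every query head sees
    its matching K/V head."""
    return x if n_rep == 1 else np.repeat(x, n_rep, axis=1)

k = np.zeros((2, 4, 16, 64))  # 4 KV heads (illustrative shape)
k_full = repeat_kv(k, 4)      # matched to 16 query heads
```

`np.repeat` keeps each KV head's copies adjacent (head 0, 0, 0, 0, head 1, ...), which is the grouping GQA expects, rather than tiling the whole head axis.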

Novel Contributions

  • Switched from int8 to int6 quantization to free artifact budget
  • Used freed space to widen the MLP from 2x to 3x
  • Replaced linear LR decay with cosine warmdown
  • Applied orthogonal initialization to linear layers
  • Used zstd level 22 for artifact compression
  • Set RoPE base to 50k
  • Implemented a GQA compatibility fix via manual KV head repeat