PR #215 (open)

Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal

by JayCheng113
val_bpb: 1.1548
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7MB

Training Techniques

Architecture
low-rank Q
Factorized the Q projection into down/up matrices with rank 192 to reduce parameters and improve compressibility.
parameters: {"rank":192}
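A minimal numpy sketch of the factorization; rank=192 is from this PR, while d_model=768 is an assumed model width for illustration:

```python
import numpy as np

d_model, rank = 768, 192      # d_model assumed; rank=192 from this PR

# Full Q projection: one d_model x d_model matrix.
full_params = d_model * d_model
# Factorized Q: down-project to rank, then up-project back.
lowrank_params = d_model * rank + rank * d_model

rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
W_up = rng.normal(size=(rank, d_model)) / np.sqrt(rank)

x = rng.normal(size=(4, d_model))     # four token vectors
q = (x @ W_down) @ W_up               # Q output has rank at most 192
assert q.shape == (4, d_model)
assert lowrank_params * 2 == full_params   # exactly half the parameters here
```

At these sizes the factorization halves the Q parameter count (294,912 vs 589,824), which also shrinks the compressed artifact.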
depth
Used 11 transformer layers with encoder-decoder skip connections (5 encoder + 6 decoder).
parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
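The 5+6 skip layout can be sketched as a U-Net-style stack; the exact encoder-to-decoder pairing is an assumption, and identity blocks stand in for full transformer layers:

```python
import numpy as np

def layer(x):
    # Stand-in for a full transformer block; identity keeps the sketch short.
    return x

x = np.random.randn(2, 16)
skips = []
for _ in range(5):                # encoder half: store each layer's output
    x = layer(x)
    skips.append(x)
for _ in range(6):                # decoder half: add stored activations back
    if skips:
        x = x + skips.pop()       # last-in-first-out pairing (an assumption)
    x = layer(x)
assert x.shape == (2, 16)
assert not skips                  # all five skip activations consumed
```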
tied embeddings
Used tied input/output embeddings.
parameters: null
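Tying means the output head reuses the input embedding matrix, saving one vocab x d_model matrix from the artifact; a toy-sized sketch (sizes are illustrative, not from the PR):

```python
import numpy as np

vocab, d_model = 1000, 64         # toy sizes for illustration only
embed = np.random.randn(vocab, d_model) * 0.02

def tok_to_vec(token_ids):
    return embed[token_ids]       # input embedding lookup

def vec_to_logits(hidden):
    return hidden @ embed.T       # output head reuses the same matrix

logits = vec_to_logits(tok_to_vec(np.array([1, 2, 3])))
assert logits.shape == (3, vocab)
```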
KV head count
Used grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
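In grouped-query attention several query heads share one KV head; only kv_heads=4 is from the PR, the other dimensions below are assumptions:

```python
import numpy as np

n_heads, n_kv_heads, head_dim, seq = 8, 4, 64, 16  # only kv_heads=4 is from the PR
group = n_heads // n_kv_heads     # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))
v = rng.normal(size=(n_kv_heads, seq, head_dim))

# Broadcast each KV head across its group of query heads.
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
assert scores.shape == (n_heads, seq, seq)
assert k.size == q.size // group  # K/V parameters and cache shrink by `group`
```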
MLP3x
Used a 3x MLP width with relu-squared activation.
parameters: {"mlp_mult":3}
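A sketch of the MLP block with the relu-squared activation; mlp_mult=3 is from the PR, d_model=768 is assumed:

```python
import numpy as np

d_model, mlp_mult = 768, 3        # d_model assumed; mlp_mult=3 from the PR
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.02, (d_model, mlp_mult * d_model))
W_out = rng.normal(0, 0.02, (mlp_mult * d_model, d_model))

def mlp(x):
    h = np.maximum(x @ W_in, 0.0) ** 2    # relu-squared activation
    return h @ W_out

y = mlp(rng.normal(size=(2, d_model)))
assert y.shape == (2, d_model)
```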
RoPE
Applied rotary positional embeddings.
parameters: {"base":10000}
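A minimal RoPE sketch with base 10000 as listed; the half-split pairing of dimensions below is one common convention and an assumption about this implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to (seq, head_dim) activations."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation rates
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(16, 64)
out = rope(x)
assert out.shape == (16, 64)
assert np.allclose(out[0], x[0])   # position 0 rotates by angle 0: unchanged
```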
Quantization
int6
bits: 6
scope: MLP and attention weights
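The novel-contributions list below describes the scheme as int6 per-row; a sketch assuming symmetric per-row scaling (the symmetric range is an assumption):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row quantization to the 6-bit range [-31, 31].
    Symmetric scaling is an assumption about the PR's exact scheme."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = q * scale                          # dequantize
assert q.min() >= -31 and q.max() <= 31
assert np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6   # rounding error bound
```

The int6 codes have low entropy per byte, which is what the zstd-22 pass below exploits.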
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
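One way to realize stride-64 sliding-window scoring: each forward pass sees a full window of context but only the not-yet-scored suffix contributes to the loss. A sketch of the chunk plan (the exact chunking is an assumption; window=1024 matches the training length above):

```python
def sliding_window_chunks(n_tokens, window=1024, stride=64):
    """Plan eval passes: each pass covers tokens [start, end) as context and
    scores only [pos, end), so after the first pass every token is scored
    with up to window - stride tokens of left context."""
    chunks, pos, start = [], 0, 0
    while pos < n_tokens:
        end = min(start + window, n_tokens)
        chunks.append((start, pos, end))
        pos = end
        start += stride
    return chunks

plan = sliding_window_chunks(200, window=64, stride=16)
covered = [i for _, p, e in plan for i in range(p, e)]
assert covered == list(range(200))   # every token scored exactly once
```

A small stride buys long context for every scored token at the cost of many more forward passes.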
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmdown_iters":3000}
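Muon's core step orthogonalizes each matrix gradient with a Newton-Schulz iteration before the momentum update. A sketch of that step, using the quintic coefficients from the public Muon reference implementation (not necessarily this submission's exact code):

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)    # normalize so the iteration converges
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if tall else x

rng = np.random.default_rng(0)
o = newton_schulz(rng.normal(size=(8, 16)))
s = np.linalg.svd(o, compute_uv=False)
assert np.all(s > 0.1) and np.all(s < 2.0)   # singular values pushed toward 1
```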
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
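The warmdown schedule holds the learning rate flat, then decays it linearly to zero over the final 3000 steps; total_steps and the base rate below are illustrative (matrix_lr=0.02 is from the optimizer settings above):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant LR, then linear decay ("warmdown") to 0 over the last steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

assert lr_at(0, 10000, 0.02) == 0.02                  # flat phase
assert abs(lr_at(8500, 10000, 0.02) - 0.01) < 1e-12   # halfway through warmdown
assert lr_at(10000, 10000, 0.02) == 0.0               # decayed to zero
```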
Regularization
weight decay
parameters: {"weight_decay":0.038}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Initialization
resid mix
Explored Legendre-based initialization for resid_mix parameters, though it did not improve results.
Other
other
Cleared the compile cache before each run to ensure reproducibility and consistent compilation behavior.
parameters: null

Novel Contributions

  • Low-rank Q factorization with rank 192 to exploit the apparent low-rank structure of Q projections.
  • 11-layer encoder-decoder skip-connected Transformer within the 16MB budget.
  • Int6 per-row quantization combined with zstd-22 compression for model weights.
  • Sliding-window evaluation with stride 64 for final scoring.
  • Analysis-driven exploration of alternative ideas such as Legendre resid_mix initialization, content-dependent pre-rotation, and depth-attention residuals.