PR #575

Status: open

Add 10min/16MB record: skinny RLM seq2048 (int8+zlib val_bpb 1.1750)

by k-oconnor
val_bpb: 1.1750
Architecture: Looped Transformer (RLM)
Optimizer: Muon (matrix) + Adam (scalars)
Artifact Size: 14.9MB

Training Techniques

Quantization: STE QAT
  bits: 8
  scope: embeddings
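A sketch of the forward half of the 8-bit STE fake-quantization applied to the embeddings. The helper below is illustrative (pure-Python, per-tensor symmetric scale, not the PR's actual code); during QAT the round() would be bypassed in the backward pass (straight-through estimator), so gradients flow to the underlying float weights.

```python
def fake_quant_int8(ws):
    """Symmetric 8-bit fake quantization for a flat list of floats:
    round onto the int8 grid [-127, 127], then dequantize back to
    float. In STE QAT the round() is treated as identity on the
    backward pass so the float master weights keep receiving
    gradients while the forward pass sees quantized values."""
    scale = max(abs(w) for w in ws) / 127.0
    if scale == 0.0:
        return list(ws)
    return [max(-127, min(127, round(w / scale))) * scale for w in ws]
```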
Architecture: depth recurrence, weight tying, tied embeddings, RoPE, ReLU² MLP 3×, GQA
  Looped Transformer with prefix and suffix comprising 6 distinct blocks; the middle uses 2 weight-tied blocks applied 3 times. Attention is GQA with RoPE positional embeddings; the MLP uses ReLU² activation with 3× expansion; input and output embeddings are tied.
  parameters: {"layers":6,"loop_blocks":2,"loop_iters":3,"embed_dim":512,"num_heads":8,"num_kv_heads":8,"mlp_expansion":3}
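A minimal sketch of the depth-recurrent forward pass described above: distinct prefix and suffix blocks wrap a middle stage where 2 weight-tied blocks are applied loop_iters=3 times. Block internals (GQA attention, RoPE, ReLU² MLP) are abstracted as callables, and how the distinct blocks divide between prefix and suffix is an implementation detail not specified here.

```python
def looped_forward(x, prefix, loop_blocks, suffix, loop_iters=3):
    """Looped-transformer forward pass. `prefix`, `loop_blocks`, and
    `suffix` are lists of block callables; the loop_blocks list is
    re-applied loop_iters times, so the same 2 parameter sets are
    reused for 6 effective middle layers (weight tying via reuse)."""
    for block in prefix:
        x = block(x)
    for _ in range(loop_iters):      # same blocks, same weights, each pass
        for block in loop_blocks:
            x = block(x)
    for block in suffix:
        x = block(x)
    return x
```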
Optimizer: Muon + Adam
  weight_decay: 0.04
  momentum: null
  other_params: {"MATRIX_LR":0.02,"SCALAR_LR":0.02,"TIED_EMBED_LR":0.05}
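The optimizer split implied above (Muon for matrix parameters, Adam for scalars, tied embeddings on a higher LR) can be sketched as a routing rule. The `route_param` helper and the name-based embedding check are hypothetical; only the three learning-rate constants come from other_params.

```python
# Assumed values, taken from other_params in this PR.
MATRIX_LR, SCALAR_LR, TIED_EMBED_LR = 0.02, 0.02, 0.05

def route_param(name, shape):
    """Decide which optimizer and LR a parameter gets.
    - tied embedding table -> Adam at the higher TIED_EMBED_LR
    - 2D+ weight matrices  -> Muon at MATRIX_LR
    - scalars/vectors (gains, biases) -> Adam at SCALAR_LR
    The "embed" substring check is an illustrative convention, not
    the PR's actual parameter-naming scheme."""
    if "embed" in name:
        return ("adam", TIED_EMBED_LR)
    if len(shape) >= 2:
        return ("muon", MATRIX_LR)
    return ("adam", SCALAR_LR)
```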
Regularization: weight decay
  parameters: {"MUON_WD":0.04,"decoupled":true,"purpose":"compression headroom"}
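A one-line sketch of what "decoupled" means here (AdamW/Muon-style): the decay shrinks the weights directly rather than being folded into the gradient as classic L2 regularization, keeping weight magnitudes small and leaving headroom for int8 quantization and zlib compression. The function name and the bare `update` argument are illustrative.

```python
def decoupled_wd_step(w, update, lr, wd=0.04):
    """One optimizer step with decoupled weight decay: the decay
    term lr*wd*w is subtracted from the weight directly, separately
    from the optimizer's gradient-derived update."""
    return w * (1.0 - lr * wd) - lr * update
```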
Sequence Length
  train_length: 2048
  eval_length: null
LR Schedule: warmdown
  parameters: {"warmup_steps":200,"warmdown_iters":3000}
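The schedule parameters above imply a trapezoidal LR multiplier; a sketch, assuming a `total_steps` horizon that the PR does not state (only warmup_steps and warmdown_iters appear in its parameters):

```python
def lr_mult(step, total_steps, warmup_steps=200, warmdown_iters=3000):
    """LR multiplier: linear warmup over warmup_steps, flat at 1.0
    in the middle, then linear warmdown to zero over the final
    warmdown_iters steps. total_steps is an assumption."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```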
Compression: zlib
  level: null
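A hedged sketch of the "int8 + zlib" artifact step: symmetric int8 quantization of each tensor, raw byte concatenation, then zlib compression. The function and byte layout are illustrative, not the PR's actual packing code; the compression level is unspecified above (level: null), so zlib's default of 6 is assumed.

```python
import zlib

def pack_int8_zlib(weights, level=6):
    """Quantize each tensor (here a flat list of floats) to int8 with
    a per-tensor symmetric scale, concatenate the raw bytes, and
    zlib-compress the result into a single artifact blob."""
    blob = bytearray()
    for ws in weights:
        scale = max(abs(w) for w in ws) / 127.0 or 1.0  # avoid /0 on all-zero tensors
        q = [max(-127, min(127, round(w / scale))) for w in ws]
        blob += bytes(b & 0xFF for b in q)  # two's-complement int8 bytes
    return zlib.compress(bytes(blob), level)
```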

Novel Contributions

  • Looped transformer architecture with 2 weight-tied blocks applied 3 times in the middle layers
  • Skinny RLM: 512-d embedding, 6 layers, trained at sequence length 2048
  • Muon optimizer for matrix parameters combined with Adam for scalar parameters, each group with its own learning rate
  • Embedding quantization via 8-bit STE fake-quant
  • Final int8+zlib compressed model artifact under 16MB at val_bpb ≈ 1.175
  • ReLU² MLP with 3× expansion and GQA attention
  • Decoupled weight decay tuned for compression headroom