PR #362

closed

Record: 11L Int6+Zstd MLP3x SmearGate BigramHash OrthoInit MuonWD EMA (mean val_bpb=1.1497)

by mkenney2
val_bpb: 1.1497
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.8MB

Training Techniques

Quantization
int6
bits: 6
scope: all
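The record does not describe the quantization scheme beyond "int6, all weights"; a minimal sketch, assuming symmetric per-tensor quantization (the function names and the choice of the symmetric range [-31, 31], which keeps zero exactly representable, are my assumptions):

```python
# Hypothetical sketch of symmetric per-tensor int6 quantization.
# A signed 6-bit integer spans [-32, 31]; clamping to the symmetric
# range [-31, 31] is an assumed design choice, not from the record.

def quantize_int6(weights):
    """Map a list of floats to 6-bit integer codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from codes and scale."""
    return [x * scale for x in q]
```

Packing the 6-bit codes (e.g. four codes into three bytes) before compression is what would realize the size savings; that step is omitted here.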
Architecture
MLP3x
Uses a 3x MLP expansion with a 1536-dimensional hidden layer.
parameters: {"mlp_multiplier":3,"hidden_dim":1536}
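Given the recorded parameters, the hidden dimension of 1536 implies a model dimension of 512 (3 x 512); a minimal sketch of such a block, assuming a ReLU activation (the actual activation is not recorded):

```python
import numpy as np

# Sketch of an MLP block with a 3x expansion. Model dim 512 is inferred
# from hidden_dim = 1536 and mlp_multiplier = 3 in the record.
D_MODEL, MLP_MULT = 512, 3
HIDDEN = D_MODEL * MLP_MULT  # 1536

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) * 0.02
w_out = rng.standard_normal((HIDDEN, D_MODEL)) * 0.02

def mlp3x(x):
    """x: (seq, d_model) -> (seq, d_model) through a 3x-wide hidden layer."""
    h = np.maximum(x @ w_in, 0.0)  # ReLU here; the real activation is unknown
    return h @ w_out
```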
SmearGate
Learned per-dimension gate blending each token with its predecessor.
parameters: null
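The description suggests something like the following sketch: a learned per-dimension gate (sigmoid of a parameter vector) mixes each token's activation with the previous token's. The parameter shape, the sigmoid, and the zero-padding of the first token are all assumptions:

```python
import numpy as np

# Hypothetical sketch of a "smear gate": per-dimension gate g in (0, 1)
# blends each token with its predecessor.

def smear_gate(x, gate_logits):
    """x: (seq, dim); gate_logits: (dim,) learned parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid, per dimension
    # Predecessor of the first token is taken as zero (an assumption).
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return (1.0 - g) * x + g * prev
```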
BigramHash
Adds a 4096-bucket hash embedding for bigram context.
parameters: {"buckets":4096}
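A sketch of how a 4096-bucket bigram hash feature could be computed; the bucket index would then select a row of a learned (4096, d_model) embedding table added to the token embedding. The hash mixing constants and the BOS padding are my choices, not from the record:

```python
# Hypothetical sketch of a hashed bigram feature with 4096 buckets.
BUCKETS = 4096

def bigram_bucket(prev_tok, tok):
    """Hash a (previous token, token) pair into a bucket index."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # arbitrary mixing constant
    h ^= h >> 13
    return h % BUCKETS

def bigram_buckets(tokens, bos=0):
    """Bucket index per position, padding the first bigram with a BOS id."""
    prev = [bos] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```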
tied embeddings
Input and output embeddings are tied, with FP16 embeddings to avoid quantization degradation.
parameters: null
KV head count
Uses fewer KV heads than attention heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
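With 8 query heads and 4 KV heads, each KV head is shared by 8 / 4 = 2 query heads. A minimal sketch of the head expansion at attention time:

```python
import numpy as np

# Grouped-query attention head mapping: 8 query heads share 4 KV heads,
# so each KV head serves 8 // 4 = 2 query heads.
NUM_HEADS, NUM_KV_HEADS = 8, 4
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2

def expand_kv(kv):
    """kv: (num_kv_heads, seq, head_dim) -> (num_heads, seq, head_dim)."""
    return np.repeat(kv, GROUP, axis=0)
```

Storing only 4 KV heads roughly halves the KV-projection parameters and KV cache relative to full multi-head attention.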
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
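The standard EMA update with the recorded decay of 0.997, sketched on flat parameter lists (the real implementation would update tensors in place):

```python
# Exponential moving average of weights with the recorded decay.
DECAY = 0.997

def ema_update(ema_params, params, decay=DECAY):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```

At eval time the EMA copy, not the raw training weights, would be evaluated and exported.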
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256}
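A sketch of sliding-window evaluation with stride 256: each token is scored exactly once, inside a context window that extends up to 2048 tokens back. How the record handles the window boundaries is not stated, so the span layout below is an assumption:

```python
# Hypothetical sketch of sliding-window evaluation spans.

def sliding_windows(n_tokens, window=2048, stride=256):
    """Yield (ctx_start, ctx_end, score_start) spans: score tokens in
    [score_start, ctx_end) given context [ctx_start, ctx_end)."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_end, score_start))
        score_start = score_end
    return spans
```

Compared with scoring disjoint 2048-token chunks, this gives most scored tokens far more left context at the cost of roughly window / stride = 8x more forward-pass compute.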
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
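A "warmdown" schedule typically holds the learning rate constant and then decays it linearly to zero over the final iterations; a sketch with the recorded 1200-iteration warmdown (the total iteration count is not in the record, so it is a parameter here):

```python
# Constant LR, then linear decay to zero over the last warmdown_iters steps.

def lr_at(it, total_iters, base_lr, warmdown_iters=1200):
    """Learning rate at iteration `it` of `total_iters`."""
    remaining = total_iters - it
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```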
Regularization
weight decay
parameters: {"weight_decay":0.02}
Initialization
OrthoInit
Orthogonal weight initialization with projection scaling.
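Orthogonal initialization usually takes the Q factor of a QR decomposition of a Gaussian matrix; a sketch, where the 1/sqrt(fan_in) projection scaling is my assumption since the record does not give the factor:

```python
import numpy as np

# Sketch of orthogonal init with projection scaling (scale factor assumed).

def ortho_init(fan_in, fan_out, rng=None):
    """Return a (fan_in, fan_out) matrix with orthonormal, scaled columns."""
    assert fan_in >= fan_out, "sketch assumes tall matrices"
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((fan_in, fan_out))
    q, r = np.linalg.qr(a)          # reduced QR: q is (fan_in, fan_out)
    q *= np.sign(np.diag(r))        # fix column signs for a uniform draw
    return q * (1.0 / np.sqrt(fan_in))
```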

Novel Contributions

  • 11-layer Transformer with 3x MLP expansion
  • Int6 quantization combined with zstd-22 compression to fit a larger model under the artifact limit
  • SmearGate token-to-predecessor blending mechanism
  • BigramHash 4096-bucket hash embedding for bigram context
  • OrthoInit orthogonal initialization
  • Muon optimizer with weight decay 0.02
  • EMA with decay 0.997
  • FP16 tied embeddings
  • Sliding-window evaluation with stride 256
  • Extensive ablation of AttnRes, depth recurrence, sequence-length curriculum, and TTT