PR #466

Status: open

Record: 11L EMA + BigramHash(12288) + Mixed Int5 + FA3 (1.1354)

by simonbissonnette
val_bpb
1.1354
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,967,704 bytes

Training Techniques

Architecture
BigramHash
Adds a bigram hashing component to the model.
parameters: {"buckets":12288,"dim":128}
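The BigramHash entry above maps each (previous token, current token) pair into one of 12288 buckets and adds a 128-dimensional learned embedding to the hidden stream. A minimal sketch follows; the bucket count and dimension come from the listed parameters, but the specific hash-mixing function is an assumption:

```python
import numpy as np

BUCKETS, DIM = 12288, 128  # from parameters {"buckets":12288,"dim":128}

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_hash(prev_ids, cur_ids):
    # Mix the two token ids into a bucket index; the exact mixing
    # constant is an assumption -- any cheap pair hash works here.
    return (prev_ids * 1000003 + cur_ids) % BUCKETS

def bigram_features(token_ids):
    # token_ids: (seq,) int array; position 0 has no predecessor, pad with 0.
    prev = np.concatenate([[0], token_ids[:-1]])
    idx = bigram_hash(prev, token_ids)
    return bigram_table[idx]  # (seq, DIM), added to the token embeddings

feats = bigram_features(np.array([5, 17, 17, 42]))
```

Collisions are tolerated by design: with 12288 buckets, frequent bigrams dominate their bucket's gradient signal.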
tied embeddings
Uses tied input and output embeddings.
parameters: null
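Tied embeddings use a single matrix both to look up input token vectors and, transposed, to project hidden states back to vocabulary logits, halving embedding-related parameter count. A sketch with illustrative sizes (the vocabulary size here is an assumption):

```python
import numpy as np

VOCAB, DIM = 1000, 512  # VOCAB is illustrative; DIM matches the 512-d model

rng = np.random.default_rng(0)
W = rng.standard_normal((VOCAB, DIM)).astype(np.float32) * 0.02  # one shared matrix

def embed(token_ids):
    return W[token_ids]      # input embedding: row lookup

def logits(hidden):
    return hidden @ W.T      # output head: same matrix, transposed

h = embed(np.array([1, 2, 3]))
out = logits(h)
```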
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
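With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, shrinking the KV cache. A sketch of the score computation, assuming a head dimension of 64 (not stated in the card):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64  # heads/kv_heads from parameters; HEAD_DIM assumed
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def gqa_scores(q, k):
    # q: (HEADS, seq, HEAD_DIM); k: (KV_HEADS, seq, HEAD_DIM)
    k_rep = np.repeat(k, GROUP, axis=0)  # expand KV heads to match query heads
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)

rng = np.random.default_rng(0)
q = rng.standard_normal((HEADS, 16, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, 16, HEAD_DIM))
scores = gqa_scores(q, k)
```

In a real kernel the KV heads are not materialized per query head; the repeat above is only for clarity.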
MLP3x
Uses an MLP multiplier of 3.0.
parameters: {"multiplier":3}
Weight Averaging
EMA
Maintains an exponential moving average of the model weights for evaluation.
parameters: {"decay":0.997}
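The EMA keeps a shadow copy of the weights updated as `ema ← decay·ema + (1−decay)·weights` after each step, and the averaged copy is used at evaluation time. A minimal sketch using the listed decay:

```python
DECAY = 0.997  # from parameters {"decay":0.997}

def ema_update(ema_params, params, decay=DECAY):
    # Shadow copy: ema <- decay * ema + (1 - decay) * current weights.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: track a single scalar "weight" over three steps (values 1, 2, 3).
ema = [0.0]
for step in range(1, 4):
    ema = ema_update(ema, [float(step)])
```

With decay 0.997 the average has an effective horizon of roughly 1/(1−0.997) ≈ 333 steps.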
Quantization
mixed int5
bits: 5
scope: mixed low-bit quantization
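The 5-bit quantization can be illustrated with symmetric per-tensor rounding into the signed int5 range [−16, 15]; which tensors get 5 bits under the "mixed" scope is not specified here, and the per-tensor scale scheme is an assumption:

```python
import numpy as np

BITS = 5
QMAX = 2 ** (BITS - 1) - 1  # 15 for signed int5

def quantize_int5(w):
    # Symmetric per-tensor quantization; stored in int8, values fit in 5 bits.
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)  # round-trip error bounded by scale / 2
```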
Evaluation
stride-based sliding window eval
parameters: {"stride":64}
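Stride-based sliding-window evaluation advances the context window 64 tokens at a time and scores only the new tokens in each step, so every token is evaluated exactly once with ample left context. A sketch of the window bookkeeping, assuming the window size matches the 2048-token train length (the scoring call itself is omitted):

```python
WINDOW, STRIDE = 2048, 64  # stride from parameters {"stride":64}; window assumed

def eval_windows(seq_len, window=WINDOW, stride=STRIDE):
    spans = []
    pos = 0
    while pos < seq_len:
        begin = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        # tokens [pos, end) are scored using context [begin, end)
        spans.append((begin, pos, end))
        pos = end
    return spans

spans = eval_windows(5000)
```

Smaller strides cost more forward passes but give each scored token a longer effective context.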
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025}
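Muon's core idea is to accumulate a momentum buffer per weight matrix and orthogonalize the update with a few Newton-Schulz iterations before applying it. The sketch below follows the public Muon reference (its quintic coefficients included); the exact form of weight decay used in this submission is an assumption:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration driving singular values toward 1;
    # coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    if G.shape[0] > G.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(param, grad, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    buf = momentum * buf + grad                # momentum accumulation
    update = newton_schulz(buf)                # orthogonalized direction
    param = param * (1.0 - lr * weight_decay)  # decoupled decay (assumed form)
    return param - lr * update, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
g = rng.standard_normal((32, 64))
W2, buf = muon_step(W, g, np.zeros_like(W))
```

Muon applies only to 2-D weight matrices; scalars and embeddings are handled by a separate optimizer path, which matches the distinct `matrix_lr`/`scalar_lr` entries above.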
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.48}
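The warmdown schedule reads as: linear warmup over 20 steps, a constant plateau, then a linear decay over the final 48% of training. A sketch with an illustrative total step count (the plateau-then-linear-decay shape is the usual warmdown convention, assumed here):

```python
WARMUP_STEPS = 20     # from parameters {"warmup_steps":20}
WARMDOWN_FRAC = 0.48  # from parameters {"warmdown_frac":0.48}

def lr_scale(step, total_steps):
    warmdown_start = int(total_steps * (1.0 - WARMDOWN_FRAC))
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS           # linear warmup
    if step < warmdown_start:
        return 1.0                                  # constant plateau
    # linear decay to 0 over the warmdown window
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))

scales = [lr_scale(s, 1000) for s in range(1000)]  # total_steps illustrative
```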
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Uses FlashAttention-3 via kernels-community/flash-attn3, which fetches the FA3 kernel package at runtime.
parameters: null

Novel Contributions

  • 11-layer, 512-dimensional GQA Transformer submission
  • BigramHash with 12288 buckets and 128-dimensional embeddings
  • EMA with decay 0.997
  • Mixed low-bit quantization, with 5-bit quantization applied to the attention and bigram weights
  • Stride-64 sliding evaluation
  • FlashAttention-3 runtime path via kernels-community/flash-attn3