PR #583 (open)
Record: 10L Int5-MLP3x BigramHash4096 SlidingEval — mean val_bpb 1.1489
by suchihype
val_bpb
1.1489
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
under 16MB
Training Techniques
Quantization
full-run Int6 QAT with STE
bits: 6
scope: all except MLP and embeddings
Int5 quantization
bits: 5
scope: MLP
FP16
bits: 16
scope: embeddings
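For reference, a minimal sketch of what symmetric fake-quantization with a straight-through estimator (STE) could look like. The per-tensor max scale, the clamping, and all names here are assumptions for illustration, not the PR's actual quantizer:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric per-tensor fake quantization with a straight-through estimator:
    forward quantizes-dequantizes to `bits`; backward passes gradients through
    unchanged, so the model trains against its quantized weights."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1                     # 31 for Int6, 15 for Int5
        scale = w.abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        return q * scale                               # dequantized weights

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                          # STE: identity gradient

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Export must reuse exactly this round/clamp/scale, or the deployed int
    # weights drift from what the model saw in training (see contributions).
    return FakeQuantSTE.apply(w, bits)
```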
Architecture
MLP3x
MLP multiplier 3× with hidden dimension 1536 and ReLU² activation
parameters: {"multiplier":3,"hidden_dim":1536}
BigramHash
BigramHash embedding with vocab size 4096 and dimension 128
parameters: {"vocab":4096,"dim":128}
SmearGate
Per-dimension learned gate blending current and previous token
parameters: null
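A sketch of one way such a gate could be parameterized, assuming a sigmoid gate initialized to favor the current token; the PR's actual formulation may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blends each token's vector with the previous token's via a learned
    per-dimension gate g in (0, 1): out = (1 - g) * x_t + g * x_{t-1}."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.full((dim,), -2.0))  # sigmoid ~= 0.12

    def forward(self, x):                              # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                               # nothing before first token
        g = torch.sigmoid(self.gate_logit)
        return (1.0 - g) * x + g * prev
```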
RoPE
Rotary positional embeddings with base 50000, applied across the full head dimension
parameters: {"base":50000,"partial":false}
Optimizer
Muon + AdamW
weight_decay: 0.045
momentum: 0.99
other_params:
  learning_rates: {"matrix":0.035,"tied_embed":0.045,"scalar":0.035}
  momentum_warmup_start: 0.92
  momentum_warmup_steps: 1500
  grad_clip_norm: 0.35
  warmdown_iters: 2000
  warmup_steps: 20
  batch_tokens: 786432
  sequence_length: 2048
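One plausible reading of the momentum_warmup_* fields: Muon's momentum is warmed from 0.92 to 0.99 over the first 1500 steps, then held. The linear shape and function name below are assumptions:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Linearly warm Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold it at `end` for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```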
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
warmdown
parameters: {"warmdown_iters":2000,"warmup_steps":20}
Novel Contributions
- The Int6 STE fake-quantization used during training must match the export quantization exactly to avoid bpb degradation
- EMA and SWA weight averaging hurt models trained with full-run QAT
- Higher learning rates (0.035 matrix / 0.045 tied embeddings) are optimal at this short step count (~5500 steps)
- Int5 quantization of the MLP frees enough of the 16MB cap to afford a larger MLP multiplier (3×)
- Sliding window evaluation with stride 64 improves val_bpb by ~0.023
- An Optuna TPE sweep found a better training schedule than hand-tuning