PR #286 (open)
Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)
by chris-buckley
val_bpb: 1.1628
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,481,841 bytes
Training Techniques
Quantization
mixed int5/int6
scope: MLP int5, attention int6
QAT
scope: final phase only
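A minimal sketch of what the mixed-precision export could look like: symmetric per-tensor fake quantization onto a signed 5- or 6-bit grid. The function name and the per-tensor scaling scheme are assumptions, not details from the PR.

```python
def fake_quantize(weights, bits):
    """Quantize-dequantize a weight list on a signed `bits`-bit grid
    (int5 for MLP weights, int6 for attention, per the summary above).
    Symmetric per-tensor scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1                # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # Round to the integer grid, clamp to the signed range, map back to float.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q]
```

At export time only the integer codes and the scale would be stored, which is what lets a 10-layer model fit the size cap.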
Architecture
SmearGate
gated residual smearing for cheap inter-token mixing
parameters: null
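No parameters are recorded for SmearGate, so one plausible reading of "gated residual smearing" is adding a gated fraction of the previous token's residual stream to the current position. The sketch below uses a fixed gate coefficient purely for illustration; the actual gate form is not stated in the PR.

```python
def smear(xs, gate=0.5):
    """Assumed SmearGate form: mix a `gate`-weighted copy of the previous
    token's hidden vector into each position (cheap inter-token mixing).
    The gate value 0.5 is illustrative only; the first token has no
    predecessor and passes through unchanged."""
    out = [list(xs[0])]
    for t in range(1, len(xs)):
        out.append([c + gate * p for c, p in zip(xs[t], xs[t - 1])])
    return out
```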
BigramHash
4096-bucket bigram embedding for token-pair context without a full bigram table
parameters: {"vocab_size":4096,"dimension":128}
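The listed parameters (4096 buckets, dimension 128) suggest a scheme along these lines: hash each (previous, current) token pair into a bucket and add that bucket's 128-dim embedding to the stream. The hash function and its multiplier below are assumptions, not taken from the PR.

```python
NUM_BUCKETS, DIM = 4096, 128              # from the listed parameters

def bigram_bucket(prev_tok, cur_tok):
    """Hash a (previous, current) token-pair into one of NUM_BUCKETS
    buckets. The odd multiplier is an arbitrary illustrative choice."""
    return (prev_tok * 1000003 + cur_tok) % NUM_BUCKETS

# Usage: index a (NUM_BUCKETS, DIM) embedding table with the bucket id
# and add the looked-up vector to the current token's embedding --
# token-pair context without a full vocab^2 bigram table.
```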
Initialization
Orthogonal init
orthogonal initialization with muP-style output projection scaling for stable deep training
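A rough illustration of the init, assuming "muP-style output projection scaling" means dividing the output projection by the layer width (one common muP choice). The Gram-Schmidt routine stands in for a proper library orthogonal initializer.

```python
import random

def orthonormal_rows(n, m, seed=0):
    """n orthonormal rows of length m (n <= m) via Gram-Schmidt; a
    stand-in for a library orthogonal initializer."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(m)]
        for u in rows:                        # remove components along earlier rows
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows

def init_out_proj(n_out, width, seed=0):
    """Orthogonal init with the output projection scaled by 1/width --
    the muP-style scaling assumed here for stable deep training."""
    return [[a / width for a in row]
            for row in orthonormal_rows(n_out, width, seed)]
```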
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"decoupled_weight_decay":true}
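`decoupled_weight_decay: true` means the decay acts directly on the weights, AdamW-style, rather than being folded into the gradient. A sketch, with Muon's orthogonalized momentum step abstracted into `update`:

```python
def step_with_decoupled_wd(w, update, lr, wd=0.04):
    """Apply an optimizer step with decoupled weight decay: weights
    shrink by lr * wd independently of the update direction. `update`
    stands in for Muon's orthogonalized momentum step (not shown)."""
    return [(1.0 - lr * wd) * wi - lr * ui for wi, ui in zip(w, update)]
```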
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every_steps":50,"num_checkpoints":15}
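The listed SWA parameters map onto a simple running average of checkpoints: start at 50% of training, snapshot every 50 steps, stop after 15 snapshots. A minimal sketch (weights flattened to a list for brevity):

```python
class SWA:
    """Running checkpoint average matching the listed parameters."""
    def __init__(self, total_steps, start_frac=0.5, every_steps=50,
                 num_checkpoints=15):
        self.start = int(total_steps * start_frac)
        self.every = every_steps
        self.cap = num_checkpoints
        self.count = 0
        self.avg = None

    def maybe_update(self, step, weights):
        """Fold `weights` into the running mean when the schedule says so."""
        if step < self.start or step % self.every != 0 or self.count >= self.cap:
            return
        if self.avg is None:
            self.avg = list(weights)
        else:                                 # incremental mean update
            self.avg = [a + (w - a) / (self.count + 1)
                        for a, w in zip(self.avg, weights)]
        self.count += 1
```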
Evaluation
sliding window eval
parameters: {"stride":64,"full_tail_handling":true}
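With stride 64, each eval window typically scores only its newest tokens while the rest serves as context; the span layout alone might look like the sketch below, where the final pinned window is an assumed reading of `full_tail_handling`.

```python
def window_spans(n_tokens, window=2048, stride=64):
    """Sliding-window eval layout: windows advance by `stride`, and a
    final window is pinned to the sequence end so the tail is fully
    covered (the assumed meaning of full_tail_handling)."""
    spans, start = [], 0
    while start + window < n_tokens:
        spans.append((start, start + window))
        start += stride
    spans.append((max(0, n_tokens - window), n_tokens))
    return spans
```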
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.01}
Other
other
late QAT beginning at 85% of training wallclock, avoiding the instability of always-on STE while closing most of the quantization gap
parameters: {"start_frac":0.85}
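The schedule itself is a one-line gate. The PR measures the threshold in wallclock fraction; the sketch below uses steps as a stand-in for simplicity.

```python
def qat_active(step, total_steps, start_frac=0.85):
    """Late QAT: enable fake quantization (and its straight-through
    estimator) only for the final (1 - start_frac) of training, rather
    than running STE from step zero."""
    return step >= int(total_steps * start_frac)
```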
Novel Contributions
- Mixed-precision int5 MLP / int6 attention export to fit a 10-layer model under the 16 MB cap
- SmearGate for cheap inter-token mixing without learned parameters
- BigramHash 4096-bucket bigram embedding for token-pair context
- Late QAT starting at 85% wallclock instead of always-on STE
- Orthogonal initialization with muP-style output projection scaling
- Decoupled Muon weight decay and SWA during warmdown
- Sliding-window evaluation with stride 64 and full-tail handling