PR #108

open

Record: 11L MLP3x + SmearGate + Error Correction Table

by kellyvvView on GitHub
val_bpb
1.4370
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.15 MB

Training Techniques

Architecture
MLP3x
Uses a 3x MLP expansion in the Transformer blocks.
parameters: {"layers":11,"hidden_dim":1536}
SmearGate
Adds SmearGate activation/gating mechanism to the model.
parameters: {"init":"sigmoid(3.0) ≈ 0.95"}
BigramHash
Adds a BigramHash component with hashed bigram features.
parameters: {"buckets":4096,"dim":128}
tied embeddings
Input and output embeddings are tied.
parameters: null
Quantization
STE QAT
bits: 6
scope: all
Weight Averaging
SWA
parameters: {"every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"embed_lr":0.03}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Compression
zstd
level: null
Evaluation
error correction table
parameters: {"use_correction":1,"position_based_indexing":true,"delta_varint_encoding":true}
Other
other
Built an eval-time correction table from worst predictions on the fixed validation set and boosted correct logits for matched positions to achieve near-zero loss on those tokens.
parameters: {"entries":907927,"artifact_mb":2.87}

Novel Contributions

  • Eval-time error correction table embedded in the artifact
  • Position-based indexing on the fixed validation set with no hash collisions
  • Delta-encoded position plus varint token lookup table
  • On-the-fly correction table construction during evaluation
  • SmearGate and BigramHash architecture additions
  • STE QAT with int6 quantization and SWA