PR #207 (closed)

Add 2026-03-20 11L dense-lexical submission candidate

by ajkpersonal
val_bpb: 1.1568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15704854 bytes

Training Techniques

Architecture
  • SmearGate: adds a SmearGate component to the dense lexical model (parameters: null)
  • BigramHash: uses a bigram hash feature module for lexical modeling (dimensions: 4096, embedding_dim: 128)
  • MLP3x: uses a 3x MLP expansion in the model (parameters: null)
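As a rough illustration of the BigramHash idea above, a hashed bigram table maps each (prev, cur) token-id pair into one of 4096 buckets, each of which would index a learned 128-dim embedding row in the model. A minimal sketch of the bucketing; the hash constant and the sentinel for the first position are assumptions, not taken from the submission:

```python
def bigram_bucket(prev_id: int, cur_id: int, dimensions: int = 4096) -> int:
    """Hash a (prev, cur) token-id pair into one of `dimensions` buckets.
    In the real model each bucket would index a 128-dim embedding row."""
    return (prev_id * 1000003 + cur_id) % dimensions


def bigram_buckets(tokens: list[int], dimensions: int = 4096) -> list[int]:
    """Bucket id for every position in a sequence; position 0 has no
    predecessor, so a sentinel prev id of 0 is used (an assumption)."""
    prev_ids = [0] + tokens[:-1]
    return [bigram_bucket(p, c, dimensions) for p, c in zip(prev_ids, tokens)]
```

The hashing makes the feature table size independent of vocabulary size squared, at the cost of bucket collisions between rare bigrams.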
Optimizer
  • Muon (weight_decay: 0.038, momentum: null, other_params: null)
Regularization
  • weight decay (adam_weight_decay: 0.01, muon_weight_decay: 0.038)
Weight Averaging
  • SWA (every: 50, start_frac: 0.5)
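Reading the SWA parameters above as "average a parameter snapshot every 50 steps, starting once training has passed 50% of total steps", a minimal running-mean sketch; the exact snapshot-selection rule is an assumption about how the submission interprets these parameters:

```python
def swa_average(checkpoints: dict[int, list[float]], total_steps: int,
                every: int = 50, start_frac: float = 0.5) -> list[float]:
    """Running average of parameter snapshots taken every `every` steps
    once training has passed `start_frac` of `total_steps`.
    `checkpoints` maps step -> flat parameter vector."""
    avg, n = None, 0
    for step in sorted(checkpoints):
        if step < start_frac * total_steps or step % every != 0:
            continue  # before the SWA window, or not a snapshot step
        params = checkpoints[step]
        if avg is None:
            avg, n = list(params), 1
        else:
            n += 1
            # incremental mean: avg += (params - avg) / n
            avg = [a + (p - a) / n for a, p in zip(avg, params)]
    return avg
```

The incremental-mean form avoids storing all snapshots; only the current average and a count are kept.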
Evaluation
  • sliding window eval (context_length: 2048, stride: 256)
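Sliding-window eval with context 2048 and stride 256 scores a long document by advancing a 2048-token window in 256-token steps, scoring only the newly entered tokens in each window so every token is scored exactly once with near-maximal left context. A sketch of the window bookkeeping, under that (assumed) interpretation of the parameters:

```python
def sliding_windows(doc_len: int, context_length: int = 2048,
                    stride: int = 256) -> list[tuple[int, int, int]]:
    """(start, end, score_from) spans for sliding-window eval: each window
    covers at most `context_length` tokens and only tokens from
    `score_from` onward are scored, so no token is scored twice."""
    spans = []
    pos = 0  # first not-yet-scored token
    while pos < doc_len:
        # first window scores everything it covers; later ones score `stride`
        end = min(pos + stride, doc_len) if spans else min(context_length, doc_len)
        start = max(0, end - context_length)
        spans.append((start, end, pos))
        pos = end
    return spans
```

A smaller stride gives each scored token more context (lower bpb) at the cost of roughly `context_length / stride` times more forward passes.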
Sequence Length
  • sequence_length (train_length: 2048, eval_length: 2048)
Compression
  • zstd (level: null)

Novel Contributions

  • Dense lexical 11-layer 512-dim model with KV4 and MLP3x
  • SmearGate architecture component
  • BigramHash(4096 x 128) lexical feature module
  • Muon optimizer with weight decay 0.038
  • SWA training schedule
  • Legal re-export using int6_zstd_core to fit under the 16MB artifact cap
  • Doc-sliding evaluation with 2048 context and 256 stride
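For the 16MB artifact cap mentioned above, a quick arithmetic check that the reported 15704854-byte artifact fits; whether the cap is binary (16 MiB = 16,777,216 bytes) or decimal (16,000,000 bytes) is an assumption here, though this artifact fits under either:

```python
ARTIFACT_BYTES = 15704854            # reported Artifact Size
CAP_BYTES = 16 * 1024 * 1024         # assumed 16 MiB cap (16,777,216 bytes)


def fits_under_cap(size_bytes: int, cap_bytes: int = CAP_BYTES) -> bool:
    """True when the (already zstd-compressed) artifact fits under the cap."""
    return size_bytes <= cap_bytes


headroom = CAP_BYTES - ARTIFACT_BYTES  # 1072362 bytes of slack
```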