val_bpb: 1.2252
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~15.86 MB
Training Techniques
Architecture
token-shift mixing
Replaced most attention layers with RWKV-inspired local token-shift mixing that blends the current token with the previous token using learned per-dimension interpolation weights.
parameters: {"layers":8}
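A minimal numpy sketch of the token-shift mixer described above: each position is a per-dimension interpolation between its own embedding and the previous token's. The shapes and the zero-padding of the first position are illustrative assumptions.

```python
import numpy as np

def token_shift(x, mu):
    """RWKV-style token shift: blend each token with its predecessor.

    x:  (T, D) sequence of token embeddings
    mu: (D,) learned per-dimension interpolation weights in [0, 1]
    """
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                # shift right by one; the first token sees zeros
    return (1.0 - mu) * x + mu * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
mu = np.full(8, 0.25)                # illustrative weights; in training these are learned
y = token_shift(x, mu)
```

Because the mix is per-dimension, each channel can learn its own balance between "current" and "previous" information.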
attention window
Used short-window (128-token) quadratic attention in three layers, with the final attention layer retaining full context.
parameters: {"layers":3,"window":128}
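The windowed attention above can be sketched as ordinary quadratic attention with a band-shaped causal mask. Single-head, unbatched, and without the full-context final layer; the window size below is small only to keep the example readable.

```python
import numpy as np

def windowed_causal_attention(q, k, v, window=128):
    """Quadratic attention where position i attends only to positions
    (i - window, i], i.e. itself and the previous window-1 tokens."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)           # causal band of width `window`
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(7)
q = rng.standard_normal((10, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))
o = windowed_causal_attention(q, k, v, window=4)
```

Position 0 can only attend to itself, so its output is exactly `v[0]`.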
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
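A sketch of the KV-head sharing in grouped query attention: with 8 query heads and 4 KV heads, each KV head is broadcast to a group of 2 query heads. Causal masking is omitted here for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (H, T, d); k, v: (Hkv, T, d) with H a multiple of Hkv."""
    H, T, d = q.shape
    Hkv = k.shape[0]
    group = H // Hkv
    k = np.repeat(k, group, axis=0)     # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over keys (unmasked, for brevity)
    return w @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((4, 4, 16))
v = rng.standard_normal((4, 4, 16))
out = grouped_query_attention(q, k, v)
```

Halving the KV heads halves the KV cache at inference time while keeping all 8 query heads.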
MLP3x
Used a 3x expansion MLP.
parameters: {"expansion":3}
LeakyReLU
Used LeakyReLU squared activation in the MLP instead of SwiGLU.
parameters: {"slope":0.5}
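The two MLP entries above can be combined into one sketch: a 3x-expansion MLP whose activation is LeakyReLU squared (slope 0.5) rather than SwiGLU. Whether the negative branch keeps its sign after squaring is not specified; this version squares both branches, which is one plausible reading.

```python
import numpy as np

D = 16
rng = np.random.default_rng(9)
W1 = rng.standard_normal((D, 3 * D)) * 0.1   # 3x expansion
W2 = rng.standard_normal((3 * D, D)) * 0.1

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with slope 0.5, then squared (sign handling is an assumption).
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp3x(x):
    """3x-expansion MLP with LeakyReLU-squared activation (no gating,
    unlike SwiGLU, so it needs only two weight matrices)."""
    return leaky_relu_sq(x @ W1) @ W2

out = mlp3x(rng.standard_normal((5, D)))
a = leaky_relu_sq(np.array([-2.0, 0.0, 3.0]))
```

Dropping SwiGLU's gate saves a third projection matrix, which matters for a size-constrained artifact.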
BigramHash
Added a hashed bigram embedding to capture local token-pair context.
parameters: null
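A sketch of a hashed bigram embedding: each (previous, current) token pair is hashed into a fixed bucket table and the bucket's vector is added to the token embedding. The bucket count, table init, and multiplicative hash below are illustrative choices, not the submission's exact scheme.

```python
import numpy as np

VOCAB, BUCKETS, D = 1000, 4096, 32
rng = np.random.default_rng(2)
bigram_table = rng.standard_normal((BUCKETS, D)) * 0.02   # learned in training

def bigram_hash_embed(tokens):
    """Hash each (prev, cur) token pair into a bucket and look up its vector."""
    t = np.asarray(tokens)
    prev = np.concatenate(([0], t[:-1]))        # pad position 0 with token id 0
    idx = (prev * 1000003 + t) % BUCKETS        # cheap multiplicative hash (illustrative)
    return bigram_table[idx]                    # (T, D), added to the token embeddings

emb = bigram_hash_embed([5, 17, 17, 42])
same = bigram_hash_embed([7, 9, 7, 9])          # positions 1 and 3 share the pair (7, 9)
```

Hashing keeps the table size fixed regardless of vocabulary size, at the cost of occasional bucket collisions between unrelated pairs.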
SmearGate
Applied a learned gate after embedding normalization to blend each token with the previous token before the first layer.
parameters: null
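A sketch of the smear gate: after embedding normalization, a learned per-dimension gate mixes in the previous token before the first layer. The sigmoid-of-a-learned-vector parameterization and the additive blend are assumptions; the entry does not specify the exact form.

```python
import numpy as np

def smear_gate(x, w):
    """Additively 'smear' the previous token into each position,
    gated per dimension by sigmoid(w)."""
    g = 1.0 / (1.0 + np.exp(-w))       # (D,) gate values in (0, 1)
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                  # first position has no predecessor
    return x + g * prev

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 8))
y = smear_gate(x, np.zeros(8))         # zero logits -> gate of 0.5 everywhere
```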
Value Residual
Added a shared value embedding projected into KV-head space and applied with a learned per-layer scale.
parameters: {"layers":[9,10]}
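One way to read the value-residual entry, sketched below: a shared per-token value embedding is projected into the KV-head shape and added to each listed layer's value heads with a learned scalar. The projection shape and the scalar-per-layer form are assumptions.

```python
import numpy as np

D, HKV, DH = 32, 4, 8
rng = np.random.default_rng(8)
W_proj = rng.standard_normal((D, HKV * DH)) * 0.05   # shared value-embedding projection

def value_residual(v_heads, value_emb, layer_scale):
    """Add a shared value embedding, projected into KV-head space,
    scaled by a learned per-layer scalar.

    v_heads:   (HKV, T, DH) this layer's value heads
    value_emb: (T, D) shared value embedding
    """
    extra = (value_emb @ W_proj).reshape(-1, HKV, DH)   # (T, HKV, DH)
    return v_heads + layer_scale * extra.transpose(1, 0, 2)

v = rng.standard_normal((HKV, 6, DH))
e = rng.standard_normal((6, D))
out = value_residual(v, e, layer_scale=0.1)
```

With `layer_scale=0`, a layer falls back to its own values, so the network can learn per layer how much of the shared signal to use.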
XSA
Projected attention outputs away from the value direction to encourage diverse head representations.
parameters: {"layers":[7,10]}
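A sketch of one reading of the projection step: remove from each attention output its component along the corresponding value direction, leaving only the orthogonal part. The exact XSA formulation is not given, so treat this as illustrative.

```python
import numpy as np

def project_away(out, v):
    """Subtract the component of each output vector along its value vector:
    out - (out . v_hat) v_hat, leaving the part orthogonal to v."""
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    coeff = np.sum(out * v_hat, axis=-1, keepdims=True)
    return out - coeff * v_hat

rng = np.random.default_rng(4)
o = rng.standard_normal((5, 16))
v = rng.standard_normal((5, 16))
p = project_away(o, v)
```

The result is (numerically) orthogonal to `v` at every position, which is the sense in which heads are pushed toward diverse representations.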
Partial RoPE
Applied rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
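A sketch of partial RoPE matching the parameters above: rotate only the first 16 of 64 head dimensions and pass the remaining 48 through unchanged. The half-split pairing of rotated dimensions is one common convention.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    of each position's head vector; leave the rest untouched.

    x: (T, D) per-head vectors for one head across T positions
    """
    T, D = x.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]                   # (T, 1)
    freqs = base ** (-np.arange(half) / half)     # (half,) geometric frequencies
    ang = pos * freqs                             # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]     # pair up the rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

rng = np.random.default_rng(5)
q = rng.standard_normal((4, 64))
qr = partial_rope(q)
```

Since rotation is norm-preserving and position 0 gets a zero angle, the first row and the untouched 48 dimensions come back unchanged.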
Quantization
int6
bits: 6
scope: all
late QAT
bits: null
scope: all
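The int6 scheme (and the fake-quantization a late QAT phase would train through) can be sketched as symmetric round-to-nearest with levels in [-31, 31]. Per-tensor scaling is an assumption; the submission may use per-channel or per-block scales.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6 bits (signed levels -31..31).
    Returns integer codes plus the scale needed to dequantize."""
    qmax = 2 ** 5 - 1                                   # 31 for signed 6-bit
    scale = np.max(np.abs(w)) / qmax
    q6 = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q6, scale

rng = np.random.default_rng(6)
w = rng.standard_normal((8, 8)).astype(np.float32)
q6, s = quantize_int6(w)
w_hat = q6.astype(np.float32) * s                       # dequantized ("fake-quant") weights
```

In QAT, `w_hat` replaces `w` in the forward pass (with a straight-through gradient), so the network adapts to the rounding error before the final export.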
Weight Averaging
EMA
parameters: null
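EMA weight averaging keeps a slow-moving copy of the weights for evaluation. The update rule below is standard; the decay value is illustrative, since the entry reports no parameters.

```python
import numpy as np

def ema_update(avg, new, decay=0.999):
    """One EMA step over model weights: avg <- decay*avg + (1-decay)*new."""
    return decay * avg + (1.0 - decay) * new

w_avg = np.zeros(4)
for step in range(3):
    w = np.full(4, float(step + 1))        # stand-in for the live training weights
    w_avg = ema_update(w_avg, w, decay=0.9)
```

The averaged copy lags the live weights, smoothing out late-training noise; the final checkpoint is taken from `w_avg`, not `w`.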
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
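Muon's distinguishing step is orthogonalizing the momentum-accumulated gradient of each weight matrix via a Newton-Schulz iteration before applying it. The quintic coefficients below follow the published Muon recipe; the shapes and step count are illustrative, and momentum/decay settings are unreported here.

```python
import numpy as np

def orthogonalize(G, steps=5):
    """Approximately orthogonalize a gradient matrix with Muon's
    quintic Newton-Schulz iteration, driving singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize by Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(10)
G = rng.standard_normal((8, 8))
X = orthogonalize(G)
s = np.linalg.svd(X, compute_uv=False)     # singular values cluster near 1
```

Equalizing the update's singular values means every direction in the weight matrix moves at a similar rate, which is the intuition behind Muon's fast convergence.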
Compression
zlib
level: null
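Since the quantized weights take few distinct integer values, zlib can shrink the serialized artifact further. A sketch with stand-in data; the compression level is illustrative, as none is reported.

```python
import zlib
import numpy as np

# Stand-in for a quantized weight tensor: int6-range codes stored as int8 bytes.
rng = np.random.default_rng(11)
codes = np.clip(np.round(rng.standard_normal(100_000) * 10), -31, 31).astype(np.int8)

raw = codes.tobytes()
packed = zlib.compress(raw, level=9)   # level 9 is an illustrative choice
ratio = len(packed) / len(raw)         # < 1.0: low-entropy codes compress well
```

Decompression with `zlib.decompress` recovers the exact byte stream, so this stage is lossless on top of the lossy quantization.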
Regularization
logit softcap
parameters: {"threshold":30}
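The logit softcap with threshold 30 is typically the tanh form: logits are squashed smoothly into (-30, 30) while staying near-identity around zero.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits to (-cap, cap) via cap * tanh(logits / cap).
    For |logits| << cap the map is approximately the identity."""
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, 0.0, 1.0, 100.0]))
```

Bounding the logits keeps extreme pre-softmax values from destabilizing training without a hard clip's dead gradients.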
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Hybrid RWKV-inspired token-shift layers replacing most attention layers
- Short-window attention combined with a final full-context attention layer
- Learned per-dimension interpolation token mixing using previous-token blending
- Combination of hybrid architecture with BigramHash, SmearGate, XSA, Partial RoPE, EMA, late QAT, and Muon