val_bpb: 0.0972
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash: Uses hashed n-gram/bigram-style context matching in the model.
parameters: {"dimensions":128,"buckets":4096}
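The entry reports only the embedding width and bucket count, so the following is a hedged sketch of what a hashed bigram lookup with these parameters could look like; the mixing function and all names are illustrative assumptions, not the model's code.

```python
# Hypothetical hashed-bigram lookup with the reported parameters
# (128-dim embeddings, 4096 buckets).
DIM, BUCKETS = 128, 4096

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a (previous, current) token pair into a fixed bucket index."""
    # A simple multiplicative mix; the real model may use any hash.
    h = (prev_token * 1_000_003 + token) * 2_654_435_761
    return (h ^ (h >> 16)) % BUCKETS

# One embedding row per bucket; zeros here, learned in a real model.
table = [[0.0] * DIM for _ in range(BUCKETS)]

def bigram_feature(prev_token: int, token: int) -> list:
    """Context feature to be combined with the token representation."""
    return table[bigram_bucket(prev_token, token)]
```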
SmearGate: Included as part of the architecture.
parameters: null
Value Residual: Uses value residual connections.
parameters: null
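Value residual connections, as commonly described, mix each attention layer's value vectors with the first layer's values. A one-line sketch; the mixing weight `lam` is an arbitrary stand-in here (in practice it is typically a learned per-layer scalar), since no parameters are reported.

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    # Blend this layer's values with the first layer's values.
    # `lam` is an assumed illustration, not a reported parameter.
    return lam * v_layer + (1.0 - lam) * v_first
```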
GQA: Grouped query attention with separate query and key/value head counts.
parameters: {"query_heads":8,"kv_heads":8}
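A minimal grouped-query attention sketch: each K/V head is shared by a group of query heads. Note that with the reported query_heads=8 and kv_heads=8 the group size is 1, so this configuration coincides with standard multi-head attention. Causal masking is omitted for brevity.

```python
import numpy as np

def gqa(q, k, v):
    # q: (query_heads, T, d); k, v: (kv_heads, T, d)
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```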
ReLU²: MLP uses squared ReLU activations.
parameters: {"mlp_multiplier":3}
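The squared-ReLU MLP is straightforward; per the reported mlp_multiplier of 3, the hidden width is three times the model dimension. A minimal sketch:

```python
import numpy as np

def relu_squared(x):
    # ReLU²: zero for x <= 0, x**2 otherwise.
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # w_in: (d, 3*d), w_out: (3*d, d) per the mlp_multiplier of 3.
    return relu_squared(x @ w_in) @ w_out
```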
XSA: Used across all layers.
parameters: {"layers":11}
Partial RoPE: Applies partial rotary positional embeddings.
parameters: {"train_eval_ratio":"16/64"}
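Partial RoPE usually means rotating only a subset of each head's channels. Interpreting the reported "16/64" as 16 rotary channels out of a 64-dim head is an assumption for illustration (the parameter name suggests it could instead relate a train-time to an eval-time setting); under that assumption:

```python
import numpy as np

def partial_rope(x, pos, rot=16, base=10000.0):
    # Rotate only the first `rot` channels of a head vector `x`; the
    # remaining channels pass through unchanged. "16 of 64" rotary
    # channels is an assumed reading of the reported "16/64".
    half = rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    out = x.astype(float)
    x1 = out[:half].copy()
    x2 = out[half:rot].copy()
    out[:half] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[half:rot] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```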
Regularization
LN scale
parameters: null
logit softcap
parameters: {"value":30}
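Logit softcapping with value 30 is commonly the tanh-based smooth bound; assuming that standard formulation:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly squashes logits into (-cap, cap); near zero it is close
    # to the identity, so only extreme logits are noticeably affected.
    return cap * np.tanh(logits / cap)
```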
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.02,"momentum_schedule_end":0.99}
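Muon updates 2-D weight matrices with an approximately orthogonalized momentum direction computed by a quintic Newton-Schulz iteration. A hedged sketch using the reported hyperparameters; the iteration coefficients follow the public Muon reference implementation, and the exact variant used for this entry is not specified (momentum is scheduled up to 0.99 over training per the parameters above).

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Quintic Newton-Schulz iteration pushing singular values toward 1;
    # coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(W, M, grad, lr=0.02, momentum=0.92, weight_decay=0.04):
    # One hypothetical update with the reported lr / momentum / decay.
    M = momentum * M + grad
    W = W * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(M)
    return W, M
```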
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
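The EMA entry above keeps a shadow copy of the weights, updated each step with decay 0.997; the shadow copy, not the raw weights, is what gets evaluated. A minimal sketch:

```python
def ema_update(shadow, weights, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * weights, applied every step.
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```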
Quantization
mixed int6
bits: 6
scope: model
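"Mixed int6" suggests most tensors are stored as 6-bit integers with accompanying scales, though the exact scheme (per-channel vs per-tensor scales, which tensors stay in higher precision) is not reported. A minimal symmetric per-tensor sketch:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit codes in [-31, 31] (one sign bit + 5 magnitude bits).
    # The actual "mixed" scheme is an unreported detail.
    scale = np.abs(w).max() / 31.0 + 1e-12
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```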
Compression
lzma
level: null
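The artifact is lzma-compressed with an unreported level; a round-trip with Python's stdlib for illustration (the blob here is a repetitive stand-in, not the real artifact):

```python
import lzma

blob = bytes(range(256)) * 64       # stand-in for a serialized weight artifact
packed = lzma.compress(blob)        # preset left at lzma's default, since
                                    # the entry does not report the level
restored = lzma.decompress(packed)
```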
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
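A warmdown schedule holds the base LR constant and then decays it to zero over the final warmdown_steps (3,500 here). Linear decay is the usual form and is assumed in this sketch:

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3500):
    # Constant phase, then linear decay to 0 over the last `warmdown_steps`.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```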
Novel Contributions
- Extended n-gram backoff to order-14
- Enabled full-rescore two-pass evaluation with stored neural probabilities
- Increased alpha max to 0.70 for stronger high-order n-gram trust
- Reduced chunk size to 262,144 tokens for more frequent cache updates
- Maintained score-first legal evaluation while rescoring all chunks with a warm cache
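The contributions above can be illustrated with a minimal backoff lookup: try the longest stored context (order-14 means up to 13 prior tokens) and weight a hit by alpha, capped at 0.70. How alpha interacts with the stored neural probabilities in the full two-pass rescore is not specified, so this is only a sketch of the n-gram side.

```python
def backoff_prob(context, token, counts, max_order=14, alpha_max=0.70):
    # Walk down from the longest context; the first context with counts
    # wins, weighted by alpha_max. Blending the remainder with the neural
    # model's stored probabilities (the two-pass rescore) is omitted.
    for order in range(min(max_order - 1, len(context)), 0, -1):
        ctx = tuple(context[-order:])
        if ctx in counts and token in counts[ctx]:
            return alpha_max * counts[ctx][token] / sum(counts[ctx].values())
    return 0.0
```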