| Field | Value |
|---|---|
| val_bpb | 0.4188 |
| Architecture | Transformer |
| Optimizer | Parallel Muon |
| Artifact Size | 15.66 MB |
Training Techniques
Quantization
- mixed int5/int6: MLP weights quantized to int5; attention and embedding weights to int6.
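A minimal sketch of the round-to-nearest scheme such mixed int5/int6 quantization implies. Symmetric per-tensor scaling is an assumption, not a detail recorded in the entry.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to a signed `bits`-wide grid."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Per the entry: int5 for MLP weights, int6 for attention/embedding weights.
w_mlp = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q5, s5 = quantize(w_mlp, 5)
q6, s6 = quantize(w_mlp, 6)
```

With a per-tensor scale chosen from the max-magnitude weight, the reconstruction error is bounded by half a quantization step.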
Architecture
- GQA: grouped-query attention with 8 query heads sharing 4 KV heads (heads: 8, kv_heads: 4).
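The head layout above (8 query heads over 4 KV heads, so each KV head serves a group of 2 query heads) can be sketched in plain NumPy; causal masking and batching are omitted for brevity.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention. Shapes: q (T, 8, d); k, v (T, 4, d)."""
    group = q.shape[1] // k.shape[1]             # query heads per KV head
    k = np.repeat(k, group, axis=1)              # broadcast KV to (T, 8, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over source positions
    return np.einsum('hts,shd->thd', w, v)
```

Sharing KV heads cuts the KV cache in half here while keeping 8-way query diversity.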
- MLP3x: transformer MLP hidden layer widened to 3.0x (multiplier: 3).
- LeakyReLU: MLP activation is a squared LeakyReLU with negative slope 0.5 (slope: 0.5).
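One literal reading of "LeakyReLU(0.5)^2": apply a LeakyReLU with slope 0.5, then square. Note that squaring makes the negative branch positive; whether the submission preserves the sign there is not recorded, so treat this as a sketch.

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared LeakyReLU: LeakyReLU(x; slope) followed by squaring.
    Sign handling on the negative branch is an assumption."""
    y = np.where(x > 0, x, slope * x)
    return y * y
```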
- SmearGate: SmearGate module included in the base neural stack.
- BigramHash: hashed bigram features used in the model stack and n-gram context handling (table size: 2048).
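A hashed bigram table of size 2048 can be sketched as follows; the multiply-and-xor mixing is illustrative, not the submission's actual hash function.

```python
def bigram_hash(prev_tok: int, cur_tok: int, table_size: int = 2048) -> int:
    """Map a (previous, current) token pair to a slot in a fixed-size
    table (2048 per the entry). Mixing constants are illustrative."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16                      # fold high bits into the low bits
    return h % table_size
```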
- VE128: value-residual embeddings with 128 dimensions.
Initialization
- OrthoInit: orthogonal initialization used in the base stack.
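Orthogonal initialization is conventionally built from the QR decomposition of a Gaussian matrix; a sketch, assuming that standard construction (the entry only names the technique).

```python
import numpy as np

def orthogonal_init(shape, rng=None):
    """Orthogonal init via QR of a Gaussian matrix: the resulting
    columns are orthonormal, which keeps activation norms stable."""
    rng = rng if rng is not None else np.random.default_rng()
    q, r = np.linalg.qr(rng.standard_normal(shape))
    return q * np.sign(np.diag(r))   # fix the QR sign ambiguity
```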
Optimizer
- Parallel Muon (weight decay, momentum, and other hyperparameters not recorded).
Compression
- lzma (compression level not recorded).
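Compressing the serialized artifact with Python's stdlib lzma module might look like this; the preset is a guess, since the entry does not record the level used.

```python
import lzma

def compress_artifact(raw: bytes, preset: int = 9) -> bytes:
    """Compress serialized weights with lzma. preset=9 (max) is an
    illustrative choice, not the submission's recorded setting."""
    return lzma.compress(raw, preset=preset)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```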
Other
- Complementary training: down-weights tokens that are easily predicted by bigram statistics (loss weighting: 1 - alpha * p_bigram(token)).
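The recorded weighting 1 - alpha * p_bigram(token) can be sketched directly; the value of alpha and the weight normalization are illustrative choices.

```python
import numpy as np

def complementary_weights(p_bigram: np.ndarray, alpha: float) -> np.ndarray:
    """Per-token loss weights w = 1 - alpha * p_bigram(token), as recorded:
    tokens a bigram model already predicts well contribute less."""
    return 1.0 - alpha * p_bigram

def weighted_nll(logp_model: np.ndarray, p_bigram: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Weight-normalized negative log-likelihood. alpha = 0.5 and the
    normalization are illustrative, not recorded settings."""
    w = complementary_weights(p_bigram, alpha)
    return float(-(w * logp_model).sum() / w.sum())
```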
- N-gram mixing: strictly causal backoff n-gram mixer with entropy-adaptive blending.
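A sketch of the two ingredients, assuming a standard longest-match backoff and an entropy gate normalized by the maximum entropy log(vocab_size); the submission's exact rules are not recorded.

```python
import numpy as np

def backoff_lookup(context, tables):
    """Strictly causal backoff: try the longest n-gram context first and
    fall back to shorter orders. `tables[n]` maps n-token contexts to a
    next-token distribution (structure assumed)."""
    for n in range(len(context), 0, -1):
        key = tuple(context[-n:])
        if key in tables.get(n, {}):
            return tables[n][key]
    return None

def entropy_blend(p_model, p_ngram, vocab_size):
    """Entropy-adaptive blending: a confident (low-entropy) n-gram
    distribution gets more weight; a near-uniform one defers to the model."""
    h = -(p_ngram * np.log(p_ngram + 1e-12)).sum(-1)
    lam = np.clip(1.0 - h / np.log(vocab_size), 0.0, 1.0)
    return lam * p_ngram + (1.0 - lam) * p_model
```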
- Evaluation: score-first, DDP-safe evaluation protocol with synchronization before cache updates (ddp_safe: true, score_first: true).
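The score-first property can be illustrated without torch.distributed: score every batch against a frozen cache, synchronize, then apply the deferred updates. The `barrier` argument stands in for `dist.barrier()`; all names here are illustrative, not the submission's code.

```python
def evaluate_score_first(batches, score_fn, cache, barrier=lambda: None):
    """Score-first, DDP-safe evaluation sketch: every rank scores all of
    its batches against a frozen cache, synchronizes, and only then
    applies deferred cache updates, so scores cannot depend on how
    ranks interleave their writes."""
    scores, pending = [], []
    for batch in batches:
        s, update = score_fn(batch, cache)   # read-only pass over the cache
        scores.append(s)
        pending.append(update)
    barrier()                                # all ranks finish scoring first
    for update in pending:
        cache.update(update)                 # cache writes happen after sync
    return scores
```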
Novel Contributions
- Mixed precision quantization with int5 MLP weights and int6 attention/embedding weights
- Complementary training focused on tokens poorly predicted by n-grams
- Strictly causal backoff n-gram mixer with entropy-adaptive blending
- Score-first, DDP-safe cache update protocol for multi-GPU evaluation
- Artifact compression with lzma to fit within the 16MB limit