- val_bpb: 0.0280
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 12.8 MB
Training Techniques

Architecture
- **BigramHash**: GPU-native multi-order backoff n-gram hashing tables for oracle predictions; parameters: `{"orders": "2-16", "buckets": 4194304}`
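The real tables are GPU-native; a minimal CPU sketch of the same idea (hash each context into a fixed bucket table per order, then back off from the longest matching context to shorter ones at query time) might look like the following. The class and hash function are illustrative, not the card's actual implementation.

```python
import zlib

BUCKETS = 4_194_304  # 2**22, matching the card's {"buckets": 4194304}

def bucket(context: tuple) -> int:
    """Hash an n-gram context into a fixed-size table (hash choice is an assumption)."""
    return zlib.crc32(repr(context).encode()) % BUCKETS

class BackoffNgramOracle:
    """Toy stand-in for the GPU-native multi-order backoff tables."""

    def __init__(self, orders=range(2, 17)):
        self.orders = sorted(orders, reverse=True)   # longest order first
        self.tables = {n: {} for n in self.orders}   # bucket -> {token: count}

    def update(self, tokens):
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                ctx = bucket(tuple(tokens[i - n + 1:i]))  # n-1 context tokens
                slot = self.tables[n].setdefault(ctx, {})
                slot[tokens[i]] = slot.get(tokens[i], 0) + 1

    def predict(self, context):
        for n in self.orders:                        # back off: high order -> low
            if len(context) < n - 1:
                continue
            slot = self.tables[n].get(bucket(tuple(context[-(n - 1):])))
            if slot:
                return max(slot, key=slot.get)       # most frequent continuation
        return None

oracle = BackoffNgramOracle(orders=range(2, 5))
oracle.update([1, 2, 3, 1, 2, 3, 1, 2, 3])
```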
- **LeakyReLU**: squared LeakyReLU activation in the MLP; parameters: `{"squared": true}`
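A sketch of the activation, assuming a plain elementwise square of the LeakyReLU output (some variants keep the sign instead) and the conventional 0.01 negative slope, which the card does not specify:

```python
def leaky_relu_squared(x: float, slope: float = 0.01) -> float:
    """Square of LeakyReLU(x). The negative branch becomes tiny but nonzero,
    so a little gradient signal survives below zero. Slope 0.01 is assumed."""
    y = x if x > 0 else slope * x
    return y * y
```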
- **GQA**: grouped query attention; parameters: `{"query_heads": 8, "kv_heads": 4}`
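With 8 query heads over 4 KV heads, consecutive pairs of query heads share one K/V projection, halving KV-cache size. The head-to-group mapping reduces to integer division:

```python
def kv_head_for(query_head: int, query_heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head to the K/V head it shares under grouped query attention."""
    assert query_heads % kv_heads == 0
    return query_head // (query_heads // kv_heads)

# Which KV head each of the 8 query heads reads from:
mapping = [kv_head_for(h) for h in range(8)]
```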
- **XSA**: applied across all layers; parameters: `{"layers": 11}`
- **VE128**: value residual enhancement in later layers; parameters: `{"layers": [9, 10]}`
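One common form of a value residual mixes the first layer's attention values into a later layer's values before the output projection. The blend below is only a guess at what "value residual enhancement" means here; the mixing weight and the use of first-layer values are assumptions, the card only names layers 9 and 10.

```python
def mix_value_residual(v_layer, v_first, lam=0.5):
    """Blend a later layer's attention values with the first layer's values:
    v = lam * v_layer + (1 - lam) * v_first. lam=0.5 is illustrative."""
    return [lam * a + (1 - lam) * b for a, b in zip(v_layer, v_first)]
```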
- **MLP3x**: three-times MLP stack; parameters: null
- **RoPE**: partial rotary positional embeddings; parameters: `{"dimensions": "16/64"}`
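Per the "16/64" setting, only the first 16 of 64 head dimensions are rotated and the rest pass through unchanged. A sketch, assuming the adjacent-pair rotation convention and base 10000 (both vary across implementations):

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dimensions
    of a head vector; remaining dimensions are left untouched."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = position / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s       # rotate each (x, y) pair by theta
        out[2 * i + 1] = x * s + y * c
    return out

rotated = partial_rope([1.0] * 64, position=3)
```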
- **Weight Averaging**: EMA + tight SWA; parameters: `{"ema_decay": 0.997, "swa_interval": 50}`
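The two averages can be tracked side by side: an exponential moving average updated every step, plus stochastic weight averaging snapshots taken every 50 steps. A toy scalar sketch using the card's constants; how the two averages are combined at the end is not stated and is left open here:

```python
class AveragedWeights:
    """Track an EMA and a tight SWA of a (toy scalar) weight."""

    def __init__(self, w0, ema_decay=0.997, swa_interval=50):
        self.ema = w0
        self.decay = ema_decay
        self.interval = swa_interval
        self.swa_sum, self.swa_n = 0.0, 0

    def step(self, step_idx, w):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step_idx % self.interval == 0:   # SWA snapshot every 50 steps
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)

av = AveragedWeights(0.0)
for t in range(1, 101):
    av.step(t, 1.0)
```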
Quantization
- **GPTQ-lite**: bits: 6; scope: base model
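GPTQ proper corrects rounding error with second-order (Hessian) information; as a simplified stand-in, symmetric round-to-nearest 6-bit quantization per tensor shows the storage format:

```python
def quantize_6bit(weights):
    """Symmetric round-to-nearest 6-bit quantization (NOT full GPTQ:
    no Hessian-aware error correction). Signed levels in [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]      # integer codes
    deq = [qi * scale for qi in q]               # reconstructed weights
    return q, scale, deq

q, scale, deq = quantize_6bit([0.31, -0.155, 0.0])
```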
Compression
- **zlib**: level: null
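The card leaves the zlib level unspecified; a sketch of artifact packing, with level 9 assumed to favor the smallest file (pickle as the serializer is also an assumption):

```python
import pickle
import zlib

def pack_artifact(obj, level=9):
    """Serialize and zlib-compress a weight blob. Level 9 is an assumption."""
    return zlib.compress(pickle.dumps(obj), level)

def unpack_artifact(blob):
    """Inverse of pack_artifact."""
    return pickle.loads(zlib.decompress(blob))

blob = pack_artifact({"w": [0.0] * 1000})
```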
Evaluation
- **Sliding window eval**: parameters: `{"stride": 64}`
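With stride 64, each window scores only its last 64 tokens while seeing a long left context, so every token is scored exactly once with near-maximal context. A sketch of the window planning; the context length of 128 below is illustrative, the card fixes only the stride:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Plan a stride-based sliding-window evaluation.
    Returns (window_start, score_start, score_end) triples."""
    plans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)      # left context for this window
        plans.append((start, pos, end))    # only [pos, end) is scored
        pos = end
    return plans

plans = sliding_windows(200, context=128, stride=64)
```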
Test-Time Training
- **Score-first TTT**: parameters: `{"epochs": 1, "learning_rate": 0.001}`
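The ordering constraint is the whole point: each eval chunk is fully scored with the current parameters before any update, so the model never adapts on a token prior to being evaluated on it. A toy sketch with hypothetical `score`/`update` callables (the 0.5 step size below is illustrative, not the card's 0.001):

```python
def score_first_ttt(chunks, score, update, params):
    """Score each chunk BEFORE updating on it (one pass = one 'epoch')."""
    losses = []
    for chunk in chunks:
        losses.append(score(params, chunk))  # score entire chunk first...
        params = update(params, chunk)       # ...then adapt for later chunks
    return losses, params

losses, final = score_first_ttt(
    [4.0, 4.0],
    score=lambda p, c: abs(p - c),
    update=lambda p, c: p + 0.5 * (c - p),   # toy gradient-like step
    params=0.0,
)
```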
Sequence Length
- train_length: null; eval_length: 32000
Optimizer
- **Muon**: weight_decay: null; momentum: null; other_params: `{"adam": true}`
Regularization
- **LN scale**: parameters: `{"scale": "1/sqrt(layer+1)"}`
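The schedule shrinks each deeper block's contribution to the residual stream. Computing the per-layer scales is direct (a 12-layer depth is assumed here for illustration):

```python
import math

def ln_scales(n_layers=12):
    """Per-layer output scales of 1/sqrt(layer+1), 0-indexed:
    layer 0 -> 1.0, layer 3 -> 0.5, and so on."""
    return [1 / math.sqrt(layer + 1) for layer in range(n_layers)]

scales = ln_scales(12)
```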
Novel Contributions
- Order-16 frozen n-gram oracle prefilled from all 8B training tokens
- Score-first TTT where each eval chunk is fully scored before any updates
- BackoffNgramMixer with GPU-native order-2 through order-16 hashing
- Complementary training that downweights tokens already well predicted by the oracle
- Order-16 scaling chosen as the best BPB/eval-time tradeoff under budget
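The complementary-training idea above can be sketched as a per-token loss reweighting. The specific weighting function `1 - p_oracle(correct token)` is an assumption; the card states only that well-predicted tokens are downweighted:

```python
def complementary_weights(oracle_probs):
    """Downweight tokens the frozen n-gram oracle already predicts well.
    weight = 1 - p_oracle(correct token) is one simple choice (assumed)."""
    return [1.0 - p for p in oracle_probs]

def weighted_loss(token_losses, oracle_probs):
    """Weighted mean of per-token losses under the complementary weights."""
    w = complementary_weights(oracle_probs)
    return sum(wi * li for wi, li in zip(w, token_losses)) / (sum(w) or 1.0)

# A token the oracle nails (p=1.0) contributes nothing to the loss:
loss = weighted_loss([1.0, 1.0], [1.0, 0.0])
```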