val_bpb: 1.1531
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 12.72 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are repeated 3 times during training and evaluation, once the recurrence activates at 35% of training progress.
parameters: {"layers":[3,4,5],"loops":3,"activation_frac":0.35}
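A minimal sketch of the depth-recurrence forward pass described above, assuming the model's layers are a flat list of callables and the looped block is contiguous; names are illustrative, not from the submission.

```python
def forward_with_recurrence(x, layers, loop_layers=(3, 4, 5), loops=3, active=True):
    """Run layers in order; when `active`, the contiguous loop_layers block
    is executed `loops` times instead of once."""
    i = 0
    while i < len(layers):
        if active and i == loop_layers[0]:
            block = [layers[j] for j in loop_layers]
            for _ in range(loops):
                for layer in block:
                    x = layer(x)
            i = loop_layers[-1] + 1  # skip past the looped block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Before the 35% activation point, `active=False` gives the plain (non-recurrent) forward pass.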
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
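A NumPy sketch of grouped-query attention with the head counts above (8 query heads, 4 KV heads, so each KV head serves 2 query heads); shapes and the causal mask are assumptions about the implementation.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q carries n_heads, k/v carry n_kv_heads;
    each KV head is broadcast across n_heads // n_kv_heads query heads.
    q: (T, n_heads*hd), k/v: (T, n_kv_heads*hd)."""
    T, D = q.shape
    hd = D // n_heads
    group = n_heads // n_kv_heads
    q = q.reshape(T, n_heads, hd)
    k = np.repeat(k.reshape(T, n_kv_heads, hd), group, axis=1)
    v = np.repeat(v.reshape(T, n_kv_heads, hd), group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.einsum('hts,shd->thd', w, v)
    return out.reshape(T, D)
```

The KV projections are half the size of the query projection, which is the parameter saving GQA buys.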
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
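One plausible reading of "LeakyReLU squared" with `negative_slope` 0.5, sketched below: square the LeakyReLU output while preserving its sign, by analogy with the squared-ReLU activation. The sign-preserving choice is an assumption; a plain square would discard the sign of negative inputs.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by a signed square, so negative inputs
    remain negative after squaring (an assumed convention)."""
    y = np.where(x >= 0, x, negative_slope * x)
    return np.sign(y) * y * y
```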
Parallel residuals
GPT-J style parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
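A sketch of the difference between a sequential pre-LN block and the GPT-J parallel-residual block used from layer 7 onward; `attn`, `mlp`, and `norm` stand in for the real submodules.

```python
def block_forward(x, attn, mlp, norm, parallel=False):
    """One transformer block. In the GPT-J parallel form, attention and
    MLP both read the same normalized input and their outputs are summed
    into the residual; in the sequential form the MLP sees the
    post-attention residual."""
    if parallel:
        h = norm(x)
        return x + attn(h) + mlp(h)
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```

The parallel form lets the two submodules run concurrently and removes one normalization from the critical path.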
Partial RoPE
Applies rotary position embeddings to only part of the hidden dimensions.
parameters: {"dimensions":16}
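A NumPy sketch of partial RoPE: rotary embeddings are applied to the first 16 dimensions of each per-position vector and the remaining dimensions pass through unchanged. The half-split rotation layout and base frequency are conventional assumptions.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims dimensions of x by position-dependent
    angles; leave the rest untouched. x: (seq_len, head_dim)."""
    T, D = x.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

Leaving most dimensions unrotated gives the model position-free channels while keeping relative-position information in the rotated ones.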
Value Residual
Learned value embeddings (dimension 128) enabled for the later layers (9 and 10).
parameters: {"dimension":128,"layers":[9,10]}
BigramHash
Adds a precomputed bigram hash embedding/bias feature.
parameters: {"vocab":2048,"dimension":128}
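A sketch of the hashed-bigram embedding idea: each (previous, current) token pair is hashed into one of 2048 buckets, and a 128-dim embedding for that bucket is added to the token embedding. The hash mixing constant and the bucket-0 convention for position 0 are illustrative assumptions.

```python
import numpy as np

def bigram_hash_ids(tokens, n_buckets=2048):
    """Hash each (prev, cur) token pair into a bucket id; the first
    position has no predecessor and is assigned bucket 0 here."""
    ids = [0]
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        ids.append(((prev * 1000003) ^ cur) % n_buckets)  # illustrative hash
    return np.array(ids)

def embed_with_bigram(tokens, tok_emb, bigram_emb):
    """Token embedding plus hashed-bigram embedding (same dimension)."""
    return tok_emb[np.array(tokens)] + bigram_emb[bigram_hash_ids(tokens, len(bigram_emb))]
```

Hashing keeps the table at 2048 x 128 regardless of the number of distinct bigrams, at the cost of collisions.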
XSA
XSA module used in the last 4 layers.
parameters: {"layers":4}
Quantization
INT6
bits: 6
scope: all
STE QAT (straight-through-estimator quantization-aware training)
bits: 6
scope: all
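A sketch of the fake-quantization forward pass used in STE QAT: weights are rounded to 6-bit levels and dequantized, while the backward pass (not shown) treats the operation as identity so gradients flow to the float master weights. Symmetric per-tensor scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap w to 2**bits integer
    levels, then map back to float. In QAT the straight-through estimator
    makes the gradient of this op the identity."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Training against the quantized forward pass is what lets the final 6-bit artifact match the float model's loss closely.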
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
Adam
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95}
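The Muon entry above specifies a momentum warmup from 0.92 to the final 0.99 over 1,500 steps. The exact ramp shape isn't stated; a linear ramp, sketched below, is one common choice.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup (assumed linear): ramp from `start` to `end`
    over warmup_steps, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```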
Weight Averaging
EMA
parameters: {"decay":0.9965}
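A minimal sketch of the parameter EMA with the decay above; parameters are modeled as a flat list of floats for clarity.

```python
class EMA:
    """Exponential moving average of model parameters (decay 0.9965):
    shadow <- decay * shadow + (1 - decay) * param after each step."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial parameters
    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

Evaluation then runs on the shadow weights rather than the raw training weights.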
SWA
parameters: {"interval_steps":50,"lr_scale_threshold":0.2}
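A sketch of how the SWA parameters above plausibly interact: once the LR schedule's scale has decayed below 0.2, snapshot the weights every 50 steps and keep a running mean. The gating logic is an assumption inferred from the parameter names.

```python
class SWA:
    """Stochastic weight averaging: average snapshots taken every
    interval_steps, but only after lr_scale drops below the threshold."""
    def __init__(self, interval_steps=50, lr_scale_threshold=0.2):
        self.interval = interval_steps
        self.threshold = lr_scale_threshold
        self.n = 0
        self.mean = None
    def maybe_update(self, step, lr_scale, params):
        if lr_scale > self.threshold or step % self.interval != 0:
            return
        self.n += 1
        if self.mean is None:
            self.mean = list(params)
        else:  # incremental running mean
            self.mean = [m + (p - m) / self.n for m, p in zip(self.mean, params)]
```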
Compression
LZMA
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
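A sketch of sliding-window evaluation spans with the stride above: the sequence is scored in overlapping windows, and after the first window only the final 64 positions of each window (those with the most left context) contribute to the loss. The window length of 256 is an assumption; only the stride is given.

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    """Return (start, end, n_scored) spans covering n_tokens: the first
    window scores everything it sees, later windows slide by `stride`
    and score only their newly revealed tail."""
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        scored = end - start if start == 0 else end - (start + window - stride)
        spans.append((start, end, scored))
        if end == n_tokens:
            break
        start += stride
    return spans
```

Every token is scored exactly once, so the resulting bits-per-byte is comparable to a single-pass evaluation, just with longer effective context.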
Regularization
LN scale
parameters: {"enabled":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":5000}
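A sketch of the warmdown schedule: a constant learning rate followed by a linear decay to zero over the final 5,000 steps. The constant-then-linear shape is the usual convention for "warmdown" but is assumed here.

```python
def lr_scale(step, total_steps, warmdown_steps=5000):
    """LR multiplier: 1.0 until the last warmdown_steps, then a linear
    ramp down to 0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```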
Novel Contributions
- Novel two-layer BESE tokenizer with a 288-token vocabulary
- Structured 40-token base alphabet plus 248 BPE merges
- Byte-count-correct tokenizer design with proof of BPB invariance
- Reduced embedding table size versus SentencePiece to free budget for more model capacity
- Eval-time n-gram logit tilt using a precomputed bigram/trigram table
- Depth recurrence with parallel residuals under a 16MB artifact budget
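The eval-time n-gram logit tilt listed above can be sketched as follows: add weighted n-gram log-probabilities from the precomputed table to the model's logits, preferring a trigram match over a bigram fallback. The table layout, fallback order, and tilt weight are assumptions, not taken from the submission.

```python
import numpy as np

def tilt_logits(logits, context, trigram_table, bigram_table, weight=0.3):
    """Blend model logits with precomputed n-gram log-probabilities for
    the current context; trigram entries take priority over bigrams."""
    key3 = tuple(context[-2:])
    key2 = tuple(context[-1:])
    if key3 in trigram_table:
        return logits + weight * trigram_table[key3]
    if key2 in bigram_table:
        return logits + weight * bigram_table[key2]
    return logits  # no table entry: leave logits unchanged
```

Because the tilt is applied only at evaluation time, it costs nothing during training and only a table lookup per token at inference.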