val_bpb: 1.1207
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.97 MB
Training Techniques
Architecture
- weight tying: Tied input and output embeddings. parameters: null
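Weight tying reuses a single matrix as both the input embedding and the output projection, so the unembedding layer adds no parameters to the artifact. A minimal pure-Python sketch (class and method names are illustrative, not from the submission):

```python
class TiedEmbedding:
    """One shared table serves as input embedding and output head."""

    def __init__(self, vocab_size, dim):
        # rows are token embeddings; values here are dummy initializers
        self.table = [[0.01 * (t + d) for d in range(dim)]
                      for t in range(vocab_size)]

    def embed(self, token_id):
        # input side: look up the row for the token
        return self.table[token_id]

    def logits(self, hidden):
        # output side: project with the transpose of the same table,
        # so logits[t] = <hidden, embedding of token t>
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]
```

Because the table is shared, any storage or quantization savings on the embedding apply to the output head for free.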
- BigramHash: Adds a complementary bigram transition-statistics channel. parameters: {"buckets":4096,"dimensions":64}
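One plausible reading of BigramHash, given the listed parameters, is a hashed lookup: each (previous, current) token pair is hashed into one of 4096 buckets, each holding a learned 64-dimensional vector that is fed to the model alongside the usual embeddings. The hashing constant and feature wiring below are assumptions, not the submission's code:

```python
BUCKETS, DIM = 4096, 64  # from the listed parameters

def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    # hash the (previous, current) token pair into one of `buckets` slots;
    # the multiplicative constant is an arbitrary illustrative choice
    return (prev_id * 1000003 + cur_id) % buckets

def bigram_features(token_ids, table):
    # table: BUCKETS x DIM learned vectors; returns one DIM-vector per
    # position after the first, giving a parallel bigram-statistics channel
    return [table[bigram_bucket(token_ids[i - 1], token_ids[i])]
            for i in range(1, len(token_ids))]
```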
- GQA: Uses grouped-query attention with fewer key/value heads than query heads. parameters: {"heads":8,"kv_heads":4}
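With 8 query heads and 4 KV heads, grouped-query attention maps each consecutive pair of query heads to one shared KV head, halving the KV projection parameters and cache. A sketch of the head mapping (the grouping order is the standard convention, assumed here):

```python
N_HEADS, N_KV_HEADS = 8, 4  # from the listed parameters

def kv_head_for(q_head, n_heads=N_HEADS, n_kv_heads=N_KV_HEADS):
    # consecutive query heads share one KV head: with 8 query heads and
    # 4 KV heads, each KV head serves a group of 2 query heads
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```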
- LeakyReLU: Uses squared LeakyReLU as the MLP activation. parameters: {"slope":0.5}
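One plausible reading of "LeakyReLU squared" is squaring the LeakyReLU output, in the spirit of the squared-ReLU activation; whether the submission re-applies the sign on the negative branch is unknown, so this is an assumption:

```python
SLOPE = 0.5  # from the listed parameters

def leaky_relu_squared(x, slope=SLOPE):
    # square the LeakyReLU output; note squaring folds the negative
    # branch back to positive values (some variants re-apply the sign)
    y = x if x >= 0.0 else slope * x
    return y * y
```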
- Partial RoPE: Applies rotary position embeddings to a subset of head dimensions. parameters: {"dims":"16/64"}
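"16/64" reads as rotating only the first 16 of 64 head dimensions and passing the rest through unchanged. A sketch, assuming the conventional pairwise rotation with the usual base of 10000 (the base is an assumption):

```python
import math

ROT_DIMS, HEAD_DIM = 16, 64  # "16/64": rotate 16 of 64 head dimensions

def partial_rope(vec, pos, rot_dims=ROT_DIMS, base=10000.0):
    # rotate consecutive pairs in the first `rot_dims` dimensions by a
    # position-dependent angle; remaining dimensions are untouched
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```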
- SmearGate: Uses a position-mixing gate. parameters: null
- U-Net skip connections: Adds encoder-decoder-style skip connections between early and late layers. parameters: null
Regularization
- LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
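The listed scale suggests down-weighting each layer's contribution by a depth-dependent factor; exactly where the factor is applied (LayerNorm gain initialization versus the residual branch) is not stated, so that is an assumption. The factor itself is just:

```python
import math

def ln_scale(layer_idx):
    # deeper layers get a smaller scale: layer 0 -> 1.0, layer 3 -> 0.5,
    # damping later-layer contributions as 1/sqrt(layer+1)
    return 1.0 / math.sqrt(layer_idx + 1)
```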
- magnitude pruning: parameters: {"values":"±1","selective":true}
Weight Averaging
- EMA + Tight SWA: parameters: {"ema_decay":0.997,"swa_interval":50}
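The listed parameters suggest two averaging passes: a per-step exponential moving average with decay 0.997, plus a uniform ("tight") stochastic weight average over checkpoints sampled every 50 steps. How the two averages are combined at the end is not stated; the sketch below just maintains both (names are illustrative):

```python
EMA_DECAY, SWA_INTERVAL = 0.997, 50  # from the listed parameters

def ema_update(avg, weights, decay=EMA_DECAY):
    # exponential moving average of the weights, updated every step
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

class TightSWA:
    """Uniform average over checkpoints sampled every `interval` steps."""

    def __init__(self, interval=SWA_INTERVAL):
        self.interval, self.count, self.avg = interval, 0, None

    def maybe_collect(self, step, weights):
        if step % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]
```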
Quantization
- GPTQ: bits: 6, scope: MLP/attention body
- STE QAT: bits: 6, scope: parameter banks
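Quantization-aware training with a straight-through estimator fake-quantizes weights in the forward pass while letting gradients bypass the rounding in the backward pass. The forward side can be sketched in plain Python; the clipping range and symmetric-uniform grid below are assumptions, and the STE itself (identity backward through the rounding) only exists inside an autograd framework, so it is noted in comments:

```python
BITS = 6  # from the listed parameters

def fake_quantize(x, bits=BITS, max_abs=1.0):
    # uniform symmetric quantization: clip to [-max_abs, max_abs], then
    # snap to one of 2^(bits-1)-1 levels per sign (31 levels at 6 bits);
    # under STE, backward treats this whole function as the identity so
    # gradients flow through the non-differentiable rounding
    levels = 2 ** (bits - 1) - 1
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / max_abs * levels) / levels * max_abs
```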
Compression
- lzma: level: 9
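The serialized artifact is compressed with LZMA at the maximum standard preset; Python's stdlib exposes this directly (the submission's exact serialization and any custom filter chain are unknown):

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # preset 9 is the highest standard compression level; it trades
    # compression time for the smallest artifact size
    return lzma.compress(blob, preset=9)
```

`lzma.decompress` recovers the original bytes losslessly.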
Optimizer
- Parallel Muon: weight_decay: null, momentum: null, other_params: {"parameter_banking":true}
Evaluation
- sliding window eval: parameters: {"stride":64,"seq_len":2048}
- sliding window eval: parameters: {"stride":16}
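Sliding-window evaluation scores a long token stream by advancing a fixed-length context window by `stride` tokens at a time, scoring only the tokens not covered by an earlier window; a smaller stride (16 vs. 64) gives each scored token more context at higher compute cost. A sketch of the window bookkeeping (the exact edge handling is an assumption):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    # each span is (context_start, context_end, score_from): the window
    # sees tokens [context_start, context_end) but only scores the tokens
    # in [score_from, context_end) that no earlier window scored, so every
    # token is scored exactly once with near-maximal left context
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + seq_len, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans
```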
Test-Time Training
- score-first TTT: parameters: {"stride":64}
- score-first TTT: parameters: {"stride":16}
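"Score-first" test-time training reads as: for each evaluation chunk, score it with the current weights before taking a training step on it, so the model never scores text it has already adapted to. The control flow can be sketched generically (the callables stand in for the model's loss and optimizer step, which are not specified here):

```python
def score_first_ttt(chunks, score_fn, update_fn):
    # score each chunk BEFORE adapting on it: evaluation stays honest
    # because no chunk is scored after the model has trained on it
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n_tokens = score_fn(chunk)   # score with current weights
        total_loss += loss * n_tokens
        total_tokens += n_tokens
        update_fn(chunk)                   # then one training step on it
    return total_loss / total_tokens
```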
LR Schedule
- warmdown: parameters: {"warmdown_steps":3500}
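A warmdown schedule holds the learning rate constant and then decays it to zero over the final `warmdown_steps` steps; the linear decay shape below is an assumption (the listing gives only the step count):

```python
WARMDOWN_STEPS = 3500  # from the listed parameters

def lr_at(step, total_steps, base_lr, warmdown_steps=WARMDOWN_STEPS):
    # constant LR for most of training, then a linear "warmdown" to zero
    # over the final warmdown_steps steps
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```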
Sequence Length
- sequence_length: train_length: 2048, eval_length: 2048
Novel Contributions
- Increasing the BPE vocabulary from 1,024 to 8,192 tokens as an entropy-optimized scaling variable.
- Using mutual-information spectrum analysis of FineWeb to guide vocabulary sizing.
- Rebalancing parameters from MLP capacity into a much larger embedding table while keeping the same overall Transformer shape.
- Showing that most per-layer techniques become neutral or negative at V=8192, with BigramHash as the main complementary exception.
- Demonstrating that quantization precision is the main binding constraint, with int7 improving BPB but exceeding the artifact budget.
- Analyzing vocabulary-size and sequence-length substitutability as a joint scaling effect.