val_bpb: 0.2532
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~11.4 MB
Training Techniques
Architecture
- ReLU²: squared ReLU activation used in the model.
- LeakyReLU: Leaky ReLU squared activation variant used in the model.
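A minimal sketch of the two activations above. The negative-branch handling of the squared Leaky ReLU (sign-preserving square, slope 0.01) is an assumption, since the card does not record it:

```python
def relu_squared(x: float) -> float:
    # ReLU²: clamp negatives to zero, then square.
    return max(x, 0.0) ** 2

def leaky_relu_squared(x: float, slope: float = 0.01) -> float:
    # Hypothetical squared-leaky variant: keep a small signed slope for
    # negative inputs, then square while preserving the sign so the
    # negative region still carries gradient signal.
    y = x if x > 0 else slope * x
    return y * abs(y)  # sign-preserving square
```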
- HybridNorm: mixed pre-norm and post-norm scheme, with post-norm in deeper layers.
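One way to express the pre/post split described above. The 0.5 depth threshold is a placeholder; the card does not say where the switch happens:

```python
def norm_placement(layer_idx: int, n_layers: int,
                   post_norm_start: float = 0.5) -> str:
    """Return 'pre' or 'post' normalization for a given layer.

    Hypothetical rule: shallow layers keep pre-norm (training
    stability), layers past a depth fraction switch to post-norm.
    The 0.5 fraction is an assumed hyperparameter, not a reported one.
    """
    return "post" if layer_idx >= int(n_layers * post_norm_start) else "pre"
```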
- SmearGate: combined with BigramHash for token mixing.
- BigramHash: bigram hash / embedding component used in the architecture.
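A speculative sketch of the two mixing components above, under the common readings of these names: a "smear" gate blends in a gated fraction of the previous token's embedding, and a bigram hash maps (prev, cur) token pairs into a fixed-size auxiliary embedding table. Table size and gating form are assumptions:

```python
import hashlib

N_BUCKETS = 1 << 16  # hypothetical bigram-table size, not from the card

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    # Hash the (prev, cur) token-id pair into a bounded embedding table.
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % N_BUCKETS

def smear(prev_emb, cur_emb, gate):
    # SmearGate-style mixing: add a per-channel gated fraction of the
    # previous token's embedding to the current one.
    return [c + g * p for c, g, p in zip(cur_emb, gate, prev_emb)]
```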
- Differential Attention: attention computed as the difference of two softmax attention maps, which suppresses common-mode attention noise.
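The core of differential attention can be sketched as subtracting a second, λ-scaled softmax map from the first (in the style of the DIFF Transformer). The λ value here is a placeholder, not the model's learned value:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def diff_attention_weights(scores1, scores2, lam=0.5):
    """Differential attention weights: softmax(s1) - lam * softmax(s2).
    The two score maps come from two independent q/k projections;
    lam is a learned scalar in practice (0.5 is illustrative)."""
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [x - lam * y for x, y in zip(a1, a2)]
```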
- WaveletGPT: wavelet-based GPT architectural variant.
- VGA: VGA architectural component included in the model.
- Multi-Token Prediction: auxiliary multi-token prediction heads used during training.
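For the multi-token prediction heads, each position predicts the next k tokens rather than only the next one; the auxiliary losses are summed into the training objective. The value k=3 below is illustrative, not from the card:

```python
def multi_token_targets(tokens, k=3):
    """For each position t, the auxiliary heads' targets are tokens
    t+1 .. t+k. Positions near the end of the sequence simply get
    fewer targets. k is a hypothetical horizon."""
    out = []
    for t in range(len(tokens)):
        out.append(tokens[t + 1 : t + 1 + k])
    return out
```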
Optimizer
- Muon (weight decay, momentum, and other hyperparameters not recorded)
Quantization
- QAT (bits: 6, scope: all)
- GPTQ (bits: not recorded, scope: all)
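The int6 QAT step can be sketched as symmetric per-tensor fake quantization: the forward pass sees quantize-dequantized weights while gradients flow straight through (STE). The per-tensor symmetric scheme is an assumption; only the bit width is recorded:

```python
def fake_quant(w, bits=6):
    """Quantize-dequantize a weight list to signed `bits`-wide integers
    with a symmetric per-tensor scale. During QAT the forward pass uses
    these values; the backward pass treats the op as identity (STE)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid zero scale
    return [round(x / scale) * scale for x in w]
```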
Regularization
- Magnitude pruning (sparsity: 2%)
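Magnitude pruning at 2% sparsity amounts to zeroing the smallest-magnitude 2% of weights. A minimal sketch (ties at the threshold may prune slightly more than the target fraction):

```python
def magnitude_prune(w, sparsity=0.02):
    """Zero out the smallest-magnitude `sparsity` fraction of weights
    (2% per the card). Global, unstructured pruning is assumed."""
    k = int(len(w) * sparsity)
    if k == 0:
        return list(w)
    threshold = sorted(abs(x) for x in w)[k - 1]
    return [0.0 if abs(x) <= threshold else x for x in w]
```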
Other
- OptRot: Hadamard rotation applied before quantization to improve error distribution.
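The idea behind the pre-quantization rotation: multiplying weights by an orthogonal Hadamard matrix spreads outliers evenly across coordinates, so a uniform quantizer wastes less range. A sketch using the Sylvester construction (dimension must be a power of two); the card does not detail OptRot's exact procedure:

```python
def hadamard(n):
    """Sylvester Hadamard matrix; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H

def rotate(vec):
    """Multiply by H / sqrt(n): an orthogonal (norm-preserving)
    transform that smears outliers across all coordinates."""
    n = len(vec)
    H = hadamard(n)
    s = n ** 0.5
    return [sum(h * v for h, v in zip(row, vec)) / s for row in H]
```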
Compression
- zlib (level: not recorded)
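The final artifact pass is a plain zlib compression of the serialized weights. Level 9 below is an assumption, since the card does not record the level:

```python
import zlib

def pack_artifact(weight_bytes: bytes, level: int = 9) -> bytes:
    # Compress the serialized model to shrink the on-disk artifact.
    # level=9 (max compression) is assumed, not reported.
    return zlib.compress(weight_bytes, level)
```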
Test-Time Training
- LoRA TTT (rank: 8)
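Per-document LoRA test-time training adapts only two small low-rank factors per weight matrix while the base weights stay frozen. Rank 8 is from the card; the alpha scaling is an assumed convention:

```python
def lora_delta(A, B, alpha=16.0, rank=8):
    """Low-rank weight update: delta_W = (alpha / rank) * A @ B, with
    A of shape (d_out, r) and B of shape (r, d_in). At test time only
    A and B are fitted on the current document. alpha=16 is a guess."""
    scale = alpha / rank
    d_out, d_in = len(A), len(B[0])
    return [[scale * sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(d_in)] for i in range(d_out)]
```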
Evaluation
- n-gram cache (orders: 2-10, buckets: 4,000,000, entropy-adaptive alpha)
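A sketch of the n-gram cache, reduced to a single order for brevity (the card uses orders 2-10). Contexts hash into 4M count buckets, and the cache is interpolated with the base model probability using a weight that shrinks as the cache distribution's entropy grows; the exact adaptive rule is not documented, so the one below is an assumption:

```python
import math
from collections import defaultdict

BUCKETS = 4_000_000  # bucket count from the card

class NGramCache:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, context, token):
        # Hash the context tuple into a fixed number of buckets.
        self.counts[hash(context) % BUCKETS][token] += 1

    def prob(self, context, token, base_p, max_alpha=0.3):
        """Blend base model prob with the cached n-gram distribution.
        alpha shrinks with cache entropy (assumed rule): a peaked,
        confident cache gets more weight than a diffuse one."""
        dist = self.counts[hash(context) % BUCKETS]
        total = sum(dist.values())
        if total == 0:
            return base_p
        ps = [c / total for c in dist.values()]
        entropy = -sum(p * math.log(p) for p in ps)
        alpha = max_alpha / (1.0 + entropy)
        return (1 - alpha) * base_p + alpha * dist[token] / total
```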
- kNN-LM (projection: 1024 -> 64, storage: fp16)
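For the kNN-LM datastore, hidden states are shrunk with a fixed random projection (1024 -> 64, per the card) and stored as fp16 keys. A sketch using the stdlib `struct` half-precision format; the Gaussian projection is a standard Johnson-Lindenstrauss choice, assumed rather than documented:

```python
import random
import struct

def make_projection(d_in=1024, d_out=64, seed=0):
    # Fixed random Gaussian projection matrix (d_out x d_in),
    # scaled to roughly preserve distances.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) / d_out ** 0.5 for _ in range(d_in)]
            for _ in range(d_out)]

def project_fp16(vec, P):
    # Project a hidden state to the low dimension and serialize it as
    # IEEE half precision ('e' format) to halve datastore key size.
    low = [sum(p * v for p, v in zip(row, vec)) for row in P]
    return struct.pack(f"{len(low)}e", *low)
```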
- TurboQuant KV cache compression (bits: 3)
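Only the 3-bit width of the KV cache compression is recorded; TurboQuant's actual codebook is not described here. As a stand-in, a per-vector uniform 3-bit quantizer (8 levels spanning each entry's min-max range):

```python
def quant3(v):
    """Per-vector 3-bit uniform quantization of one KV cache entry:
    map each value to one of 8 levels over [min, max]. Illustrative
    stand-in, not TurboQuant's exact scheme."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 7 or 1.0            # 7 intervals -> 8 codes
    codes = [round((x - lo) / scale) for x in v]   # ints in 0..7
    dequant = [lo + c * scale for c in codes]
    return codes, dequant
```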
Weight Averaging
- EMA (parameters not recorded)
- SWA (parameters not recorded)
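The two averaging schemes in one sketch: EMA keeps an exponentially decayed shadow copy of the weights, while SWA keeps a running mean over sampled checkpoints. The decay value is a typical default, not the reported one:

```python
def ema_update(ema, w, decay=0.999):
    # Exponential moving average of weights; decay=0.999 is a common
    # default, assumed here since the card records no value.
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

def swa_update(mean, w, n):
    # Stochastic weight averaging: incremental mean after having
    # already averaged n checkpoints.
    return [m + (x - m) / (n + 1) for m, x in zip(mean, w)]
```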
Novel Contributions
- Combines many techniques into a single env-var-toggleable pipeline
- Uses int6 QAT from step 0 with GPTQ and pruning to fit under the artifact limit
- Applies per-document LoRA test-time training
- Adds entropy-adaptive n-gram backoff caching
- Adds kNN-LM with random projection and fp16 storage
- Uses TurboQuant KV cache compression
- Reports strong ablation results across three seeds