val_bpb: 0.2071
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.5 MB
Training Techniques
Quantization
GPTQ
parameters: {"bits":5,"scope":"all"}
Compression
zstd
parameters: {"level":22}
Architecture Modifications
BigramHash
Added a hashed n-gram / bigram cache component with multi-order backoff and order-adaptive gating.
parameters: {"orders":"2-9","buckets":4000000,"dim":128}
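The cache implementation isn't described beyond its parameters; a minimal sketch consistent with those listed (orders 2-9, hashed buckets), with the counting scheme and the shared bucket space across orders as assumptions:

```python
from collections import defaultdict

class HashedNgramCache:
    """Hashed n-gram cache with multi-order backoff: contexts are hashed
    into a fixed bucket space, and prediction backs off from the highest
    order with observed counts down to the lowest."""

    def __init__(self, orders=range(2, 10), buckets=4_000_000):
        self.orders = sorted(orders, reverse=True)  # try high orders first
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))  # bucket -> token -> count

    def _bucket(self, context):
        # All orders share one bucket space here (a simplification).
        return hash(tuple(context)) % self.buckets

    def update(self, tokens):
        """Record next-token counts for every order along the sequence."""
        for i in range(1, len(tokens)):
            for order in self.orders:
                if i - (order - 1) >= 0:
                    ctx = tokens[i - (order - 1):i]
                    self.counts[self._bucket(ctx)][tokens[i]] += 1

    def predict(self, context):
        """Back off to the highest order with data; return token -> prob."""
        for order in self.orders:
            ctx = context[len(context) - (order - 1):]
            if len(ctx) < order - 1:
                continue  # not enough context for this order
            stats = self.counts.get(self._bucket(ctx))
            if stats:
                total = sum(stats.values())
                return {t: c / total for t, c in stats.items()}
        return {}
```

In the full system this distribution would be interpolated with the neural model's output under the order-adaptive gate.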
XSA
Applied XSA across all layers.
parameters: {"layers":11,"window_size":8}
MLP3x
Expanded the MLP with a LeakyReLU-based nonlinearity.
parameters: {"multiplier":3.5}
VE128
Added VE128 in upper layers.
parameters: {"layers":"9-10"}
Optimizer
AdamW
parameters: {"learning_rate":0.0001,"epochs":4,"weight_decay":null,"momentum":null}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
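Polyak averaging with decay 0.998 keeps an exponential moving average of the weights and evaluates with the averaged copy; a minimal sketch, with plain dicts standing in for parameter tensors:

```python
def ema_update(avg_params, params, decay=0.998):
    """One Polyak/EMA step: avg <- decay * avg + (1 - decay) * current,
    applied elementwise to every parameter."""
    return {k: decay * avg_params[k] + (1 - decay) * params[k]
            for k in avg_params}
```

With decay 0.998 the average has an effective horizon of roughly 1 / (1 - 0.998) = 500 recent steps.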
Evaluation
stride-based eval
parameters: {"stride":64}
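In stride-based evaluation, a context window slides forward by `stride` tokens and only the last `stride` tokens of each window are scored, so every token is scored exactly once with near-maximal left context. The window length isn't given in the report, so it is an argument in this sketch of which positions get scored:

```python
def stride_eval_positions(seq_len, window, stride):
    """Return (context_start, score_from, score_to) triples: each window
    covers [context_start, score_to) but only [score_from, score_to) is
    scored, so together the triples score every token exactly once."""
    scored = []
    for start in range(0, seq_len, stride):
        end = min(start + stride, seq_len)
        ctx_start = max(0, end - window)  # fill the window with left context
        scored.append((ctx_start, start, end))
    return scored
```

For example, with a hypothetical 2048-token window and stride 64, every scored token (past the first window) sees at least 2048 - 64 tokens of context.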
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.0001,"freeze_blocks":2,"chunk_tokens":131072}
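The report gives the TTT hyperparameters (4 epochs, lr 1e-4, 2 frozen blocks, 131072-token chunks) but not the loop itself; "score-first" presumably means each chunk is scored before the model adapts on it. A sketch of that control flow, with `score` and `adapt` left abstract:

```python
def score_first_ttt(model, chunks, score, adapt):
    """Score-first test-time training: each chunk is scored with the
    current weights *before* the model fine-tunes on it, so no token's
    score is inflated by having already been trained on."""
    total = 0.0
    for chunk in chunks:
        total += score(model, chunk)  # evaluate first
        model = adapt(model, chunk)   # then adapt on this chunk
    return total, model
```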
LR Schedule
adaptive cosine decay
parameters: {"adaptive_lr":true,"adaptive_lr_max":3}
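The adaptive mechanism isn't described beyond `adaptive_lr_max: 3`; one reading is a cosine-decayed base rate whose multiplier is capped at 3x. A sketch under that assumption, with the boost signal left as an external argument:

```python
import math

def cosine_lr(step, total_steps, base_lr, boost=1.0, adaptive_lr_max=3.0):
    """Cosine decay from base_lr to 0, scaled by an adaptive multiplier
    capped at adaptive_lr_max. How the boost is derived is not specified
    in the report; here it is supplied by the caller."""
    scale = min(boost, adaptive_lr_max)
    return scale * base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))
```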
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01}
pruning
parameters: {"pct":0.05,"type":"magnitude"}
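Magnitude pruning at pct 0.05 zeroes the smallest-magnitude 5% of weights; a minimal sketch over a flat weight list:

```python
def magnitude_prune(weights, pct=0.05):
    """Global magnitude pruning: zero the fraction `pct` of weights with
    the smallest absolute value."""
    k = int(len(weights) * pct)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Zeroed weights compress well, which complements the zstd stage listed above.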
Other
Order-adaptive entropy gating for n-gram cache with per-order thresholds and alpha multipliers.
parameters: {"high_order":9,"low_order":2}
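Only the order range (2-9) is given; the per-order thresholds and alpha multipliers below are placeholders. A sketch of the gate, engaging the cache only when the base model's predictive entropy exceeds the matched order's threshold:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_alpha(matched_order, model_probs, thresholds, alphas):
    """Order-adaptive entropy gate: return the cache mixing weight for the
    matched n-gram order, or 0 when the base model is already confident
    (entropy below that order's threshold)."""
    if entropy(model_probs) < thresholds[matched_order]:
        return 0.0  # model is confident; don't let the cache interfere
    return alphas[matched_order]
```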
Full-chunk cache sharing across all GPU ranks to increase n-gram data per rank.
parameters: {"ranks":8}
Adaptive temperature sharpening applied per token to compensate for under-confidence after quantization.
parameters: {"temperature":0.85}
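Temperature sharpening with T = 0.85 raises each probability to the power 1/T and renormalizes, concentrating mass on the top tokens; the per-token adaptivity isn't specified, so this sketch uses a fixed temperature:

```python
def sharpen(probs, temperature=0.85):
    """Sharpen a distribution with T < 1: p_i -> p_i^(1/T), renormalized.
    Counteracts under-confidence introduced by quantization."""
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]
```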
Online logit calibration using momentum-EMA of empirical frequency versus predicted probability.
parameters: null
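No parameters are given for the calibration; a sketch of one plausible form, tracking momentum-EMAs of observed token frequency and predicted probability and adding their log-ratio as a per-token logit correction (the momentum value is an assumption):

```python
import math

class OnlineCalibrator:
    """Online logit calibration: tokens the model systematically
    under-predicts get a positive logit correction, and vice versa."""

    def __init__(self, vocab_size, momentum=0.99, eps=1e-6):
        self.momentum, self.eps = momentum, eps
        self.freq = [1.0 / vocab_size] * vocab_size  # EMA of observed frequency
        self.pred = [1.0 / vocab_size] * vocab_size  # EMA of predicted prob

    def observe(self, token, probs):
        """Update both EMAs after seeing `token` predicted with `probs`."""
        m = self.momentum
        for i, p in enumerate(probs):
            self.freq[i] = m * self.freq[i] + (1 - m) * (1.0 if i == token else 0.0)
            self.pred[i] = m * self.pred[i] + (1 - m) * p

    def correct(self, logits):
        """Add log(freq / pred) to each logit."""
        return [l + math.log((f + self.eps) / (p + self.eps))
                for l, f, p in zip(logits, self.freq, self.pred)]
```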
5-expert Hedge mixer combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"eta":0.1}
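The five experts (neural, unigram, bigram, trigram, entropy) aren't specified further; the Hedge (multiplicative-weights) update with eta = 0.1 is standard and can be sketched as:

```python
import math

def hedge_step(expert_probs, weights, target, eta=0.1):
    """One Hedge step: mix the experts' distributions under the current
    (normalized) weights, then multiplicatively downweight each expert by
    exp(-eta * log_loss) on the token that actually occurred."""
    z = sum(weights)
    vocab = len(expert_probs[0])
    mixed = [sum((w / z) * ep[i] for w, ep in zip(weights, expert_probs))
             for i in range(vocab)]
    new_weights = []
    for w, ep in zip(weights, expert_probs):
        loss = -math.log(max(ep[target], 1e-12))  # this expert's log-loss
        new_weights.append(w * math.exp(-eta * loss))
    return mixed, new_weights
```

Experts that keep assigning high probability to the observed tokens accumulate weight, so the mixture tracks the best expert over time.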
Novel Contributions
- Order-adaptive entropy gating for n-gram cache
- Multi-order n-gram backoff cache with orders 2-9
- Full-chunk cache sharing across 8 GPU ranks
- Score-first test-time training with Polyak EMA
- Adaptive temperature sharpening
- Online logit calibration
- 5-expert Hedge mixer
- CROWN-Q plus GPTQ int5 with pruning and zstd compression