val_bpb: 1.0745
Architecture: Transformer
Optimizer: AdamW
Artifact Size: <15.5 MB
Training Techniques
Quantization
- GPTQ (bits: 5, scope: all)
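A minimal sketch of what 5-bit symmetric quantization looks like. This is a plain round-to-nearest stand-in: real GPTQ additionally chooses the rounding per weight to minimize each layer's output error on calibration data, but the storage format is the same idea.

```python
import numpy as np

BITS = 5  # bit width from the parameters above
QMAX = 2 ** (BITS - 1) - 1  # symmetric signed int5 range: [-16, 15]

def quantize_int5(w):
    # Round-to-nearest symmetric per-tensor quantization (illustrative,
    # not GPTQ's error-compensating rounding).
    scale = float(np.max(np.abs(w))) / QMAX
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_int5(w)
err = float(np.max(np.abs(dequantize(q, scale) - w)))
```

The round-trip error of this scheme is bounded by half a quantization step (`scale / 2`); zstd then compresses the resulting int5 tensors further.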
Architecture
- BigramHash: hashed bigram table, used as one of the model additions feeding the 5-expert context mixer. parameters: {"size":6144,"dim":128}
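A minimal sketch of a hashed bigram table with the sizes above. The hash function and its constants are illustrative assumptions; the submission's actual hashing scheme is not specified.

```python
import numpy as np

SIZE, DIM = 6144, 128  # table size and embedding dim from the parameters above

rng = np.random.default_rng(0)
table = rng.standard_normal((SIZE, DIM)).astype(np.float32) * 0.02  # learned in practice

def bigram_slot(prev_tok: int, tok: int) -> int:
    # Illustrative multiplicative hash of the (previous, current) token pair;
    # collisions are tolerated because the table is trained end to end.
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h & 0xFFFFFFFF) % SIZE

def bigram_embedding(prev_tok: int, tok: int) -> np.ndarray:
    # Look up the learned embedding for the hashed bigram bucket.
    return table[bigram_slot(prev_tok, tok)]

emb = bigram_embedding(17, 42)
```

Hashing keeps the table at a fixed 6144 × 128 size regardless of vocabulary, which matters under the <15.5 MB artifact budget.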
- XSA: applied across all layers. parameters: {"layers":11,"window_size":8}
- Partial RoPE: rotary positional embeddings applied to a subset of each head's dimensions (16 of 64). parameters: {"dimensions":"16/64"}
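A sketch of partial RoPE under the "16/64" setting: only the first 16 of 64 head dimensions are rotated and the rest pass through unchanged. The frequency base (10000) is the standard RoPE choice and an assumption here.

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16  # rotate 16 of 64 dimensions ("16/64")

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    # x: (seq, HEAD_DIM), pos: (seq,) integer positions.
    half = ROT_DIM // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))  # standard RoPE frequencies
    ang = pos[:, None] * freqs[None, :]                  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]             # rotated slice, split in two
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROT_DIM:]], axis=-1)  # rest is untouched

x = np.random.default_rng(1).standard_normal((8, HEAD_DIM))
y = partial_rope(x, np.arange(8))
```

Because the rotation is orthogonal, per-position norms are preserved, and the untouched 48 dimensions carry position-independent content.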
- MLP3x: three-layer MLP with a squared-LeakyReLU activation. parameters: {"activation":"LeakyReLU(0.5)^2"}
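A sketch of the activation and block shape, assuming "LeakyReLU(0.5)^2" means a leaky ReLU with negative slope 0.5 followed by squaring; whether the sign is restored after squaring is not stated, so this sketch squares directly.

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    # "LeakyReLU(0.5)^2": leaky ReLU with negative slope 0.5, then squared.
    y = np.where(x >= 0.0, x, slope * x)
    return y * y

def mlp3x(x, w1, w2, w3):
    # Three stacked linear layers with the activation after the first two.
    h = sq_leaky_relu(x @ w1)
    h = sq_leaky_relu(h @ w2)
    return h @ w3

rng = np.random.default_rng(0)
d = 16  # illustrative width; the real hidden sizes are not given
w1, w2, w3 = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = mlp3x(rng.standard_normal((4, d)), w1, w2, w3)
```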
- VE128: enabled in the later layers. parameters: {"layers":[9,10]}
Weight Averaging
- EMA. parameters: {"decay":0.997}
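The EMA update itself is one line per parameter; a minimal sketch with the decay above:

```python
import numpy as np

DECAY = 0.997  # decay from the parameters above

def ema_update(ema_params, params, decay=DECAY):
    # Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current.
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

params = {"w": np.ones(4)}
ema = {"w": np.zeros(4)}
for _ in range(10):
    ema = ema_update(ema, params)
```

Starting from zero and tracking a constant target, the average reaches `1 - decay**n` after `n` steps, so a 0.997 decay averages over roughly the last few hundred updates.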
Compression
- zstd (level: 22)
Evaluation
- sliding-window evaluation. parameters: {"stride":32,"seq_len":2048}
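A sketch of sliding-window scoring with the stride and window above. `nll_fn` stands in for the model and its interface (one negative log-likelihood per token of the window) is an assumption; the point is that each token is scored exactly once, with up to `seq_len - 1` tokens of context.

```python
import numpy as np

SEQ_LEN, STRIDE = 2048, 32  # from the parameters above

def sliding_window_nll(tokens, nll_fn, seq_len=SEQ_LEN, stride=STRIDE):
    # Advance the window by `stride` and count only the not-yet-scored tail,
    # so no token is scored twice and most tokens get near-full context.
    losses, scored, n = [], 0, len(tokens)
    while scored < n:
        end = min(scored + stride, n)
        start = max(0, end - seq_len)
        nll = nll_fn(tokens[start:end])
        losses.extend(nll[scored - start:])  # only tokens not scored yet
        scored = end
    return float(np.mean(losses))

loss = sliding_window_nll(list(range(100)), lambda w: [1.0] * len(w))
```

The small stride trades compute (each token is re-processed up to `seq_len / stride` times) for a better val_bpb than chunked evaluation.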
Test-Time Training
- score-first TTT. parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"epochs":3,"polyak_decay":0.998,"frozen_blocks":9}
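A sketch of the score-first control flow, on the reading that each chunk is scored before the model trains on it, so no chunk is ever evaluated by weights that have already seen it. The toy model here is an illustration only; the real pipeline runs a transformer on 131072-token chunks with the first 9 blocks frozen, which this sketch does not model.

```python
import numpy as np

LR, EPOCHS, POLYAK = 1e-4, 3, 0.998  # from the parameters above

def score_first_ttt(chunks, score_fn, train_step, params):
    # Score each chunk BEFORE training on it; predictions use the
    # Polyak-averaged weights while the raw weights keep adapting.
    avg = dict(params)
    scores = []
    for chunk in chunks:
        scores.append(score_fn(avg, chunk))  # score first...
        for _ in range(EPOCHS):              # ...then adapt on the same chunk
            params = train_step(params, chunk, LR)
        avg = {k: POLYAK * avg[k] + (1 - POLYAK) * params[k] for k in params}
    return scores, params

# Toy stand-ins: fit a scalar mean with squared error.
def score(p, chunk):
    return float(np.mean((np.asarray(chunk) - p["w"]) ** 2))

def step(p, chunk, lr):
    grad = -2.0 * float(np.mean(np.asarray(chunk) - p["w"]))
    return {"w": p["w"] - lr * grad}

scores, _ = score_first_ttt([[1.0], [1.0]], score, step, {"w": 0.0})
```

Note the first chunk's score is identical to the frozen model's, and later chunks benefit from adaptation, which is what keeps the evaluation honest.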
Sequence Length
- train_length: 131072
- eval_length: 2048
LR Schedule
- cosine decay. parameters: {"adaptive_lr_max_mult":3}
Regularization
- layerwise LN scale. parameters: {"formula":"1/sqrt(layer+1)"}
Other
- 5-expert Hedge (multiplicative-weights) logistic context mixer blending neural, unigram, bigram, trigram, and entropy experts in log-probability space. parameters: {"eta":0.1}
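A minimal sketch of a Hedge mixer in log-probability space, reduced to two experts and a two-token vocabulary for illustration; the real mixer runs five experts, vectorized on GPU, with its n-gram experts' tables built incrementally from already-scored tokens.

```python
import numpy as np

ETA = 0.1  # Hedge learning rate from the parameters above

def logsumexp(a, axis=0):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis,
                                        keepdims=True)), axis=axis)

def mix_logprobs(log_w, expert_logps):
    # Blend experts in log space: the mixture's log-prob over the vocab is a
    # log-sum-exp of (normalized log-weight + expert log-prob).
    lw = log_w - logsumexp(log_w)  # normalize the weights
    return logsumexp(lw[:, None] + expert_logps, axis=0)

def hedge_update(log_w, expert_logps, token, eta=ETA):
    # Multiplicative-weights / Hedge step: each expert's log-weight moves by
    # eta times the log-probability it assigned to the observed token.
    return log_w + eta * expert_logps[:, token]

# Two experts, two-token vocab: a confident expert and a uniform one.
expert_logps = np.log(np.array([[0.9, 0.1],
                                [0.5, 0.5]]))
log_w = np.zeros(2)
mixed = mix_logprobs(log_w, expert_logps)           # equal-weight mixture
log_w = hedge_update(log_w, expert_logps, token=0)  # token 0 observed
```

Working entirely in log space keeps the update numerically stable, and the multiplicative weights let the mixer shift trust between the neural model and the cheap n-gram experts as the text evolves.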
Novel Contributions
- 5-expert Hedge-based logistic context mixer
- Online GPU-vectorized context mixing in log-probability space
- Incremental n-gram tables built only from already-scored tokens
- Score-first test-time training pipeline
- GPTQ-calibrated model with int5 quantization and zstd compression