PR #1760 (open)
Non-record: SP8192 + dim=464 + Pre-Quantization TTT + Brotli (1.1863 BPB)
by BrandtChristian
val_bpb: 1.1863
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.92 MB
Training Techniques
Architecture
BigramHash
Uses a bigram hash component in the model stack.
parameters: {"size":1536}
XSA
Applies XSA in the last 4 layers of the stack.
parameters: {"last_n_layers":4}
depth recurrence
Reapplies selected layers in a loop (depth recurrence with shared weights).
parameters: {"layers":[3,4,5],"loops":2}
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5,"mlp_multiplier":3}
parallel residuals
Introduces parallel residual connections starting from a later layer.
parameters: {"start_layer":7}
Quantization
QAT
bits: 6
scope: all layers
int8
bits: 8
scope: embeddings
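The PR quantizes all layers to INT6 via QAT and embeddings to INT8. A minimal sketch of the symmetric per-tensor fake-quantization used in a QAT forward pass (the exact scheme, e.g. per-channel scales, is not specified in the PR):

```python
def fake_quant(w, bits=6):
    # Symmetric fake quantization: round each weight to one of
    # 2^(bits-1) - 1 levels per sign in the forward pass. In QAT, a
    # straight-through estimator passes gradients through the rounding.
    levels = 2 ** (bits - 1) - 1          # 31 for INT6, 127 for INT8
    max_abs = max(abs(v) for v in w)
    if max_abs == 0.0:
        return list(w)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in w]
```

Training against the quantized forward pass lets the weights adapt to the INT6 grid before the final export, which is what keeps the 15.92 MB artifact's BPB close to the float model's.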
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"row_normalize":true,"momentum_warmup_start":0.92,"momentum_warmup_steps":500}
Adam
weight_decay: 0.04
momentum: null
other_params: null
Compression
brotli
level: null
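The contributions list pairs Brotli with a byte-shuffle transform. A sketch of the shuffle, assuming a stride of 2 (e.g. for 16-bit weight words; the actual stride is not given). Grouping same-significance bytes together yields more uniform streams that entropy coders compress better:

```python
def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    # Regroup bytes by their position within each stride-byte word, so
    # e.g. all high bytes of 16-bit values become contiguous.
    assert len(data) % stride == 0
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    # Inverse transform, applied when loading the artifact.
    chunk = len(data) // stride
    out = bytearray(len(data))
    for i in range(stride):
        out[i::stride] = data[i * chunk:(i + 1) * chunk]
    return bytes(out)

# The artifact would then be compressed roughly as
#   brotli.compress(byte_shuffle(weight_bytes))
# using the third-party `brotli` package (compression level unspecified).
```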
Test-Time Training
full TTT
parameters: {"epochs":7,"learning_rate":0.0005,"pre_quantization":true}
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Novel Contributions
- Pre-quantization TTT on the full validation set before INT6 quantization
- Scaling-law exploration showing improved roundtrip BPB with additional pre-quantization TTT epochs
- Brotli plus byte-shuffle artifact compression
- SP8192-based architecture with BigramHash, XSA, depth recurrence, and parallel residuals
- INT6 QAT for all layers with INT8 embeddings
- EMA + SWA weight averaging