- val_bpb (validation bits per byte): 1.0807
- Architecture: Transformer
- Optimizer: AdamW
- Artifact Size: ~15.8 MB
Training Techniques
Test-Time Training
- full TTT (per-block adaptive learning rates; see the sketch after this entry)
- parameters: {"timing": "pre-quantization", "epochs": 10, "freeze": 0, "per_block_lr_scaling": {"start": 0.3, "end": 1.0, "interpolation": "linear"}}
Optimizer
- AdamW: weight_decay: null, momentum: null, other_params: {"phase": "pre-quant TTT"}
- Muon: weight_decay: null, momentum: null, other_params: {"parallel": true} (a parameter-split sketch follows this list)
Quantization
- GPTQ: bits: 6, scope: all (a simplified quantization sketch follows)
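For orientation, here is a minimal symmetric round-to-nearest 6-bit weight quantizer. It is not GPTQ itself (GPTQ additionally compensates quantization error column by column using calibration-data statistics, with the damp=0.005 value from the contributions list as its dampening factor), but it shows what 6-bit quantization of all weights means in practice.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 6):
    """Symmetric per-row round-to-nearest quantization (simplified stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit signed values
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                   # 6-bit values stored in int8 containers here

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(256, 256)
q, s = quantize_rtn(w, bits=6)
print((dequantize(q, s) - w).abs().mean())           # mean quantization error
```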
Architecture
- BigramHash: base model uses a 2048x128 BigramHash; parameters: {"dimensions": 2048, "width": 128} (see the sketch after this list)
- XSA: base model uses the XSA-all attention variant
- GQA: base model uses grouped-query attention
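The entries above do not spell out how BigramHash works. One plausible reading, and an assumption here, is a hashed bigram embedding: each (previous token, current token) pair is hashed into one of 2048 buckets ("dimensions") and looked up in a 128-wide ("width") table whose output can be added to the regular token embedding. The hash function and the no-predecessor handling below are placeholders.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hashed bigram embedding: an assumed sketch of the BigramHash 2048x128 layer."""
    def __init__(self, n_buckets: int = 2048, width: int = 128):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, width)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer ids; pair each token with its predecessor
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                                    # no predecessor at the first position
        # cheap multiplicative hash of the (prev, current) pair into n_buckets
        h = (prev * 1000003 + tokens * 8191) % self.n_buckets
        return self.table(h)                              # (batch, seq, width)

emb = BigramHashEmbedding()
x = torch.randint(0, 50257, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 128])
```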
LR Schedule
- warmdown; parameters: {"warmdown_steps": 4000} (an assumed schedule shape is sketched below)
Evaluation
- sliding window eval; parameters: {"stride": 64} (a sketch of the scoring loop follows)
Novel Contributions
- Discriminative test-time training with per-block adaptive learning rates
- Linear LR interpolation across transformer blocks from 0.3x to 1.0x
- Freeze=0 all-block adaptation during pre-quant TTT
- Coprime-stride multi-shard data loader with weighted random shard sampling (see the sketch after this list)
- Improved configuration using QK_GAIN=5.0, WARMDOWN=4000, and GPTQ damp=0.005
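The coprime-stride loader in the list above is not specified further; the sketch below is one assumed reading. Each shard is walked with a stride chosen coprime to its length, so the walk visits every start offset before repeating, and the shard for each block is drawn by weighted random sampling, here weighted by shard size. The stride heuristic and block assembly are placeholders.

```python
import math
import random

class CoprimeStrideLoader:
    """Assumed sketch: weighted random shard choice plus a coprime-stride walk per shard."""
    def __init__(self, shards, block_size, seed=0):
        # shards: list of 1-D token sequences (e.g. numpy memmaps); weights ~ shard length
        self.shards = shards
        self.block_size = block_size
        self.weights = [len(s) for s in shards]
        self.rng = random.Random(seed)
        self.strides = [self._coprime_stride(len(s)) for s in shards]
        self.cursors = [0] * len(shards)

    def _coprime_stride(self, n):
        # pick a stride near n * golden ratio that is coprime to n, so successive
        # start offsets cycle through the whole shard before repeating
        stride = max(1, int(n * 0.618))
        while math.gcd(stride, n) != 1:
            stride += 1
        return stride

    def next_block(self):
        i = self.rng.choices(range(len(self.shards)), weights=self.weights)[0]
        shard, n = self.shards[i], len(self.shards[i])
        start = self.cursors[i]
        self.cursors[i] = (start + self.strides[i]) % n
        # wrap around the shard end for simplicity in this sketch
        idx = [(start + j) % n for j in range(self.block_size)]
        return [shard[j] for j in idx]
```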