val_bpb: 1.1507
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.62 MB
Training Techniques
Quantization
mixed int5/int6 QAT
bits: 5 (MLP) / 6 (attention)
scope: MLP and attention
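The mixed-precision QAT can be sketched with a per-tensor symmetric fake-quantizer; the `scale` values below are illustrative, not taken from the report:

```python
def fake_quant(x: float, bits: int, scale: float) -> float:
    """Symmetric fake quantization: snap x to a signed `bits`-wide integer
    grid, then dequantize. During QAT the round() is wrapped in a
    straight-through estimator so gradients pass through unchanged."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

w_mlp = fake_quant(0.37, bits=5, scale=0.05)    # int5 grid for MLP weights
w_att = fake_quant(0.37, bits=6, scale=0.025)   # int6 grid for attention weights
```

Values outside the representable range are clamped to the grid's endpoints, which is what bounds the artifact's dynamic range per tensor.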
Architecture
BigramHash
Adds a bigram hash embedding/cache-like component to the model.
parameters: {"size":4096,"dim":128}
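A minimal sketch of the lookup path, using the stated size/dim; the hash mixing function is an assumption, since the report does not specify one:

```python
TABLE_SIZE, DIM = 4096, 128  # from parameters {"size": 4096, "dim": 128}

def bigram_slot(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing hash; the actual function is not specified.
    return ((prev_tok * 1000003) ^ tok) % TABLE_SIZE

def bigram_embedding(table, prev_tok, tok):
    # A learned vector keyed by the (previous, current) token pair,
    # used alongside the ordinary token embedding.
    return table[bigram_slot(prev_tok, tok)]

table = [[0.0] * DIM for _ in range(TABLE_SIZE)]
vec = bigram_embedding(table, 17, 42)
```

Hash collisions simply share a slot, which is acceptable at this table size and keeps the component's parameter count at 4096 × 128.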
SmearGate
Gating mechanism in the architecture; the name suggests it smears (mixes) adjacent token representations through a learned gate.
parameters: null
U-Net skips
Skip connections inspired by U-Net are added to the transformer blocks.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"layers":10,"heads":8,"kv_heads":4,"d_model":512}
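With these parameters the query-to-KV-head mapping is fixed; a sketch of the grouping:

```python
HEADS, KV_HEADS, D_MODEL = 8, 4, 512  # from the parameters above
HEAD_DIM = D_MODEL // HEADS           # 64
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def kv_head_for(query_head: int) -> int:
    # Consecutive query heads are mapped onto the same KV head, halving
    # the KV-cache size relative to full multi-head attention here.
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Beyond the cache saving, fewer KV heads also shrink the K/V projection matrices, which helps the artifact-size budget.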
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02}
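Muon's defining step orthogonalizes each 2-D update with a Newton-Schulz iteration before applying it at matrix_lr=0.02. A minimal numpy sketch, using the quintic coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration that pushes G's singular values
    toward 1, approximating the nearest semi-orthogonal matrix to G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm => singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

U = newton_schulz(np.diag([0.8, 0.3]))  # toy update matrix
```

The iteration does not converge exactly to 1 but lands all singular values in a band around it, which is sufficient for the optimizer's purposes.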
Weight Averaging
SWA
parameters: {"fraction":0.4}
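SWA with fraction 0.4 amounts to a parameter-wise mean over the tail of the checkpoint sequence; a sketch with a toy model:

```python
def swa_average(checkpoints, fraction=0.4):
    """Average parameter-wise over the last `fraction` of checkpoints,
    mirroring SWA over the final 40% of the warmdown phase."""
    k = max(1, int(len(checkpoints) * fraction))
    tail = checkpoints[-k:]
    return [sum(ws) / len(tail) for ws in zip(*tail)]

# Ten checkpoints of a toy 2-parameter model: the last 4 get averaged.
ckpts = [[float(i), float(2 * i)] for i in range(10)]
avg = swa_average(ckpts)
```
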
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
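Sliding-window evaluation with stride 64 re-runs overlapping windows and scores only the newly covered tokens, so each scored token sees (near) full left context. A sketch of the window bookkeeping, with toy window/stride sizes for illustration:

```python
def sliding_windows(n_tokens, window, stride):
    """Return (ctx_start, ctx_end, n_scored) spans: each window is fed to
    the model, but only tokens not covered by a previous window are
    scored, so every scored token gets up to `window` tokens of context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(n_tokens=14, window=8, stride=4)  # toy sizes
```

Every token is scored exactly once, so the per-token losses still sum to a well-defined bits-per-byte figure.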
Test-Time Training
LoRA TTT
parameters: {"rank":8}
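Per-document TTT fits a small low-rank adapter on each document at evaluation time and discards it afterwards; only the rank-8 factors are trained, the base weights stay frozen. A pure-Python sketch of the low-rank update (the alpha scaling and toy rank-1 sizes are assumptions):

```python
def matmul(X, Y):
    # Plain matrix multiply for the sketch.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def adapted_weight(W, B, A, alpha, rank):
    # Effective weight during test-time training: W + (alpha / rank) * B @ A.
    # Only B and A (the low-rank factors) receive gradients; W is frozen.
    delta = matmul(B, A)
    s = alpha / rank
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]
B, A = [[1.0], [0.0]], [[0.0, 2.0]]   # rank-1 toy factors (the report uses rank 8)
W2 = adapted_weight(W, B, A, alpha=2.0, rank=1)
```
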
Initialization
orthogonal init
Orthogonal weight initialization.
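The standard QR-based recipe for orthogonal initialization, sketched with numpy (the sign fix on diag(R) makes the draw uniform over orthogonal matrices):

```python
import numpy as np

def orthogonal_init(rows: int, cols: int, seed: int = 0) -> np.ndarray:
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q, with a
    sign correction so the result is uniformly distributed over
    orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))  # flip column signs where diag(R) < 0

W = orthogonal_init(4, 4)
```
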
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
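A warmdown schedule holds the LR constant and then decays it linearly to zero over the final warmdown_steps. The total step count and base LR below are assumptions for illustration (0.02 echoes matrix_lr from the optimizer section):

```python
TOTAL_STEPS = 10_000      # assumed; not stated in the report
WARMDOWN_STEPS = 3_000    # from parameters {"warmdown_steps": 3000}
BASE_LR = 0.02            # illustrative; matches matrix_lr above

def lr_at(step: int) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    remaining = TOTAL_STEPS - step
    return BASE_LR * min(1.0, remaining / WARMDOWN_STEPS)
```
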
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
neural cache
Neural cache used during evaluation to interpolate cached hidden-state predictions with model outputs.
parameters: null
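In the spirit of a neural cache (Grave et al.), recent (hidden state, next token) pairs are stored during evaluation, scored against the current hidden state, and the resulting cache distribution is mixed with the model's. The dot-product scoring and the mixing weight below are assumptions, since the report gives no parameters:

```python
import math

def cache_distribution(cache, query_h, vocab_size, theta=1.0):
    # cache: list of (hidden_state, next_token) pairs recorded during eval.
    # Softmax the similarity of each stored state to the current one, and
    # accumulate the probability mass onto the tokens the cache saw.
    scores = [theta * sum(q * k for q, k in zip(query_h, h)) for h, _ in cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [0.0] * vocab_size
    for (_, tok), e in zip(cache, exps):
        p[tok] += e / z
    return p

def interpolate(p_model, p_cache, lam=0.1):
    # Final next-token distribution: mix model and cache predictions.
    return [(1 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]

cache = [([1.0, 0.0], 2), ([0.0, 1.0], 1)]
p_cache = cache_distribution(cache, query_h=[1.0, 0.0], vocab_size=3)
p_mix = interpolate([0.5, 0.3, 0.2], p_cache)
```

Because both inputs are distributions, the interpolation is again a valid distribution; the cache adds mass to tokens that recently followed similar hidden states.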
Novel Contributions
- Reduced BigramHash size for reliable artifact size margin across seeds
- Mixed int5 MLP / int6 attention quantization with post-quantization roundtrip
- Stochastic Weight Averaging over the last 40% of warmdown
- Neural cache evaluation-time interpolation
- Per-document LoRA test-time training
- Quantization-aware training with STE fake quantization