val_bpb
1.1455
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.94 MB
Training Techniques
Quantization
mixed int5/int6
bits: 5 (MLP weights), 6 (attention weights)
scope: MLP weights (int5), attention weights (int6)
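A minimal sketch of what int5 versus int6 storage implies, assuming symmetric per-row quantization with a single float scale per row. The card does not state the actual scheme, so `quantize_symmetric`, `dequantize`, and the sample rows are illustrative only.

```python
def quantize_symmetric(row, bits):
    """Symmetric quantization of one weight row to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clip to the symmetric range [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


# Stand-in rows (not real model weights).
mlp_row = [0.30, -0.12, 0.05, -0.31]
attn_row = [0.30, -0.12, 0.05, -0.31]

q5, s5 = quantize_symmetric(mlp_row, bits=5)    # MLP weights: int5
q6, s6 = quantize_symmetric(attn_row, bits=6)   # attention weights: int6
```

Under this assumption the int6 rows have roughly half the step size of the int5 rows, so attention weights keep finer resolution while MLP weights cost fewer bits.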
Architecture
SmearGate
Learned per-dimension gate blending each token embedding with the previous token's embedding to inject bigram context.
parameters: {"params":512}
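A minimal sketch of the gate described above, assuming a sigmoid over 512 learned logits and a convex blend between the current and previous token embedding. The exact blend formula and the handling of the first token are assumptions; the card only gives the description and the parameter count.

```python
import math


def smear_gate(x, gate_logits):
    """Blend each token embedding with its predecessor.

    x: list of T token embeddings, each a list of D floats.
    gate_logits: D learned scalars (D = 512 in the card).
    """
    g = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = [list(x[0])]  # assumption: the first token has no predecessor and passes through
    for t in range(1, len(x)):
        out.append([(1.0 - g[d]) * x[t][d] + g[d] * x[t - 1][d]
                    for d in range(len(g))])
    return out


# Tiny example with D = 2; zero logits give an equal blend (gate = 0.5).
blended = smear_gate([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```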
BigramHash
2048-bucket hashed bigram embedding table projected into model dimension.
parameters: {"buckets":2048,"dimension":64,"projection_dim":512}
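A minimal sketch of a hashed bigram lookup with the sizes listed above (2048 buckets, 64-dimensional table, projection to 512). The hash function, the random initialization, and the names `bigram_bucket` and `bigram_embedding` are hypothetical; the card does not specify them.

```python
import random

BUCKETS, DIM, MODEL_DIM = 2048, 64, 512


def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    """Hypothetical hash: mix the two token ids and fold into the bucket range."""
    return (prev_id * 1000003 + cur_id) % buckets


rng = random.Random(0)
# Stand-ins for the learned embedding table and projection matrix.
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]
proj = [[rng.gauss(0.0, 0.02) for _ in range(MODEL_DIM)] for _ in range(DIM)]


def bigram_embedding(prev_id, cur_id):
    """Look up the bucket's 64-d vector and project it to the model dimension."""
    e = table[bigram_bucket(prev_id, cur_id)]
    return [sum(e[i] * proj[i][j] for i in range(DIM)) for j in range(MODEL_DIM)]


vec = bigram_embedding(5, 7)
```

Different bigrams can collide in the same bucket; that is the trade-off that keeps the table at 2048 rows instead of one row per token pair.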
MLP3x
Expanded MLP hidden size to 3x model dimension with relu^2 activation.
parameters: {"multiplier":3,"hidden_size":1536}
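A minimal sketch of an MLP block with a relu^2 activation, shown at toy size. The absence of bias terms and the helper names are assumptions; in the card's configuration the hidden size is 3 x 512 = 1536.

```python
def relu_squared(v):
    """relu^2 activation: zero for negative inputs, v**2 otherwise."""
    return max(v, 0.0) ** 2


def mlp_forward(x, w_in, w_out):
    """x: D floats; w_in: D x H matrix; w_out: H x D matrix (H = 3 * D in the card)."""
    hidden = [relu_squared(sum(x[i] * w_in[i][h] for i in range(len(x))))
              for h in range(len(w_in[0]))]
    return [sum(hidden[h] * w_out[h][j] for h in range(len(hidden)))
            for j in range(len(w_out[0]))]


# Toy example with D = 2 and H = 3 * D = 6.
w_in = [[1.0] * 6, [0.0] * 6]
w_out = [[1.0, 2.0] for _ in range(6)]
y = mlp_forward([1.0, -1.0], w_in, w_out)
```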
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention: 8 query heads share 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
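A small sketch of how 8 query heads could map onto 4 KV heads. Contiguous grouping (heads 0 and 1 share KV head 0, and so on) is an assumption; the card gives only the two head counts.

```python
HEADS, KV_HEADS = 8, 4
GROUP_SIZE = HEADS // KV_HEADS  # query heads per KV head


def kv_head_for(query_head):
    """Index of the KV head that a given query head reads from."""
    return query_head // GROUP_SIZE


mapping = [kv_head_for(h) for h in range(HEADS)]
```

With these counts, the key and value projections are half the size they would be with one KV head per query head.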
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmup_momentum_start":0.92,"warmup_momentum_end":0.99,"warmup_steps":1500}
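A minimal sketch of the momentum warmup implied by these parameters, assuming linear interpolation from 0.92 to 0.99 over the first 1500 steps and a constant value afterwards. The interpolation shape and the name `muon_momentum` are assumptions.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given training step under a linear warmup."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```

For example, `muon_momentum(750)` is about 0.955, halfway between the two endpoints.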
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Test-Time Training
full TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs":2}
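A toy sketch of the SGD-with-momentum update using the listed learning rate and momentum. It uses the common convention of accumulating the gradient into a velocity buffer and runs on a stand-in quadratic objective; the real loss, the data used for adaptation, and the steps per epoch are not given in the card and are hypothetical here.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """One in-place SGD update with a momentum (velocity) buffer."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] + grads[i]
        params[i] -= lr * velocity[i]


# Stand-in objective: 0.5 * (p - 1)^2, whose gradient is (p - 1).
params, velocity = [0.0], [0.0]
EPOCHS = 2                 # from the card
STEPS_PER_EPOCH = 100      # hypothetical
for _ in range(EPOCHS):
    for _ in range(STEPS_PER_EPOCH):
        grads = [params[0] - 1.0]
        sgd_momentum_step(params, grads, velocity)
```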
Weight Averaging
SWA
parameters: {"checkpoints":30,"interval_steps":50}
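A minimal sketch of uniform weight averaging over 30 checkpoints, written as a running mean so only one averaged copy is kept in memory. The checkpoint values are stand-ins, and uniform weighting is an assumption.

```python
def update_running_average(avg, new, count):
    """Fold one more checkpoint into the uniform average of `count` earlier ones."""
    return [(a * count + w) / (count + 1) for a, w in zip(avg, new)]


# Stand-in checkpoints: in training, one would be captured every 50 steps.
checkpoints = [[float(k), 2.0 * k] for k in range(30)]

avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = update_running_average(avg, ckpt, n)
```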
Initialization
OrthoInit
Orthogonal initialization for large matrices with muP scaling.
Regularization
weight decay
parameters: {"value":0.04}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
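A minimal sketch of how the evaluation windows could be laid out, assuming each window after the first advances by the stride and only the tokens not scored by an earlier window count toward the loss. The function name and the tuple layout are hypothetical.

```python
def sliding_windows(n_tokens, context_length=2048, stride=64):
    """Return (window_start, window_end, score_start) tuples.

    Tokens in [score_start, window_end) are scored; earlier tokens in the
    window only provide context.
    """
    windows = []
    scored_until = 0
    end = min(context_length, n_tokens)
    while True:
        start = max(0, end - context_length)
        windows.append((start, end, scored_until))
        scored_until = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return windows


layout = sliding_windows(2048 + 128)
```

With stride 64 and context length 2048, every scored token after the first window sees at least 1984 tokens of preceding context, at the cost of roughly 32 forward passes per 2048 tokens.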
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iterations":3000}
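A minimal sketch of a warmdown schedule, assuming the learning rate stays at its base value and then decays linearly to zero over the final 3000 iterations. The linear shape and the `total_steps` argument are assumptions; the card does not state the total iteration count.

```python
def lr_scale(step, total_steps, warmdown_iterations=3000):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_iterations
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iterations)


# Hypothetical run length of 10000 steps, for illustration only.
halfway = lr_scale(8500, total_steps=10000)
```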
Compression
zstd
level: 22
Novel Contributions
- 11-layer transformer, made affordable within the artifact budget by mixed-precision quantization savings
- Mixed int5 MLP and int6 attention quantization to fit within the artifact budget
- SmearGate for injecting previous-token context
- BigramHash embedding for learned bigram context
- Full-model SGD test-time training to improve validation BPB
- Stochastic Weight Averaging over 30 checkpoints
- OrthoInit with muP scaling for stable training
- Sliding-window evaluation with stride 64