val_bpb: 0.7227
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.45 MB
Training Techniques
Architecture
- MLP3x: an MLP block with a 3x hidden expansion and ReLU-squared activation (no recorded parameters).
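A minimal numpy sketch of such a block, assuming "3x" means the hidden layer is three times the model width (the dimensions and weight scales below are illustrative, not the submission's):

```python
import numpy as np

def relu_squared(x):
    # ReLU-squared: max(x, 0)^2, a cheap smooth activation
    return np.maximum(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    # Two-layer MLP: expand d_model -> 3*d_model, activate, project back
    return relu_squared(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d = 8
w_in = rng.standard_normal((d, 3 * d)) * 0.1
w_out = rng.standard_normal((3 * d, d)) * 0.1
y = mlp3x_forward(rng.standard_normal((4, d)), w_in, w_out)
```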
- SmearGate: adds a SmearGate to the model architecture (no recorded parameters).
- BigramHash: hashed bigram features for token-pair interactions (table size: 2048).
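A sketch of the idea: hash each (previous, current) token pair into a fixed-size embedding table. Only the table size of 2048 comes from the recorded parameters; the mixing constants and feature width below are illustrative.

```python
import numpy as np

TABLE_SIZE = 2048  # from the recorded {"size": 2048}

def bigram_bucket(prev_token: int, cur_token: int) -> int:
    # Hash the (previous, current) token pair into a fixed-size table.
    # Mixing constants are illustrative, not the submission's exact hash.
    h = (prev_token * 1000003 + cur_token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % TABLE_SIZE

def bigram_embed(tokens, table):
    # One hashed-bigram feature vector per position; the first position
    # pairs with a placeholder token 0.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]

table = np.zeros((TABLE_SIZE, 16))
feats = bigram_embed([5, 17, 256], table)
```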
- U-Net skip connections: encoder/decoder-style skip connections across the layer stack (no recorded parameters).
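A minimal sketch of depth-wise U-Net wiring in a layer stack: the first half of the layers push their inputs onto a stack, the second half pop and add them. The toy `layer_fn` stands in for a real transformer block; the layer count is an assumption.

```python
import numpy as np

def unet_stack(x, n_layers=6):
    # Encoder half remembers activations; decoder half adds the matching
    # skip before running its layer, mirroring U-Net skip connections.
    def layer_fn(h):
        return h + 0.1 * np.tanh(h)  # stand-in for a transformer block
    skips = []
    half = n_layers // 2
    for i in range(n_layers):
        if i < half:
            skips.append(x)        # encoder side: save activation
        else:
            x = x + skips.pop()    # decoder side: add matching skip
        x = layer_fn(x)
    return x

out = unet_stack(np.ones((2, 4)))
```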
- tied embeddings: the input embedding and output projection share one weight matrix (no recorded parameters).
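Weight tying in a nutshell, with illustrative sizes: one matrix `E` serves as both the embedding lookup and, transposed, the output projection, halving those parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 16
E = rng.standard_normal((vocab, d_model)) * 0.02  # single shared matrix

def embed(token_ids):
    return E[token_ids]   # input embedding: rows of E

def logits(hidden):
    return hidden @ E.T   # output projection: the same E, transposed

h = embed(np.array([3, 7]))
z = logits(h)
```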
Quantization
- int6: 6-bit quantization of all weights, with FP16 passthrough for embeddings and control tensors.
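A sketch of symmetric per-tensor int6 quantization, assuming signed integers in [-31, 31] with a single float scale per tensor (the scheme's exact granularity is not recorded, so this is illustrative):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric int6: integers in [-31, 31] plus one float scale per tensor.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```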
Compression
- zstd (level 22)
Optimizer
- Muon: Newton-Schulz orthogonalization, compiled (weight decay and momentum not recorded).
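Muon's core step approximately orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration. A numpy sketch, using the coefficients and 5-step count from the public Muon reference implementation (an assumption; this submission's exact settings are not recorded):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of g
    # toward 1 without an explicit SVD. Coefficients from the Muon reference.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization for stability
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
u = newton_schulz_orthogonalize(rng.standard_normal((8, 16)))
sv = np.linalg.svd(u, compute_uv=False)  # should cluster near 1
```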
- AdamW: fused implementation (weight decay and momentum not recorded).
Weight Averaging
- EMA (decay: 0.999, applied every 10 steps)
- SWA (averaged over 11 checkpoints)
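Both averaging schemes can be sketched in a few lines; the decay of 0.999, the 10-step interval, and the 11-checkpoint count come from the recorded parameters, while the toy training loop is illustrative.

```python
import numpy as np

def ema_update(ema, params, decay=0.999):
    # Exponential moving average of the weights (decay 0.999).
    return decay * ema + (1.0 - decay) * params

def swa_average(checkpoints):
    # SWA: plain mean over the saved checkpoints.
    return np.mean(checkpoints, axis=0)

params = np.zeros(4)
ema = np.ones(4)
for step in range(1, 101):
    params = params + 0.01          # stand-in for an optimizer step
    if step % 10 == 0:              # every_steps = 10
        ema = ema_update(ema, params)

swa = swa_average([np.full(4, float(i)) for i in range(11)])
```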
Test-Time Training
- LoRA TTT (rank 8 on Q/V projections, rank 16 on the LM head; learning rate 0.01; 6 epochs; 64 documents per GPU per batch)
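The LoRA building block that such TTT adapts per document: a frozen base weight plus a trainable low-rank update. Rank 8 matches the recorded `rank_qv`; the layer sizes and initialization scale below are assumptions.

```python
import numpy as np

class LoRALinear:
    # Frozen base weight plus a trainable low-rank update B @ A.
    # B starts at zero, so the layer is a no-op before any TTT steps.
    def __init__(self, w, rank=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w.shape
        self.w = w                                    # frozen during TTT
        self.a = rng.standard_normal((rank, d_in)) * 0.01
        self.b = np.zeros((d_out, rank))              # zero init

    def __call__(self, x):
        return x @ self.w.T + x @ self.a.T @ self.b.T

rng = np.random.default_rng(1)
base = rng.standard_normal((32, 64))
layer = LoRALinear(base, rank=8)
x = rng.standard_normal((4, 64))
y = layer(x)
```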
LR Schedule
- warmdown + cosine decay (warmdown steps: 6000; per-step cosine decay)
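One plausible reading of this schedule, sketched below: hold the peak rate, linearly "warm down" over the final 6000 steps, and multiply by a per-step cosine factor over the whole run. Only `warmdown_steps=6000` and the per-step cosine decay come from the recorded parameters; the peak rate, total step count, and the exact way the two factors combine are assumptions.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmdown_steps=6000):
    # Per-step cosine factor over the full run, times a linear warmdown
    # to zero over the final warmdown_steps.
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        ramp = 1.0
    else:
        ramp = max(0.0, (total_steps - step) / warmdown_steps)
    return peak_lr * ramp * cosine

total = 20000
lrs = [lr_at(s, total) for s in range(total + 1)]
```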
Sequence Length
- train: 1024; eval: not recorded
Regularization
- gradient clipping (max norm: 1)
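Global-norm gradient clipping with `max_norm = 1` (the recorded value) can be sketched as: compute the norm over all gradient tensors jointly, then rescale every tensor by the same factor when that norm exceeds the cap.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Joint norm over all gradient tensors; scale everything down
    # by max_norm / total_norm when the total exceeds max_norm.
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.full(4, 3.0), np.full(9, 4.0)]
clipped, norm = clip_grad_norm(grads)
```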
Other
- Late QAT during warmdown.
- FlashAttention-3 integration for faster causal attention on H100.
- Rotary cache .clone() fix to resolve a CUDA graph conflict with FlashAttention-3.
Novel Contributions
- FlashAttention-3 integration for faster attention on H100
- Rotary cache .clone() fix for CUDA graph compatibility with FlashAttention-3
- LoRA-based test-time training with per-document adaptation
- Per-layer learning rates for LoRA and bias parameters during TTT
- Score-every-epoch backward-looking evaluation compliant with Issue #402
- Late QAT combined with int6 quantization and zstd compression