val_bpb: 1.1257
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB
Training Techniques
Quantization
GPTQ-lite (bits: 6, scope: all weights)
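As described under Novel Contributions, GPTQ-lite searches a per-layer optimal clip percentile during int6 quantization. A minimal pure-Python sketch of that idea (function names and the percentile grid are illustrative, not the submission's code):

```python
import random

def quantize_int6(w, clip):
    """Symmetric int6 fake-quantization of a weight list at a given clip value."""
    qmax = 31  # symmetric int6 grid: {-31, ..., 31}
    scale = clip / qmax
    return [round(max(-clip, min(clip, x)) / scale) * scale for x in w]

def best_clip_percentile(w, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Pick the clip percentile that minimizes reconstruction MSE for this layer."""
    s = sorted(abs(x) for x in w)
    best = None
    for p in percentiles:
        clip = s[min(len(s) - 1, int(len(s) * p / 100.0))]
        deq = quantize_int6(w, clip)
        mse = sum((a - b) ** 2 for a, b in zip(w, deq)) / len(w)
        if best is None or mse < best[1]:
            best = (p, mse)
    return best

random.seed(0)
layer_w = [random.gauss(0.0, 0.02) for _ in range(4096)]
p, mse = best_clip_percentile(layer_w)
```

Clipping below the absolute max trades a little tail error for a finer quantization step over the bulk of the weights, which is why the optimum is searched per layer.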
Architecture
MLP3x: 3x MLP expansion with relu-squared activation (expansion: 3)
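The MLP3x block can be sketched in a few lines of pure Python; the weights here are toy hand-set values, the real ones are learned and vectorized:

```python
def relu2(x):
    # relu-squared activation: max(0, x)^2
    r = max(0.0, x)
    return r * r

def mlp3x(x, w_in, w_out):
    """MLP block with 3x hidden expansion: d -> 3d -> d, relu^2 in between.
    x: list of length d; w_in: 3d rows of length d; w_out: d rows of length 3d."""
    hidden = [relu2(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return [sum(wo * hi for wo, hi in zip(row, hidden)) for row in w_out]

d = 4
x = [0.1, -0.2, 0.3, 0.05]
w_in = [[0.1] * d for _ in range(3 * d)]      # toy weights, all 0.1
w_out = [[0.1] * (3 * d) for _ in range(d)]
y = mlp3x(x, w_in, w_out)
```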
XSA: efficient partial XSA applied to the last 4 layers (last_n_layers: 4)
RoPE: partial RoPE with NTK-aware scaling, applied to 16 of 64 head dimensions
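A sketch of partial RoPE over the first 16 of 64 head dimensions. The NTK-aware base stretch shown here, base * alpha^(d/(d-2)), is one common formulation; the submission's exact scaling variant is not specified:

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0, ntk_alpha=1.0):
    """Rotate only the first rot_dims of the head vector; leave the rest untouched.
    NTK-aware scaling stretches the rotary base (formulation assumed)."""
    eff_base = base * ntk_alpha ** (rot_dims / (rot_dims - 2))
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos / (eff_base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[i], q[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head_dim = 64
q = [0.01 * i for i in range(head_dim)]
q_rot = partial_rope(q, pos=7, rot_dims=16, ntk_alpha=2.0)
```

Leaving the remaining 48 dimensions unrotated gives the model position-free channels, while the NTK stretch keeps the rotated frequencies usable beyond the training context.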
SmearGate: gating mechanism (no parameters)
BigramHash: hashed bigram embeddings (buckets: 2048, dim: 128)
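A plausible reading of BigramHash: hash each (previous token, current token) pair into one of 2048 buckets and look up a 128-dimensional embedding for it. The mixing constants below are illustrative, not the submission's hash:

```python
import random

BUCKETS = 2048
DIM = 128

def bigram_bucket(prev_tok, cur_tok, buckets=BUCKETS):
    # Mix the two token ids into one bucket index (hash function assumed).
    h = (prev_tok * 1000003 + cur_tok) * 2654435761 % (2 ** 32)
    return h % buckets

random.seed(0)
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

tokens = [5, 17, 17, 901]
# Each position t >= 1 gets the embedding row for its (t-1, t) bigram.
bigram_embs = [table[bigram_bucket(tokens[t - 1], tokens[t])]
               for t in range(1, len(tokens))]
```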
KV head count: grouped-query attention with 8 query heads sharing 4 KV heads
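The grouped-query layout maps each of the 8 query heads to one of 4 shared KV heads, halving KV-cache size relative to full multi-head attention:

```python
HEADS = 8
KV_HEADS = 4
GROUP = HEADS // KV_HEADS  # 2 query heads share each KV head

def kv_head_for(q_head):
    """Map a query head index to the KV head whose keys/values it reads."""
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```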
Value Embedding: shared value embedding used in later layers (dim: 128, layers: [9, 10])
Weight Averaging
SWA (stochastic weight averaging): every_steps: 50, checkpoint_count: 12, scale_threshold: 0.2
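A sketch of the averaging loop: snapshot every 50 steps, keep the last 12 checkpoints, average elementwise. How scale_threshold: 0.2 gates averaging (presumably on the LR scale) is not specified, so it is omitted here:

```python
from collections import deque

class WeightAverager:
    """Keep the last `checkpoint_count` snapshots, taken every `every_steps`
    steps, and expose their elementwise mean (weights as flat lists here)."""
    def __init__(self, every_steps=50, checkpoint_count=12):
        self.every_steps = every_steps
        self.snaps = deque(maxlen=checkpoint_count)

    def maybe_snapshot(self, step, weights):
        if step % self.every_steps == 0:
            self.snaps.append(list(weights))

    def averaged(self):
        n = len(self.snaps)
        return [sum(s[i] for s in self.snaps) / n
                for i in range(len(self.snaps[0]))]

avg = WeightAverager(every_steps=50, checkpoint_count=12)
for step in range(50, 701, 50):                  # 14 snapshot opportunities
    avg.maybe_snapshot(step, [float(step), float(step) * 2])
mean_w = avg.averaged()                          # mean over the last 12 snapshots
```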
Compression
zstd (level: 22)
Evaluation
Sliding-window evaluation (stride: 64)
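Sliding-window evaluation with stride 64 scores only the trailing stride tokens of each window, so every token is evaluated exactly once with long left context. A sketch of the window bookkeeping (the scoring itself is model-dependent):

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (start, end, score_from) triples: each window spans [start, end)
    but only positions [score_from, end) contribute to the loss."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(n_tokens, pos + stride)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows

wins = sliding_windows(n_tokens=300, context=128, stride=64)
```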
Test-Time Training
Self-distillation TTT (temperature: 2, freeze_blocks: 4, epochs: 2, learning_rate: 0.001)
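The self-distillation loss can be sketched as a temperature-2 KL divergence between the frozen teacher's and the student's distributions, scaled by T^2 as in standard distillation; whether the submission applies the T^2 factor is an assumption:

```python
import math

def softmax(logits, temperature=1.0):
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

same = distill_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])
diff = distill_loss([1.0, 2.0, 0.5], [2.0, 1.0, 0.5])
```

With freeze_blocks: 4, the first four blocks would stay frozen during the two TTT epochs, consistent with the stated goal of preserving the XSA attention patterns.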
Initialization
Orthogonal init: orthogonal initialization with projection scaling
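Orthogonal initialization via Gram-Schmidt on a random Gaussian matrix; the `gain` argument stands in for the unspecified "projection scaling" factor:

```python
import math
import random

def orthogonal_init(n, gain=1.0, seed=0):
    """Return an n x n matrix with orthonormal rows, scaled by `gain`."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        # Subtract projections onto the rows already in the basis.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return [[gain * x for x in row] for row in basis]

Q = orthogonal_init(8)
```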
Sequence Length
train_length: 2048, eval_length: null
LR Schedule
Warmdown (warmdown_iters: 3000)
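The warmdown schedule holds the LR constant, then decays it to zero over the final 3000 iterations; the linear decay shape is assumed, and total_iters below is illustrative:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

TOTAL = 10000
schedule = [lr_scale(s, TOTAL) for s in (0, 6999, 7000, 8500, 10000)]
```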
Regularization
Layerwise LN scale (scale_rule: 1/sqrt(layer_idx+1))
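The layerwise LN scale rule is just the stated formula, so deeper layers get progressively smaller LayerNorm gains:

```python
import math

def ln_scale(layer_idx):
    """Per-layer LayerNorm gain: 1/sqrt(layer_idx + 1)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(4)]
```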
Other
Late QAT with STE int6, enabled once the LR scale drops below 0.1 (lr_scale_threshold: 0.1)
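Late QAT with a straight-through estimator (STE): once the LR scale falls below 0.1, the forward pass sees int6 fake-quantized weights while gradient updates still land on the full-precision master copy. A minimal single-weight sketch (the clip value is illustrative):

```python
def fake_quant_int6(w, clip):
    """Round a scalar weight onto the symmetric int6 grid within [-clip, clip]."""
    qmax = 31
    scale = clip / qmax
    return round(max(-clip, min(clip, w)) / scale) * scale

def late_qat_active(lr_scale, threshold=0.1):
    # QAT switches on only late in training, when the LR scale is small.
    return lr_scale < threshold

def qat_sgd_step(w_fp, grad, lr, clip=0.1):
    """STE step: forward uses the quantized weight, but the gradient is applied
    to the full-precision master weight as if quantization were the identity."""
    w_used = fake_quant_int6(w_fp, clip)   # what the forward pass computes with
    w_fp_new = w_fp - lr * grad            # straight-through update
    return w_used, w_fp_new

w = 0.0312
w_q, w = qat_sgd_step(w, grad=0.5, lr=0.01)
```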
Novel Contributions
- GPTQ-lite: per-layer optimal clip percentile search during int6 quantization
- Self-distillation TTT using a frozen teacher to preserve XSA attention patterns
- Late QAT with STE int6 during training