PR #150
openRecord: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)
by yahya010
val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.76 MB
Training Techniques
Quantization
STE QAT
Quantization-aware training with a straight-through estimator
parameters: {"bits":6,"scope":"all"}
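A minimal NumPy sketch of the int6 fake-quantization forward pass. The per-tensor symmetric scale is an assumption (the record only fixes bits=6, scope=all); during QAT the straight-through estimator treats the round() as identity in the backward pass.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization (forward pass of STE QAT).

    Weights are rounded to one of 2**bits levels and immediately dequantized;
    in training, the straight-through estimator passes gradients through
    round() unchanged, so they reach the full-precision master weights.
    """
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6
    scale = np.max(np.abs(w)) / qmax + 1e-12     # per-tensor scale (an assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.73, -0.31, 0.02, 1.0])
w_q = fake_quant_int6(w)                         # dequantized int6 view of w
```

Because inference uses the same quantized forward pass the network was trained with, the quantization gap at eval time can be driven to zero, as the record claims.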
Architecture
SmearGate
Learned sigmoid token blending
parameters: null
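The record lists no parameters for SmearGate, so the exact gating form is unknown; a plausible minimal sketch blends each token with its predecessor through a learned sigmoid gate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    """Blend each token with the previous one: out_t = (1-g)*x_t + g*x_{t-1}.

    x: (seq, dim) activations; gate_logit: learned scalar (hypothetical form,
    since the record lists no parameters). x_{-1} is taken to be zero.
    """
    g = sigmoid(gate_logit)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right one step
    return (1.0 - g) * x + g * prev

x = np.arange(6.0).reshape(3, 2)
y = smear_gate(x, gate_logit=-10.0)   # gate near 0: output stays close to input
```

A strongly negative logit leaves tokens untouched, while a positive one smears information forward, so the network can learn how much local mixing helps.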
BigramHash
Hash embedding for bigrams
parameters: {"buckets":2048,"dim":128}
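A sketch of hash-bucketed bigram embeddings with the record's 2048 buckets and dim 128. The hash function itself is not specified, so the multiplicative mix below is an arbitrary choice:

```python
import numpy as np

BUCKETS, DIM = 2048, 128            # from the record's parameters

def bigram_bucket(prev_id, cur_id):
    """Hash the (previous, current) token pair into one of BUCKETS slots.

    The mixing constant is an arbitrary prime; the record only fixes the
    bucket count and embedding dimension, not the hash.
    """
    return (prev_id * 1_000_003 + cur_id) % BUCKETS

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in practice

def bigram_features(token_ids):
    """Per-position bigram embedding, added to the ordinary token embedding."""
    prev = [0] + list(token_ids[:-1])                       # pad first position
    idx = [bigram_bucket(p, c) for p, c in zip(prev, token_ids)]
    return bigram_table[idx]                                # (seq, DIM)

feats = bigram_features([5, 17, 17, 9])
```

Hash collisions are tolerated by design: the table stays small (2048 x 128) while still letting the model memorize frequent token pairs cheaply.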
MLP3x
Expanded MLP hidden size to 3x the model dimension
parameters: {"hidden":1536}
tied embeddings
FP16 tied input/output embeddings
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
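A NumPy sketch of grouped-query attention with the record's 8 query heads and 4 KV heads (the causal mask is omitted for brevity):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 16   # from the record: 2 query heads per KV head

def gqa(q, k, v):
    """Grouped-query attention: each KV head is shared by HEADS // KV_HEADS
    query heads, halving the KV cache here without reducing query heads.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM).
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)                 # expand KV heads to match q
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(HEADS, 10, HEAD_DIM)),
          rng.normal(size=(KV_HEADS, 10, HEAD_DIM)),
          rng.normal(size=(KV_HEADS, 10, HEAD_DIM)))
```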
NTK-RoPE
Rotary positional embeddings with NTK scaling
parameters: {"base":50000}
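A sketch of rotary embeddings with the record's NTK-scaled base of 50000 (versus the common 10000); raising the base stretches the rotation wavelengths so longer contexts stay closer to the training distribution:

```python
import numpy as np

def apply_rope(x, base=50000.0):
    """Rotary embeddings on (seq, head_dim): consecutive feature pairs are
    rotated by position-dependent angles. NTK scaling raises `base`, which
    slows the per-dimension rotation frequencies."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # (d//2,) frequencies
    angles = np.outer(np.arange(seq), inv_freq)     # (seq, d//2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

x = np.random.default_rng(0).normal(size=(32, 16))
y = apply_rope(x)
```

Because each pair is a pure rotation, vector norms are preserved and relative position falls out of the dot product between rotated queries and keys.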
Optimizer
Muon
learning_rate: 0.025
weight_decay: 0.04
momentum: 0.99
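Muon's core step orthogonalizes the momentum-accumulated gradient with a quintic Newton-Schulz iteration. A sketch with the record's hyperparameters; the iteration coefficients follow the public Muon implementation, and decoupled weight decay is an assumption:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a 2D gradient: the quintic Newton-Schulz
    iteration drives singular values toward 1, keeping singular vectors."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        xxT = x @ x.T
        x = a * x + (b * xxT + c * xxT @ xxT) @ x
    return x

def muon_step(w, g, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon update with the record's hyperparameters (the exact weight-decay
    coupling is not recorded; decoupled decay is assumed here)."""
    buf = momentum * buf + g               # momentum accumulation
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(buf)
    return w, buf
```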
Weight Averaging
SWA
parameters: {"checkpoints":8,"warmdown":true,"interval_steps":200}
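The averaging itself is a uniform mean over parameter snapshots; per the record, 8 checkpoints taken every 200 steps during the learning-rate warmdown:

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of parameter snapshots (SWA). The snapshot schedule
    (8 checkpoints, 200-step interval, warmdown only) comes from the record."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

ckpts = [{"w": np.full((2, 2), float(i))} for i in range(8)]  # toy snapshots
avg = swa_average(ckpts)
```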
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
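A sketch of the sliding-window scoring schedule. Each window scores only its final `stride` tokens, with the rest as left context, so nearly every scored token sees a long history; the record fixes only stride=64, so the window size below is hypothetical:

```python
def score_spans(n_tokens, window=512, stride=64):
    """Yield (context_start, window_end, n_scored) spans covering the stream.

    Only the last `stride` tokens of each window contribute to the loss;
    `window` here is an assumed context length, not from the record.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = score_spans(200)
```

Every token is scored exactly once, so the resulting bits-per-byte is comparable to a single-pass evaluation, just with more context per token.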
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"freeze_first_blocks":2}
Initialization
OrthoInit
Orthogonal initialization with muP scaling for output projections
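A sketch of orthogonal initialization via QR of a Gaussian. The 1/sqrt(fan_in) gain for output projections is an assumption; the exact muP rule depends on the chosen parameterization:

```python
import numpy as np

def ortho_init(rows, cols, fan_in_scale=False, rng=None):
    """Orthogonal init from the QR factorization of a random Gaussian.

    With fan_in_scale=True (for output projections), the gain is divided by
    sqrt(fan_in); this muP-style factor is an assumed convention."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)                  # q has orthonormal columns
    q = q * np.sign(np.diag(r))             # fix the QR sign ambiguity
    w = q if rows >= cols else q.T          # orthonormal rows when rows < cols
    return w / np.sqrt(cols) if fan_in_scale else w

w = ortho_init(64, 64)                      # w.T @ w is the identity
```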
Novel Contributions
- 11-layer transformer with 3x MLP expansion
- STE int6 quantization-aware training with zero quantization gap
- SmearGate learned token blending
- BigramHash embedding augmentation
- OrthoInit with muP scaling for output projections
- SWA checkpoint averaging during warmdown
- Full-weight test-time training on validation data
- NTK-RoPE positional encoding
- Sliding window evaluation with stride 64