PR #223
Draft: SOTA+ TTT + RoPE50K + EMA + Curriculum (pending H100 run)
by 0xjaishy
val_bpb
1.1326
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7MB
Training Techniques
Quantization
mixed int6/int8
bits: null
scope: MLP+Attn int6, embeddings int8
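A minimal sketch of the mixed-precision scheme above: symmetric quantization of MLP/attention weights to 6 bits and embeddings to 8 bits. Per-tensor scaling is an assumption; the PR does not state the granularity.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric quantization to a signed `bits`-wide integer range.
    Per-tensor scale is an assumption (the PR may use per-channel)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# MLP/attention weights -> int6, embeddings -> int8, as stated in the card
w_mlp = np.random.randn(4, 4).astype(np.float32)
q6, s6 = quantize_symmetric(w_mlp, bits=6)
w_emb = np.random.randn(4, 4).astype(np.float32)
q8, s8 = quantize_symmetric(w_emb, bits=8)
```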
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
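Muon applies heavy-ball momentum to the gradient of each 2-D weight matrix, then orthogonalizes the update with a Newton-Schulz iteration. A sketch using the PR's momentum=0.99 and weight_decay=0.04; the quintic coefficients are from the public Muon recipe, and the decoupled weight-decay placement is an assumption.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate the nearest semi-orthogonal matrix to g via a quintic
    Newton-Schulz iteration (coefficients from the public Muon recipe)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # normalize spectral scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum first, then orthogonalize the update."""
    buf = momentum * buf + g
    update = newton_schulz_orthogonalize(buf)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf
```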
Evaluation
sliding window eval
parameters: {"stride":64}
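Sliding-window evaluation advances a fixed-length context window by the stride and scores only the tokens not covered by the previous window, so nearly every token is predicted with close-to-full context. A sketch of the span bookkeeping under stride=64; the exact scheme beyond that parameter is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return ((context_begin, context_end), (score_begin, score_end))
    pairs for strided eval: each window scores only its new tokens."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append(((begin, end), (prev_end, end)))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The scored spans are disjoint and cover the whole sequence, so summing their losses gives an exact bpb over the eval set.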
Architecture
SmearGate
Learned per-dimension gate blending each token with its predecessor
parameters: null
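The gate description above can be sketched as a sigmoid-parameterized per-dimension mix between each token's embedding and the previous token's; the exact parameterization is an assumption since the card lists no parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, gate_logits):
    """Blend each token with its predecessor via a learned per-dimension
    gate. x: (seq, dim); gate_logits: (dim,) learned parameters."""
    g = sigmoid(gate_logits)            # per-dimension blend in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                       # first token has no predecessor
    return (1.0 - g) * x + g * prev
```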
BigramHash
Hash-based token-pair embeddings
parameters: {"buckets":2048}
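Hash-based pair embeddings map each (previous token, current token) pair into a fixed table, here with the card's buckets=2048. The mixing constants and the sentinel for position 0 are assumptions for illustration.

```python
import numpy as np

def bigram_bucket(prev_token: int, token: int, buckets: int = 2048) -> int:
    """Hash a (previous, current) token pair into one of `buckets` slots.
    Multiplicative mixing constants are illustrative assumptions."""
    h = (prev_token * 1_000_003 + token) * 2_654_435_761
    return (h & 0xFFFFFFFF) % buckets

def bigram_embeddings(tokens, table):
    """Look up a hashed pair embedding per position; position 0 pairs
    with a sentinel previous token of 0 (an assumption)."""
    buckets, dim = table.shape
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t, buckets) for p, t in zip(prev, tokens)]
    return table[idx]
```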
MLP3x
Wider feed-forward network with 3x hidden expansion
parameters: {"hidden":1536}
tied embeddings
Input and output embeddings are tied
parameters: null
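Weight tying means one matrix serves as both the input embedding table and the output projection, which cuts parameters and the compressed artifact size. A minimal sketch with illustrative vocab/dim sizes:

```python
import numpy as np

# One matrix is both input embedding and output projection (tied weights).
vocab, dim = 50257, 64
embed = np.random.randn(vocab, dim).astype(np.float32) * 0.02

def embed_tokens(tokens):
    return embed[tokens]                  # (seq, dim) input embeddings

def logits(hidden):
    return hidden @ embed.T               # (seq, vocab) via the same matrix
```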
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
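With heads=8 and kv_heads=4, each KV head is shared by two query heads. A sketch of the attention math (non-causal, no masking, for brevity):

```python
import numpy as np

def grouped_query_attention(q, k, v, heads=8, kv_heads=4):
    """GQA with the PR's 8 query / 4 KV heads: each KV head serves
    heads // kv_heads query heads. q: (heads, seq, hd); k, v: (kv_heads, seq, hd)."""
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)      # expand KV heads to match queries
    v = np.repeat(v, group, axis=0)
    hd = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v
```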
RoPE
Rotary position embeddings with increased base for smoother interpolation
parameters: {"base":50000}
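Raising the RoPE base to 50000 (versus the common default of 10000) slows the rotation frequencies, which smooths position interpolation at the eval length of 2048. A sketch of the rotation itself, applied to a (seq, dim) block:

```python
import numpy as np

def rope(x, base=50000.0):
    """Apply rotary position embeddings with base 50000 (the PR's setting).
    x: (seq, dim), dim even; adjacent pairs of dims are rotated."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Position 0 is rotated by angle zero, so it is left unchanged, and rotations preserve vector norms.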
U-Net skip connections
Skip connections with learned weights
parameters: null
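U-Net-style skips in a transformer stack save activations from the first half of the blocks and add them back into the mirrored second half. A learned scalar per skip is one plausible reading of "learned weights"; the card gives no parameters.

```python
import numpy as np

def unet_forward(x, encoder_blocks, decoder_blocks, skip_weights):
    """U-Net skips across a block stack: first-half activations are added
    into the mirrored second half, scaled by a learned scalar per skip
    (a sketch; the skip parameterization is an assumption)."""
    saved = []
    for block in encoder_blocks:          # first half: save activations
        x = block(x)
        saved.append(x)
    for block, w in zip(decoder_blocks, skip_weights):
        x = x + w * saved.pop()           # mirror: last saved pairs first
        x = block(x)
    return x
```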
Initialization
OrthoInit
Orthogonal weight initialization with output scaling
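Orthogonal initialization draws a Gaussian matrix and keeps only its orthogonal factor from a QR decomposition; an output `gain` rescales it. The exact scaling rule in the PR is not specified, so `gain` here is a generic stand-in.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix, scaled by `gain`.
    Sign correction on R's diagonal makes the distribution uniform."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))           # fix column signs
    if not tall:
        q = q.T                           # orthonormal rows instead
    return gain * q
```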
Weight Averaging
EMA
parameters: {"decay":0.995}
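The EMA weight average with decay=0.995 is a one-line update applied after each optimizer step; the averaged copy, not the live weights, is what gets evaluated.

```python
def ema_update(ema_params, params, decay=0.995):
    """One EMA step with the PR's decay=0.995:
    ema <- decay * ema + (1 - decay) * current."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```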
Test-Time Training
full TTT
parameters: {"learning_rate":0.0003,"epochs":1,"momentum":0.95}
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Other
other
Context-length curriculum: train at seq1024 for first 60% of wallclock, then switch to seq2048
parameters: {"phase1_fraction":0.6}
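The curriculum switch described above is a simple wall-clock threshold: train at sequence length 1024 until phase1_fraction=0.6 of the budget has elapsed, then at 2048.

```python
def current_seq_length(elapsed, total, phase1_fraction=0.6,
                       short_len=1024, long_len=2048):
    """Context-length curriculum from the card: seq 1024 for the first
    60% of wall-clock, then seq 2048. elapsed/total in the same units."""
    return short_len if elapsed < phase1_fraction * total else long_len
```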
Novel Contributions
- RoPE base 50K for smoother position interpolation at sequence length 2048
- LAWA-EMA replacing periodic SWA with stepwise exponential moving average
- Context-length curriculum from seq1024 to seq2048 during training
- Full-model SGD test-time training on validation data before scoring