PR #178 (closed)
Add Nuclear Stack submission: 1.16668 BPB (seed 2884431328)
by timowhite88
val_bpb: 1.1667
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.8 MB
Training Techniques
Architecture
MLP3x
Uses 3x MLP expansion with ReLU² activation.
parameters: {"hidden":1536}
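A minimal numpy sketch of the 3x-expansion MLP with squared-ReLU. The listed hidden width of 1536 implies a model width of 512 under 3x expansion; that model width is an assumption.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x width expansion and squared-ReLU activation."""
    h = x @ w_in                 # expand: d_model -> 3 * d_model
    h = np.maximum(h, 0.0) ** 2  # ReLU^2
    return h @ w_out             # project back to d_model

rng = np.random.default_rng(0)
d = 512                          # assumption: hidden 1536 = 3 * 512
x = rng.standard_normal((4, d))
w_in = rng.standard_normal((d, 3 * d)) * 0.02
w_out = rng.standard_normal((3 * d, d)) * 0.02
y = mlp3x(x, w_in, w_out)        # shape (4, 512)
```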
SmearGate
Learned gating that blends each token with the previous token.
parameters: none
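One way SmearGate's blending could look, as a hedged sketch: the PR only describes a learned gate mixing each token with its predecessor, so the scalar sigmoid gate parameterization here is an assumption.

```python
import numpy as np

def smear_gate(x, w_gate):
    """Blend each token's embedding with the previous token's embedding
    via a learned sigmoid gate (gate parameterization is an assumption)."""
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (seq, 1) gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                           # first token: nothing to smear in
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
w_gate = rng.standard_normal((64, 1)) * 0.1
y = smear_gate(x, w_gate)
```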
BigramHash
2048-bucket hash table for token-pair context.
parameters: {"buckets":2048}
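A sketch of how a 2048-bucket token-pair hash could feed a learned table. The mixing constant and the pad token for position 0 are illustrative assumptions; the PR only specifies the bucket count.

```python
import numpy as np

def bigram_bucket(prev_tok, tok, buckets=2048):
    """Hash a (previous token, current token) pair into one of `buckets`
    buckets. The multiply-xor mixing is illustrative, not from the PR."""
    return ((prev_tok * 1000003) ^ tok) % buckets

# Learned table: one embedding row per bucket (zeros here as a stand-in).
table = np.zeros((2048, 64))
tokens = [5, 17, 99, 17]
pairs = zip([0] + tokens[:-1], tokens)       # pad position 0 with token 0
feats = np.stack([table[bigram_bucket(p, t)] for p, t in pairs])
```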
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
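The 8-head / 4-KV-head split means each KV head is shared by a group of 2 query heads. A minimal causal GQA sketch (per-head loop for clarity, not speed):

```python
import numpy as np

def gqa(q, k, v):
    """Causal grouped-query attention.
    q: (seq, n_heads, hd); k, v: (seq, n_kv_heads, hd)."""
    seq, n_heads, hd = q.shape
    n_kv = k.shape[1]
    group = n_heads // n_kv               # 8 heads / 4 KV heads -> groups of 2
    mask = np.triu(np.full((seq, seq), -1e9), k=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                   # KV head shared by this query head
        s = q[:, h] @ k[:, kv].T / np.sqrt(hd) + mask
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8, 16))
k = rng.standard_normal((6, 4, 16))
v = rng.standard_normal((6, 4, 16))
y = gqa(q, k, v)                          # shape (6, 8, 16)
```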
Optimizer
Muon
weight_decay: 0.02
momentum: warmup 0.92 → 0.99
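The PR lists only the warmup endpoints 0.92 → 0.99; a linear schedule over training is an assumption in this sketch.

```python
def muon_momentum(step, total_steps, start=0.92, end=0.99):
    """Momentum warmup 0.92 -> 0.99. Linear interpolation is assumed;
    only the endpoints come from the submission."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + frac * (end - start)
```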
Weight Averaging
SWA
parameters: {"checkpoints_averaged":"7-8"}
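SWA here is a uniform element-wise average of the late checkpoints (7-8 of them per the listed parameters). A toy sketch:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts (SWA over late checkpoints)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

# Toy example: 8 checkpoints whose single weight ramps 0..7 -> average 3.5.
ckpts = [{"w": np.full(3, float(i))} for i in range(8)]
avg = average_checkpoints(ckpts)
```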
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
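A sketch of the quantization half of the int6 + zstd pipeline. Symmetric per-tensor scaling into the signed range [-31, 31] is an assumption; the PR only states 6-bit quantization over all weights, with zstd (level 22) applied to the resulting bytes.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization into [-31, 31] (exact
    scheme is an assumption; the PR only states 6-bit, all weights)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale                      # q's bytes are then zstd-compressed

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)            # round-trip error <= scale / 2
```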
Evaluation
sliding window eval
parameters: {"stride":32}
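With stride 32, each evaluation step scores 32 new tokens using up to a full window of left context, so every token's loss is counted exactly once. A sketch of the span bookkeeping (window size is illustrative here):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=32):
    """Score each token exactly once: every step scores `stride` new
    tokens with up to `window` tokens of left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        hi = min(pos + stride, n_tokens)
        lo_ctx = max(0, hi - window)     # context window start
        spans.append((lo_ctx, pos, hi))  # (context start, score start, score end)
        pos = hi
    return spans

spans = sliding_eval_spans(100, window=64, stride=32)
scored = sum(hi - lo for _, lo, hi in spans)   # 100: no double counting
```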
Test-Time Training
full TTT
parameters: {"epochs":2,"learning_rate":0.002,"frozen_blocks":4}
Initialization
Orthogonal init
Orthogonal initialization with muP scaling.
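A sketch of orthogonal initialization with a muP-style width-dependent scale. Scaling by 1/sqrt(fan_in) is one common muP choice for hidden layers; the submission's exact multiplier is an assumption.

```python
import numpy as np

def orthogonal_mup_init(fan_out, fan_in, rng):
    """Orthogonal matrix via QR, scaled by 1/sqrt(fan_in) in the muP
    style (the exact muP multiplier is an assumption)."""
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix signs for a unique Q
    return q / np.sqrt(fan_in)

rng = np.random.default_rng(0)
w = orthogonal_mup_init(8, 4, rng)       # columns orthogonal, scaled down
```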
Sequence Length
train_length: 2048
eval_length: not specified
Regularization
weight decay
parameters: {"value":0.02}
Novel Contributions
- Combines architectural improvements with test-time training in a single submission
- Introduces SmearGate token blending
- Introduces BigramHash token-pair context hashing
- Uses 3x MLP expansion with ReLU² activation
- Applies SWA over multiple checkpoints
- Uses int6 mixed quantization with zstd compression
- Performs honest sliding-window evaluation that avoids double-counting tokens
- Applies full-model test-time training on validation data