PR #361
openfeat: Ultimate SOTA submission - 10L Model, Mixed Int6 QAT, and TTT/LoRA Evaluation
by adityagupta26
val_bpb
1.1400
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Architecture
10L Transformer
Increased model depth to 10 Transformer layers.
parameters: {"layers":10}
MLP3x
Expanded the MLP hidden size to 3.0x the base dimension.
parameters: {"expansion_ratio":3}
SmearGate
Learned gating mechanism to blend information between adjacent tokens for local context.
parameters: null
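The card lists no parameters for SmearGate, so the exact formulation is unknown; a minimal sketch of one plausible reading (a per-channel sigmoid gate that mixes each token's hidden state with its predecessor's) is:

```python
import numpy as np

def smear_gate(h, gate_logits):
    """Hypothetical SmearGate sketch: blend each token with the previous
    token via a learned per-channel sigmoid gate.
    h: (seq_len, dim) hidden states; gate_logits: (dim,) learned parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid gate in (0, 1)
    prev = np.roll(h, 1, axis=0)
    prev[0] = 0.0                           # first token has no predecessor
    return h + g * prev
```

The gate logits would be learned jointly with the rest of the model; at initialization (logits = 0) each token receives half of its predecessor's state.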
BigramHash
Token-pair hashing embedding with 4096 buckets to capture bigram statistics at the input level.
parameters: {"buckets":4096}
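A sketch of how a hashed bigram embedding with 4096 buckets could work; the mixing constants and the BOS id are illustrative assumptions, not the PR's actual hash:

```python
import numpy as np

NUM_BUCKETS = 4096  # from the PR's parameters

def bigram_bucket(prev_tok, tok, num_buckets=NUM_BUCKETS):
    """Hash a (previous token, current token) pair into one of 4096 buckets.
    The multiplier and xor-shift are illustrative, not the PR's scheme."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 13
    return h % num_buckets

def bigram_hash_embed(tokens, table):
    """Look up a bigram embedding for each position (table: (4096, dim)),
    to be added to the usual token embedding at the input layer."""
    out = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # assume id 0 acts as BOS for the first position
    for i, t in enumerate(tokens):
        out[i] = table[bigram_bucket(prev, t)]
        prev = t
    return out
```

Hashing keeps the table small (4096 rows regardless of vocabulary size) at the cost of occasional bucket collisions between unrelated bigrams.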
U-Net skip connections
Added encoder-decoder style skip connections to stabilize gradient flow in deeper networks.
parameters: null
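One common way to realize U-Net style skips in a decoder-only stack is to save the activations of the first half of the layers and add them, mirrored, to the inputs of the second half; a sketch under that assumption (the PR does not specify the wiring):

```python
def unet_transformer_pass(x, layers):
    """U-Net skip sketch over an even stack: the output of 'encoder'
    layer i is added to the input of the mirrored 'decoder' layer.
    `layers` is a list of callables standing in for Transformer blocks."""
    n = len(layers)
    assert n % 2 == 0, "mirrored skips need an even layer count"
    saved = []
    for layer in layers[: n // 2]:      # first half: save activations
        x = layer(x)
        saved.append(x)
    for layer in layers[n // 2 :]:      # second half: add mirrored skip
        x = layer(x + saved.pop())
    return x
```

The extra residual paths give gradients a shorter route from the loss to early layers, which is the stabilization effect the card describes.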
Quantization
mixed int6 QAT
bits: 6
scope: all
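The contributions list mentions per-row scaling for the int6 QAT; a fake-quantization sketch along those lines (the PR's exact scheme may differ, e.g. in which layers stay mixed-precision):

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Fake-quantize a weight matrix to symmetric int6 with per-row scales.
    6 bits, one for sign, gives an integer range of [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                            # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + eps
    q = np.clip(np.round(w / scale), -qmax, qmax)      # integer codes
    return q * scale                                   # dequantized weights
```

During QAT the forward pass would use `fake_quant_int6(w)` while gradients flow to the full-precision `w` via a straight-through estimator, so the model learns weights that survive 6-bit rounding.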
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"weight_decay":true}
Weight Averaging
SWA
parameters: {"start_fraction":0.5}
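With `start_fraction: 0.5`, SWA collects checkpoints only over the final half of training; a minimal running-mean sketch (real implementations, e.g. PyTorch's `swa_utils`, also handle batch-norm statistics):

```python
import numpy as np

class SWA:
    """Stochastic Weight Averaging sketch: running average of weights
    collected from step `total_steps * start_fraction` onward."""
    def __init__(self, total_steps, start_fraction=0.5):
        self.start = int(total_steps * start_fraction)
        self.avg = None
        self.count = 0

    def update(self, step, weights):
        if step < self.start:
            return                      # still in the first half: skip
        self.count += 1
        if self.avg is None:
            self.avg = weights.astype(float).copy()
        else:
            self.avg += (weights - self.avg) / self.count  # running mean
```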
Evaluation
sliding window eval
parameters: {"stride":64}
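Sliding-window evaluation with stride 64 typically means each window advances 64 tokens and only the newly uncovered tokens are scored, so every token gets near-maximal left context; a sketch of the span computation (names are illustrative):

```python
def sliding_window_spans(n_tokens, context_len, stride=64):
    """Return (window_start, window_end, first_scored) triples such that
    tokens [first_scored, window_end) are scored in each window and every
    token is scored exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride lowers measured bits-per-byte (more context per scored token) at the cost of proportionally more forward passes.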
Test-Time Training
LoRA TTT
parameters: {"rank":8}
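For rank-8 LoRA at test time, the base weights stay frozen and only a low-rank update is adapted on the evaluation stream; a sketch of the adapted layer (the `alpha` scaling and zero-init of `B` follow LoRA convention, not the PR's stated values):

```python
import numpy as np

class LoRALinear:
    """LoRA sketch for test-time training: effective weight is
    W + (alpha / rank) * B @ A, with W frozen and only A, B trainable."""
    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0, 0.01, (rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))              # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` starts at zero, the adapter is a no-op before any test-time gradient steps, and the per-step cost of updating A and B is small relative to the full weight matrix.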
Regularization
weight decay
parameters: null
Compression
zstd
level: 22
Other
Magnitude pruning
Magnitude pruning of the smallest 3% of weights post-training to improve compression efficiency.
parameters: {"pruned_fraction":0.03}
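Zeroing the smallest 3% of weights by magnitude creates runs of identical bytes that zstd compresses well; a sketch of the post-training pruning step:

```python
import numpy as np

def magnitude_prune(w, fraction=0.03):
    """Zero the smallest `fraction` of weights by absolute value.
    Ties at the threshold may prune slightly more than `fraction`."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

At 3%, the accuracy cost is usually negligible while the compressed artifact shrinks, trading a small amount of val_bpb headroom for artifact size.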
Novel Contributions
- 10-layer Transformer with 3.0x MLP expansion
- SmearGate local token blending mechanism
- BigramHash embedding with 4096 buckets
- U-Net style skip connections in the Transformer
- Mixed int6 quantization-aware training with per-row scaling
- Muon optimizer extended with weight decay
- Stochastic Weight Averaging during the final half of training
- Sliding-window evaluation with stride 64
- Test-time training using batched LoRA adapters of rank 8
- Magnitude pruning of 3% of weights
- Zstandard level 22 artifact compression