PR #361

open

feat: Ultimate SOTA submission - 10L Model, Mixed Int6 QAT, and TTT/LoRA Evaluation

by adityagupta26
val_bpb
1.1400
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB

Training Techniques

Architecture
10L Transformer
Increased model depth to 10 Transformer layers.
parameters: {"layers":10}
MLP3x
Expanded the MLP hidden size to 3.0x the base dimension.
parameters: {"expansion_ratio":3}
SmearGate
Learned gating mechanism to blend information between adjacent tokens for local context.
parameters: null
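A minimal sketch of the SmearGate idea described above: each token's activation is blended with its predecessor through a learned sigmoid gate. The gate parameterization (per-dimension weight and bias) is an assumption; the PR does not specify the exact form.

```python
import numpy as np

def smear_gate(x, w_gate, b_gate):
    """Blend each token with the previous token via a learned sigmoid gate.

    x: (seq_len, dim) token activations
    w_gate, b_gate: per-dimension gate parameters (hypothetical shapes)
    """
    gate = 1.0 / (1.0 + np.exp(-(x * w_gate + b_gate)))  # sigmoid gate in [0, 1]
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0  # the first token has no predecessor
    # gate -> 0 keeps the token unchanged; gate -> 1 copies the predecessor
    return (1.0 - gate) * x + gate * prev
```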
BigramHash
Token-pair hashing embedding with 4096 buckets to capture bigram statistics at the input level.
parameters: {"buckets":4096}
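The BigramHash embedding can be sketched as follows: each (previous token, current token) pair is hashed into one of 4096 buckets, and the bucket indexes a learned embedding table that is added at the input. The multiplicative hash constant is an assumption, not taken from the PR.

```python
import numpy as np

N_BUCKETS = 4096  # from the PR parameters

def bigram_bucket(prev_id, cur_id, n_buckets=N_BUCKETS):
    # Hash the token pair into a bucket (the hash function is an assumption)
    return (prev_id * 1000003 + cur_id) % n_buckets

def bigram_embed(token_ids, table):
    """Look up a learned bigram embedding for every position.

    token_ids: (seq_len,) int array
    table: (n_buckets, dim) learned bigram embedding matrix
    """
    prev = np.concatenate([[0], token_ids[:-1]])  # pad position 0 with token 0
    buckets = (prev * 1000003 + token_ids) % table.shape[0]
    return table[buckets]
```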
U-Net skip connections
Added encoder-decoder style skip connections to stabilize gradient flow in deeper networks.
parameters: null
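The skip-connection scheme above can be sketched like this: outputs of the first half of the layer stack are saved and added back to the inputs of the mirrored layers in the second half. The exact pairing in the PR is not shown, so the mirroring here is an assumption.

```python
def unet_forward(x, layers):
    """Run layers with U-Net style skips: the output of layer i in the
    first half is added to the input of the mirrored layer in the
    second half (mirroring scheme is an assumption)."""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= half and skips:
            x = x + skips.pop()  # mirrored skip connection
        x = layer(x)
        if i < half:
            skips.append(x)
    return x
```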
Quantization
mixed int6 QAT
bits: 6
scope: all
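A sketch of the int6 fake quantization with per-row scaling (per-row scales are mentioned in the contributions list below). In QAT, weights pass through this quantize-dequantize step during the forward pass so the model learns to tolerate the rounding; the exact scheme in the PR is not shown.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Quantize-dequantize to signed int6 with one scale per row (a sketch)."""
    qmax = 2 ** (6 - 1) - 1  # 31 for signed int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights used in the forward pass
```

With per-row scales, the maximum rounding error in each row is half of that row's quantization step.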
Optimizer
Muon
weight_decay: true (value not specified)
momentum: null
Weight Averaging
SWA
parameters: {"start_fraction":0.5}
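Stochastic Weight Averaging with start_fraction 0.5 means the final weights are the average of checkpoints from the second half of training. A minimal sketch over a list of checkpoint dicts:

```python
def swa_average(checkpoints, start_fraction=0.5):
    """Average parameters over the last (1 - start_fraction) of training,
    matching the PR's start_fraction = 0.5."""
    start = int(len(checkpoints) * start_fraction)
    tail = checkpoints[start:]
    n = len(tail)
    # uniform average of each parameter over the selected checkpoints
    return {k: sum(c[k] for c in tail) / n for k in tail[0]}
```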
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
LoRA TTT
parameters: {"rank":8}
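The LoRA forward pass used for test-time training can be sketched as follows: the frozen base weight W is augmented with a rank-8 update B @ A, and only A and B are adapted on the test sequence. The shapes below are conventional LoRA shapes, not taken from the PR.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """y = x @ (W + B @ A)^T: frozen base weight plus a rank-r update.

    x: (n, d_in), W: (d_out, d_in),
    A: (r, d_in), B: (d_out, r) with r = 8 per the PR.
    Only A and B are trained at test time; W stays frozen.
    """
    return x @ W.T + (x @ A.T) @ B.T
```

Initializing B to zeros (the usual LoRA convention) makes the adapter a no-op before any test-time updates.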
Regularization
weight decay
parameters: null
Compression
zstd
level: 22
Other
magnitude pruning
Magnitude pruning of the smallest 3% of weights post-training to improve compression efficiency.
parameters: {"pruned_fraction":0.03}
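The post-training pruning step can be sketched as below: the smallest 3% of weights by magnitude are zeroed, which increases redundancy in the artifact and helps the zstd stage. Tie values at the threshold may prune slightly more than the exact fraction.

```python
import numpy as np

def prune_smallest(w, fraction=0.03):
    """Zero out the smallest `fraction` of weights by magnitude (a sketch)."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    # magnitude of the k-th smallest weight is the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= threshold] = 0.0
    return out
```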

Novel Contributions

  • 10-layer Transformer with 3.0x MLP expansion
  • SmearGate local token blending mechanism
  • BigramHash embedding with 4096 buckets
  • U-Net style skip connections in the Transformer
  • Mixed int6 quantization-aware training with per-row scaling
  • Muon optimizer extended with weight decay
  • Stochastic Weight Averaging during the final half of training
  • Sliding-window evaluation with stride 64
  • Test-time training using batched LoRA adapters of rank 8
  • Magnitude pruning of 3% of weights
  • Zstandard level 22 artifact compression