PR #361

open

feat: Ultimate SOTA submission - 10L Model, Mixed Int6 QAT, and TTT/LoRA Evaluation

by adityagupta26
val_bpb
1.1400
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB

Training Techniques

Architecture
10L Transformer
Increased model depth to 10 Transformer layers.
parameters: {"layers":10}
MLP3x
Expanded the MLP hidden size to 3.0x the base dimension.
parameters: {"expansion_ratio":3}
SmearGate
Learned gating mechanism to blend information between adjacent tokens for local context.
parameters: null
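A minimal sketch of the SmearGate idea described above: each token's activation is blended with its predecessor through a learned sigmoid gate. The gate parameterization (per-dimension weight and bias) is an assumption; the PR does not specify the exact form.

```python
import numpy as np

def smear_gate(x, w_gate, b_gate):
    """Blend each token with the previous token via a learned sigmoid gate.

    x: (seq_len, dim) token activations
    w_gate, b_gate: per-dimension gate parameters (hypothetical shapes)
    """
    gate = 1.0 / (1.0 + np.exp(-(x * w_gate + b_gate)))  # sigmoid gate in [0, 1]
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0  # the first token has no predecessor
    # gate -> 0 keeps the token unchanged; gate -> 1 copies the predecessor
    return (1.0 - gate) * x + gate * prev
```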
BigramHash
Token-pair hashing embedding with 4096 buckets to capture bigram statistics at the input level.
parameters: {"buckets":4096}
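The BigramHash embedding can be sketched as follows: each (previous token, current token) pair is hashed into one of 4096 buckets, and the bucket indexes a learned embedding table that is added at the input. The multiplicative hash constant is an assumption, not taken from the PR.

```python
import numpy as np

N_BUCKETS = 4096  # from the PR parameters

def bigram_bucket(prev_id, cur_id, n_buckets=N_BUCKETS):
    # Hash the token pair into a bucket (the hash function is an assumption)
    return (prev_id * 1000003 + cur_id) % n_buckets

def bigram_embed(token_ids, table):
    """Look up a learned bigram embedding for every position.

    token_ids: (seq_len,) int array
    table: (n_buckets, dim) learned bigram embedding matrix
    """
    prev = np.concatenate([[0], token_ids[:-1]])  # pad position 0 with token 0
    buckets = (prev * 1000003 + token_ids) % table.shape[0]
    return table[buckets]
```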
U-Net skip connections
Added encoder-decoder style skip connections to stabilize gradient flow in deeper networks.
parameters: null
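The skip-connection scheme above can be sketched like this: outputs of the first half of the layer stack are saved and added back to the inputs of the mirrored layers in the second half. The exact pairing in the PR is not shown, so the mirroring here is an assumption.

```python
def unet_forward(x, layers):
    """Run layers with U-Net style skips: the output of layer i in the
    first half is added to the input of the mirrored layer in the
    second half (mirroring scheme is an assumption)."""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= half and skips:
            x = x + skips.pop()  # mirrored skip connection
        x = layer(x)
        if i < half:
            skips.append(x)
    return x
```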
Quantization
mixed int6 QAT
bits: 6
scope: all
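A sketch of the int6 fake quantization with per-row scaling (per-row scales are mentioned in the contributions list below). In QAT, weights pass through this quantize-dequantize step during the forward pass so the model learns to tolerate the rounding; the exact scheme in the PR is not shown.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Quantize-dequantize to signed int6 with one scale per row (a sketch)."""
    qmax = 2 ** (6 - 1) - 1  # 31 for signed int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights used in the forward pass
```

With per-row scales, the maximum rounding error in each row is half of that row's quantization step.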
Optimizer
Muon
weight_decay: true (value not specified)
momentum: null
Weight Averaging
SWA
parameters: {"start_fraction":0.5}
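Stochastic Weight Averaging with start_fraction 0.5 means the final weights are the average of checkpoints from the second half of training. A minimal sketch over a list of checkpoint dicts:

```python
def swa_average(checkpoints, start_fraction=0.5):
    """Average parameters over the last (1 - start_fraction) of training,
    matching the PR's start_fraction = 0.5."""
    start = int(len(checkpoints) * start_fraction)
    tail = checkpoints[start:]
    n = len(tail)
    # uniform average of each parameter over the selected checkpoints
    return {k: sum(c[k] for c in tail) / n for k in tail[0]}
```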
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
LoRA TTT
parameters: {"rank":8}
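The LoRA forward pass used for test-time training can be sketched as follows: the frozen base weight W is augmented with a rank-8 update B @ A, and only A and B are adapted on the test sequence. The shapes below are conventional LoRA shapes, not taken from the PR.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """y = x @ (W + B @ A)^T: frozen base weight plus a rank-r update.

    x: (n, d_in), W: (d_out, d_in),
    A: (r, d_in), B: (d_out, r) with r = 8 per the PR.
    Only A and B are trained at test time; W stays frozen.
    """
    return x @ W.T + (x @ A.T) @ B.T
```

Initializing B to zeros (the usual LoRA convention) makes the adapter a no-op before any test-time updates.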
Regularization
weight decay
parameters: null
Compression
zstd
level: 22
Other
magnitude pruning
Magnitude pruning of the smallest 3% of weights post-training to improve compression efficiency.
parameters: {"pruned_fraction":0.03}
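The post-training pruning step can be sketched as below: the smallest 3% of weights by magnitude are zeroed, which increases redundancy in the artifact and helps the zstd stage. Tie values at the threshold may prune slightly more than the exact fraction.

```python
import numpy as np

def prune_smallest(w, fraction=0.03):
    """Zero out the smallest `fraction` of weights by magnitude (a sketch)."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    # magnitude of the k-th smallest weight is the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= threshold] = 0.0
    return out
```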

Novel Contributions

  • 10-layer Transformer with 3.0x MLP expansion
  • SmearGate local token blending mechanism
  • BigramHash embedding with 4096 buckets
  • U-Net style skip connections in the Transformer
  • Mixed int6 quantization-aware training with per-row scaling
  • Muon optimizer extended with weight decay
  • Stochastic Weight Averaging during the final half of training
  • Sliding-window evaluation with stride 64
  • Test-time training using batched LoRA adapters of rank 8
  • Magnitude pruning of 3% of weights
  • Zstandard level 22 artifact compression