PR #150
openRecord: 11L Int6 QAT + SmearGate + OrthoInit + SWA + TTT (val_bpb=1.1478)
by yahya010
val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.76 MB
Training Techniques
Quantization
STE QAT
Quantization-aware training with a straight-through estimator
parameters: {"bits":6,"scope":"all"}
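A minimal NumPy sketch of the int6 fake-quantization forward pass. The per-tensor symmetric scale is an assumption (the record only fixes bits=6, scope=all); during QAT the straight-through estimator treats the round() as identity in the backward pass.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization (forward pass of STE QAT).

    Weights are rounded to one of 2**bits levels and immediately dequantized;
    in training, the straight-through estimator passes gradients through
    round() unchanged, so they reach the full-precision master weights.
    """
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6
    scale = np.max(np.abs(w)) / qmax + 1e-12     # per-tensor scale (an assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.73, -0.31, 0.02, 1.0])
w_q = fake_quant_int6(w)                         # dequantized int6 view of w
```

Because inference uses the same quantized forward pass the network was trained with, the quantization gap at eval time can be driven to zero, as the record claims.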
Architecture
SmearGate
Learned sigmoid token blending
parameters: null
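The record lists no parameters for SmearGate, so the exact gating form is unknown; a plausible minimal sketch blends each token with its predecessor through a learned sigmoid gate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    """Blend each token with the previous one: out_t = (1-g)*x_t + g*x_{t-1}.

    x: (seq, dim) activations; gate_logit: learned scalar (hypothetical form,
    since the record lists no parameters). x_{-1} is taken to be zero.
    """
    g = sigmoid(gate_logit)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right one step
    return (1.0 - g) * x + g * prev

x = np.arange(6.0).reshape(3, 2)
y = smear_gate(x, gate_logit=-10.0)   # gate near 0: output stays close to input
```

A strongly negative logit leaves tokens untouched, while a positive one smears information forward, so the network can learn how much local mixing helps.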
BigramHash
Hash embedding for bigrams
parameters: {"buckets":2048,"dim":128}
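A sketch of hash-bucketed bigram embeddings with the record's 2048 buckets and dim 128. The hash function itself is not specified, so the multiplicative mix below is an arbitrary choice:

```python
import numpy as np

BUCKETS, DIM = 2048, 128            # from the record's parameters

def bigram_bucket(prev_id, cur_id):
    """Hash the (previous, current) token pair into one of BUCKETS slots.

    The mixing constant is an arbitrary prime; the record only fixes the
    bucket count and embedding dimension, not the hash.
    """
    return (prev_id * 1_000_003 + cur_id) % BUCKETS

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in practice

def bigram_features(token_ids):
    """Per-position bigram embedding, added to the ordinary token embedding."""
    prev = [0] + list(token_ids[:-1])                       # pad first position
    idx = [bigram_bucket(p, c) for p, c in zip(prev, token_ids)]
    return bigram_table[idx]                                # (seq, DIM)

feats = bigram_features([5, 17, 17, 9])
```

Hash collisions are tolerated by design: the table stays small (2048 x 128) while still letting the model memorize frequent token pairs cheaply.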
MLP3x
Expanded MLP hidden size to 3x the model dimension
parameters: {"hidden":1536}
tied embeddings
FP16 tied input/output embeddings
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
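A NumPy sketch of grouped-query attention with the record's 8 query heads and 4 KV heads (the causal mask is omitted for brevity):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 16   # from the record: 2 query heads per KV head

def gqa(q, k, v):
    """Grouped-query attention: each KV head is shared by HEADS // KV_HEADS
    query heads, halving the KV cache here without reducing query heads.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM).
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)                 # expand KV heads to match q
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(HEADS, 10, HEAD_DIM)),
          rng.normal(size=(KV_HEADS, 10, HEAD_DIM)),
          rng.normal(size=(KV_HEADS, 10, HEAD_DIM)))
```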
NTK-RoPE
Rotary positional embeddings with NTK scaling
parameters: {"base":50000}
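A sketch of rotary embeddings with the record's NTK-scaled base of 50000 (versus the common 10000); raising the base stretches the rotation wavelengths so longer contexts stay closer to the training distribution:

```python
import numpy as np

def apply_rope(x, base=50000.0):
    """Rotary embeddings on (seq, head_dim): consecutive feature pairs are
    rotated by position-dependent angles. NTK scaling raises `base`, which
    slows the per-dimension rotation frequencies."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # (d//2,) frequencies
    angles = np.outer(np.arange(seq), inv_freq)     # (seq, d//2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

x = np.random.default_rng(0).normal(size=(32, 16))
y = apply_rope(x)
```

Because each pair is a pure rotation, vector norms are preserved and relative position falls out of the dot product between rotated queries and keys.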
Optimizer
Muon
learning_rate: 0.025
weight_decay: 0.04
momentum: 0.99
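Muon's core step orthogonalizes the momentum-accumulated gradient with a quintic Newton-Schulz iteration. A sketch with the record's hyperparameters; the iteration coefficients follow the public Muon implementation, and decoupled weight decay is an assumption:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a 2D gradient: the quintic Newton-Schulz
    iteration drives singular values toward 1, keeping singular vectors."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        xxT = x @ x.T
        x = a * x + (b * xxT + c * xxT @ xxT) @ x
    return x

def muon_step(w, g, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon update with the record's hyperparameters (the exact weight-decay
    coupling is not recorded; decoupled decay is assumed here)."""
    buf = momentum * buf + g               # momentum accumulation
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(buf)
    return w, buf
```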
Weight Averaging
SWA
parameters: {"checkpoints":8,"warmdown":true,"interval_steps":200}
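The averaging itself is a uniform mean over parameter snapshots; per the record, 8 checkpoints taken every 200 steps during the learning-rate warmdown:

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of parameter snapshots (SWA). The snapshot schedule
    (8 checkpoints, 200-step interval, warmdown only) comes from the record."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

ckpts = [{"w": np.full((2, 2), float(i))} for i in range(8)]  # toy snapshots
avg = swa_average(ckpts)
```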
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
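A sketch of the sliding-window scoring schedule. Each window scores only its final `stride` tokens, with the rest as left context, so nearly every scored token sees a long history; the record fixes only stride=64, so the window size below is hypothetical:

```python
def score_spans(n_tokens, window=512, stride=64):
    """Yield (context_start, window_end, n_scored) spans covering the stream.

    Only the last `stride` tokens of each window contribute to the loss;
    `window` here is an assumed context length, not from the record.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = score_spans(200)
```

Every token is scored exactly once, so the resulting bits-per-byte is comparable to a single-pass evaluation, just with more context per token.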
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"freeze_first_blocks":2}
Initialization
OrthoInit
Orthogonal initialization with muP scaling for output projections
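A sketch of orthogonal initialization via QR of a Gaussian. The 1/sqrt(fan_in) gain for output projections is an assumption; the exact muP rule depends on the chosen parameterization:

```python
import numpy as np

def ortho_init(rows, cols, fan_in_scale=False, rng=None):
    """Orthogonal init from the QR factorization of a random Gaussian.

    With fan_in_scale=True (for output projections), the gain is divided by
    sqrt(fan_in); this muP-style factor is an assumed convention."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)                  # q has orthonormal columns
    q = q * np.sign(np.diag(r))             # fix the QR sign ambiguity
    w = q if rows >= cols else q.T          # orthonormal rows when rows < cols
    return w / np.sqrt(cols) if fan_in_scale else w

w = ortho_init(64, 64)                      # w.T @ w is the identity
```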
Novel Contributions
- 11-layer transformer with 3x MLP expansion
- STE int6 quantization-aware training with zero quantization gap
- SmearGate learned token blending
- BigramHash embedding augmentation
- OrthoInit with muP scaling for output projections
- SWA checkpoint averaging during warmdown
- Full-weight test-time training on validation data
- NTK-RoPE positional encoding
- Sliding window evaluation with stride 64