PR #1454

open

Non-record: TurboQuant + N-gram Hybrid Eval + TTT (1xH100 NVL)

by bsisduck
val_bpb: 0.3509
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.92 MB

Training Techniques

Quantization
  • int5 (bits: 5, scope: all)
  • late QAT (bits: null, scope: all)
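The card does not spell out TurboQuant's random-rotation int5 scheme, so the following is only a minimal sketch of the general idea: rotate weights with a random orthogonal matrix to spread outliers, then apply symmetric 5-bit quantization. All function names are my own, not the PR's.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Sign-fix columns so the rotation is well-defined.
    return q * np.sign(np.diag(r))

def quantize_int5(w, rot):
    """Rotate, then symmetric per-tensor int5 quantization (levels -15..15)."""
    x = rot @ w
    scale = np.abs(x).max() / 15.0
    q = np.clip(np.round(x / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale, rot):
    """Invert the rotation after rescaling the integer codes."""
    return rot.T @ (q.astype(np.float32) * scale)

w = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
rot = random_rotation(64)
q, s = quantize_int5(w, rot)
w_hat = dequantize(q, s, rot)
```

Because the rotation is orthogonal, quantization error in the rotated space maps back with the same overall magnitude, which is the usual motivation for rotating before low-bit quantization.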
Architecture
  • Gated Attention: gated attention used in all 9 layers ({"layers":9})
  • XSA: XSA attention used on the last 4 layers ({"layers":4})
  • BigramHash: bigram hash component for token statistics ({"dimensions":1536})
  • Partial RoPE: RoPE applied to a subset of dimensions ({"dimensions":16,"total_dimensions":64})
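Partial RoPE with 16 rotary dimensions out of 64 can be sketched as below: rotary embeddings are applied to the first 16 dimensions of each head and the remaining 48 pass through unchanged. The function name and the base frequency of 10000 are illustrative assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply RoPE to the first `rotary_dims` dims; pass the rest through.
    x: (seq_len, head_dim) array."""
    seq_len, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Non-rotary dimensions are left untouched.
    return np.concatenate([rotated, x[:, rotary_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 64))
y = partial_rope(x, rotary_dims=16)
```

Since the rotation is norm-preserving and position 0 gets zero angles, the first row is unchanged and vector norms are preserved at every position.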
  • Value Residual: value residual connections across early layers ({"layers":"1-8"})
  • SmearGate: position-mixing gate used in the model (parameters: null)
Weight Averaging
  • EMA ({"decay":0.997})
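EMA weight averaging with the listed decay of 0.997 can be sketched as a shadow copy of the parameters updated after each optimizer step. The `EMA` class below is illustrative, not the PR's code.

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (decay 0.997 per the card)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.997)
for _ in range(10):
    params["w"] += 1.0   # pretend an optimizer step moved the weights
    ema.update(params)
```

With a decay this close to 1, the averaged weights lag the raw weights substantially, which is the point: evaluation uses the smoother trajectory.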
Evaluation
  • stride-based eval ({"stride":384})
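Stride-based evaluation with stride 384 presumably slides a long context window over the sequence and scores only the newest tokens of each window, so every scored token keeps ample left context. A hedged sketch with a stand-in uniform model follows; the helper names and the 1024-token context length are assumptions.

```python
import numpy as np

def strided_eval(token_ids, score_window, context_len=1024, stride=384):
    """Sliding-window eval: each window advances by `stride` and only the
    newest `stride` positions are scored. `score_window(window)` returns
    per-token negative log2-probabilities (bits)."""
    total_bits, total_tokens = 0.0, 0
    for start in range(0, len(token_ids), stride):
        end = min(start + stride, len(token_ids))
        window = token_ids[max(0, end - context_len):end]
        nll = score_window(window)
        scored = nll[len(nll) - (end - start):]  # score only the new tokens
        total_bits += scored.sum()
        total_tokens += len(scored)
    # Bits per scored token; true bpb would divide by byte count instead.
    return total_bits / total_tokens

# Hypothetical stand-in model: uniform over 256 symbols -> 8 bits/token.
uniform = lambda w: np.full(len(w), 8.0)
tokens = np.zeros(2000, dtype=np.int64)
bpt = strided_eval(tokens, uniform, context_len=1024, stride=384)
```

A smaller stride improves per-token context at the cost of proportionally more forward passes, which is the trade-off the stride parameter tunes.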
Test-Time Training
  • LoRA TTT ({"rank":8,"learning_rate":0.005,"temperature":1.1,"polyak_decay":0.995})
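The LoRA TTT recipe (rank 8, learning rate 0.005, temperature 1.1, Polyak decay 0.995) could be read as online SGD on a low-rank adapter over a frozen base weight, with a Polyak-averaged copy of the adapter used for prediction. The self-supervised target below is a stand-in; only the hyperparameters come from the card.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr, polyak, temp = 32, 8, 0.005, 0.995, 1.1

W = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen base weight
A = np.zeros((rank, d))                        # LoRA down-projection
B = rng.standard_normal((d, rank)) * 0.01      # LoRA up-projection
A_avg, B_avg = A.copy(), B.copy()              # Polyak-averaged adapter

def forward(x, A_, B_):
    return (W + B_ @ A_) @ x

for step in range(50):                         # online test-time steps
    x = rng.standard_normal(d)
    target = np.tanh(W @ x) * 1.5              # hypothetical TTT target
    err = forward(x, A, B) - target
    # SGD on the adapter only; the base weight W stays frozen.
    A -= lr * (B.T @ np.outer(err, x))
    B -= lr * (np.outer(err, x) @ A.T)
    # Polyak averaging of the adapter for prediction.
    A_avg = polyak * A_avg + (1 - polyak) * A
    B_avg = polyak * B_avg + (1 - polyak) * B

logits = forward(rng.standard_normal(d), A_avg, B_avg) / temp  # eval temperature
```

Initializing `A` to zero keeps the adapted model identical to the base model at step 0, the standard LoRA convention, while Polyak averaging damps the noise of per-example updates.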
Regularization
  • complementary training ({"alpha":0.5})
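The card does not define "complementary training". One focal-style reading of "emphasize harder predictions" with alpha = 0.5 is to down-weight tokens the model already predicts confidently; this is an assumption about the objective, not the PR's actual loss.

```python
import numpy as np

def complementary_loss(logits, targets, alpha=0.5):
    """Hypothetical reading of 'complementary training': weight each token's
    cross-entropy by (1 - p_target)**alpha so easy, confident predictions
    contribute less and harder tokens dominate the gradient.
    logits: (n, vocab); targets: (n,) integer class indices."""
    z = logits - logits.max(axis=-1, keepdims=True)        # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_t = p[np.arange(len(targets)), targets]
    ce = -np.log(p_t)
    return ((1.0 - p_t) ** alpha * ce).mean()

logits = np.array([[4.0, 0.0, 0.0],    # easy: model confident and correct
                   [0.2, 0.1, 0.0]])   # hard: nearly uniform
targets = np.array([0, 2])
loss = complementary_loss(logits, targets, alpha=0.5)
```

With this weighting the confident first token contributes almost nothing, so the average is dominated by the hard second token.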
Compression
  • lzma (level: null)
Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, newton_schulz: true)

Novel Contributions

  • TurboQuant random-rotation int5 quantization
  • Entropy-adaptive order-9 n-gram backoff cache
  • Complementary training to emphasize harder predictions
  • LoRA test-time training with tuned temperature and Polyak averaging
  • Hybrid neural + n-gram evaluation
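The order-9 n-gram backoff cache from the list above can be sketched as counts over contexts of up to 8 tokens, backing off to shorter contexts at query time until one has been seen. Smoothing and the entropy-adaptive part are omitted, and the class is illustrative rather than the PR's implementation.

```python
from collections import defaultdict

class BackoffNgram:
    """Minimal order-9 n-gram backoff cache (unsmoothed, illustrative)."""
    def __init__(self, order=9, vocab=256):
        self.order, self.vocab = order, vocab
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Record each token under all of its trailing contexts (length 1..8).
        for i in range(1, len(tokens)):
            for n in range(1, min(self.order, i + 1)):
                self.counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def prob(self, context, token):
        # Back off from the longest seen context down to length 1.
        for n in range(min(self.order - 1, len(context)), 0, -1):
            dist = self.counts.get(tuple(context[-n:]))
            if dist:
                return dist.get(token, 0) / sum(dist.values())
        return 1.0 / self.vocab  # final backoff: uniform over the vocabulary

model = BackoffNgram(order=9)
model.update([1, 2, 3, 1, 2, 3, 1, 2])
p = model.prob([1, 2], 3)
```

In a hybrid evaluation, a probability like `p` would be interpolated with the neural model's token probability, with the mixing weight plausibly driven by the n-gram context's entropy, per the "entropy-adaptive" wording above.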