PR #1452
closed
Non-record: TurboQuant + N-gram Hybrid Eval + TTT (1xH100 NVL)
by bsisduck
val_bpb
0.3509
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
14.92 MB
Training Techniques
Quantization
int5
bits: 5
scope: all
late QAT
bits: null
scope: all
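The exact TurboQuant procedure isn't spelled out in this card, but random-rotation int5 quantization can be sketched as follows: apply a random orthogonal rotation to spread outliers across coordinates, then round symmetrically to 5-bit levels. All function names and the QR-based rotation are illustrative assumptions, not the PR's code.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR of a Gaussian (illustrative choice)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the rotation is uniformly distributed
    return q * np.sign(np.diag(r))

def quantize_int5(w, rot):
    """Rotate, then symmetric 5-bit quantization with levels in [-15, 15]."""
    x = w @ rot
    scale = np.abs(x).max() / 15.0
    q = np.clip(np.round(x / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale, rot):
    """Undo quantization, then undo the rotation."""
    return (q.astype(np.float32) * scale) @ rot.T

w = np.random.default_rng(1).standard_normal((8, 8)).astype(np.float32)
rot = random_rotation(8)
q, s = quantize_int5(w, rot)
w_hat = dequantize_int5(q, s, rot)
```

The rotation makes per-tensor symmetric scaling much less sensitive to outlier weights, which is the usual motivation for rotation-based quantizers.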
Architecture
Gated Attention
Gated attention used in all layers.
parameters: {"layers":9}
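Gated attention is commonly implemented as an elementwise sigmoid gate, computed from the layer input, multiplying the attention output. The exact gate placement in this PR isn't stated, so the formulation below is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention_output(x, attn_out, W_gate):
    """Elementwise output gate computed from the layer input x.
    One common gated-attention form; gate placement here is assumed."""
    return sigmoid(x @ W_gate) * attn_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # (seq, d_model) layer input
attn_out = rng.standard_normal((4, 8))  # raw attention output
W_gate = rng.standard_normal((8, 8)) * 0.1
y = gated_attention_output(x, attn_out, W_gate)
```

Because the gate lies in (0, 1), it can only attenuate the attention output, never amplify it.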
XSA
XSA attention used on the last 4 layers.
parameters: {"layers":4}
BigramHash
Bigram hash component used for hybrid n-gram modeling.
parameters: {"size":1536}
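A bigram hash component typically hashes each (previous token, current token) pair into a fixed-size table of learned features, here of size 1536. The hash multiplier and feature width below are illustrative assumptions.

```python
import numpy as np

TABLE_SIZE = 1536  # matches the "size" parameter above

def bigram_bucket(prev_tok, tok, table_size=TABLE_SIZE):
    """Hash a (prev_token, token) pair into one of table_size buckets.
    The multiplier 1000003 is an arbitrary odd constant (illustrative)."""
    return (prev_tok * 1000003 + tok) % table_size

# Toy usage: look up a learned per-bucket feature for each bigram.
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, 16)).astype(np.float32)
tokens = [5, 17, 5, 17]
feats = np.stack([bigram_table[bigram_bucket(a, b)]
                  for a, b in zip(tokens[:-1], tokens[1:])])
```

Hash collisions are tolerated by design; the table trades a little noise for covering the full bigram space in 1536 rows.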
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
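With "16/64" partial RoPE, the rotary embedding rotates only the first 16 of 64 head dimensions; the remaining 48 carry no positional signal. A minimal sketch (the base frequency 10000 and pairing convention are the standard choices, assumed here):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Apply rotary position embedding to the first rot_dims of the head
    dimension; leave the remaining dims untouched (16 of 64 here)."""
    d = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(d) / d))
    angles = pos[:, None] * freqs[None, :]          # (seq, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :d], x[:, d:rot_dims]             # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((4, 64))
pos = np.arange(4)
y = partial_rope(x, pos)
```

The rotation is norm-preserving on the rotated slice and the identity at position 0.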
Value Residual
Value residual connections used across early layers.
parameters: {"layers":"1-8"}
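Value residual connections mix each layer's value vectors with the first layer's values. In the usual formulation the mixing weight is learned per layer; the PR gives no value, so it is just an argument in this sketch:

```python
import numpy as np

def value_residual(v_layer, v_first, lam):
    """Mix layer l's value vectors with layer 1's values (v_first).
    lam is normally a learned per-layer scalar; passed in here."""
    return lam * v_layer + (1.0 - lam) * v_first

v_first = np.ones((3, 4))
v_layer = np.zeros((3, 4))
mixed = value_residual(v_layer, v_first, lam=0.75)
```

Applying this across layers 1-8 (as the card states) gives later layers direct access to the undistorted first-layer values.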
SmearGate
Position-mixing gate used in the model.
parameters: null
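The card gives no parameters for SmearGate, so the sketch below is a guess at the usual "smear" idea: each position's representation is mixed with the previous position's, scaled by a gate. The additive form and scalar gate are assumptions; gates are typically learned.

```python
import numpy as np

def smear(x, gate):
    """Mix each position's embedding with the previous position's.
    Additive form with a scalar gate; details are assumptions."""
    out = x.copy()
    out[1:] = x[1:] + gate * x[:-1]
    return out

x = np.arange(12, dtype=np.float64).reshape(4, 3)  # (seq, dim)
y = smear(x, gate=0.5)
```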
Weight Averaging
EMA
parameters: {"decay":0.997}
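EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters updated each step; evaluation uses the shadow copy, not the raw weights. A minimal sketch:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    """avg <- decay * avg + (1 - decay) * params, applied every step."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

# Toy run: with constant params, the average approaches them as 1 - decay^t.
params = {"w": np.ones(3)}
avg = {"w": np.zeros(3)}
for _ in range(1000):
    avg = ema_update(avg, params, decay=0.997)
```

With decay 0.997 the effective averaging horizon is roughly 1/(1-0.997) ≈ 333 steps.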
Evaluation
stride-based eval
parameters: {"stride":384}
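Stride-based eval slides a context window over the token stream but scores only the final `stride` (here 384) positions of each window, so every token is evaluated exactly once with maximal left context. A sketch of the window schedule (the context length 1024 is illustrative; the card only gives the stride):

```python
def stride_windows(n_tokens, ctx=1024, stride=384):
    """Return (start, end, score_from) triples: each window re-reads up
    to ctx tokens of context but only scores positions score_from..end,
    so every token is scored exactly once."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - ctx)
        windows.append((start, end, pos))
        pos = end
    return windows
```

A smaller stride gives each scored token more context at the cost of more forward passes.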
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.005,"polyak_decay":0.995,"temperature":1.1}
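LoRA test-time training adapts low-rank adapters on the evaluation context while the base weights stay frozen, with Polyak-averaged adapter copies used for prediction and temperature-scaled outputs. The sketch below uses the card's rank 8, lr 0.005, Polyak decay 0.995, and temperature 1.1 on a toy linear softmax model; the step count, init scale, and model are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lora_ttt(W, x, y, rank=8, lr=0.005, polyak=0.995, steps=20, seed=0):
    """SGD on LoRA adapters (A, B) over the eval context (x, y);
    the frozen base W is never updated. Returns Polyak-averaged adapters."""
    d, V = W.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((rank, V)) * 0.01
    B = np.zeros((d, rank))                  # standard LoRA init: B = 0
    A_avg, B_avg = A.copy(), B.copy()
    onehot = np.eye(V)[y]
    for _ in range(steps):
        logits = x @ (W + B @ A)
        g = (softmax(logits) - onehot) / len(x)  # dLoss/dlogits
        gW = x.T @ g                             # grad w.r.t. W + B @ A
        B -= lr * gW @ A.T
        A -= lr * B.T @ gW
        A_avg = polyak * A_avg + (1 - polyak) * A
        B_avg = polyak * B_avg + (1 - polyak) * B
    return A_avg, B_avg

def predict(W, A, B, x, temperature=1.1):
    """Temperature-scaled probabilities with the averaged adapters."""
    return softmax(x @ (W + B @ A) / temperature)
```

Temperature 1.1 slightly flattens the adapted distribution, which often helps bpb when the adapters overfit the short eval context.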
Compression
lzma
level: null
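The 14.92 MB artifact is lzma-compressed; quantized int5 weights stored one value per byte have only ~5 bits of entropy each, so lzma recovers much of that slack. A minimal round-trip sketch (the preset is illustrative; the card lists the level as null):

```python
import lzma
import numpy as np

# Fake int5 payload: values in [-15, 15], one per int8 byte.
q = np.random.default_rng(0).integers(-15, 16, size=10000, dtype=np.int8)
blob = lzma.compress(q.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```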
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":true}
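The `newton_schulz` flag refers to Muon's approximate orthogonalization of each gradient matrix via a quintic Newton-Schulz iteration before the update. A sketch using the coefficients published with Muon (3.4445, -4.7750, 2.0315); the step count and epsilon are the usual defaults, assumed here:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (drive singular values toward 1)
    via the quintic Newton-Schulz iteration used in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((8, 8))
O = newton_schulz_orth(G)
sv = np.linalg.svd(O, compute_uv=False)
```

The iteration only needs matmuls, so it runs cheaply on GPU in the optimizer's low-precision dtype; the singular values land near 1 rather than exactly at 1.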
Regularization
complementary training
parameters: {"alpha":0.5}
Other
other
Entropy-adaptive order-9 n-gram backoff cache mixed with neural probabilities during evaluation.
parameters: {"order":9,"min_count":1}
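The description above can be sketched as two pieces: an order-9 backoff cache of n-gram counts built over the eval stream (min_count 1, so any observed context is usable), and an entropy-dependent mixing weight that trusts the n-gram distribution more when the neural model is uncertain. The backoff scheme (stupid backoff to a uniform fallback) and the linear entropy-to-weight mapping are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class BackoffCache:
    """Order-9 n-gram counts with backoff over the eval stream."""
    def __init__(self, order=9, min_count=1, vocab=256):
        self.order, self.min_count, self.vocab = order, min_count, vocab
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record every context of length 1..order-1 seen so far."""
        for i in range(1, len(tokens)):
            for n in range(1, self.order):
                if i - n < 0:
                    break
                self.counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def prob(self, context):
        """Back off from the longest matching context to shorter ones."""
        for n in range(self.order - 1, 0, -1):
            c = self.counts.get(tuple(context[-n:]))
            if c and sum(c.values()) >= self.min_count:
                p = np.zeros(self.vocab)
                total = sum(c.values())
                for tok, cnt in c.items():
                    p[tok] = cnt / total
                return p
        return np.full(self.vocab, 1.0 / self.vocab)  # uniform fallback

def entropy_adaptive_mix(p_neural, p_ngram):
    """Weight the n-gram distribution more when the neural model's
    entropy is high. The linear mapping is an illustrative choice."""
    H = -(p_neural * np.log(p_neural + 1e-12)).sum()
    w = H / np.log(len(p_neural))       # normalized entropy in [0, 1]
    return (1 - w) * p_neural + w * p_ngram
```

Because the cache is built from the eval stream itself during scoring, repeated patterns in the test data get sharp n-gram predictions exactly where the neural model is least confident.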
Novel Contributions
- TurboQuant random-rotation int5 quantization
- Entropy-adaptive order-9 n-gram backoff cache
- Complementary training with alpha=0.5
- LoRA test-time training with tuned temperature and Polyak averaging
- Hybrid neural + n-gram evaluation