PR #918

Status: open

Record: TurboQuant + Full-Rescore N-gram (val_bpb=0.1653)

by haikosys
val_bpb: 0.1653
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding module used in the model.
parameters: {"dimensions":128,"hash_size":2048}
SmearGate
SmearGate component included in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to part of the model.
parameters: {"dimensions":16}
LeakyReLU
Squared LeakyReLU activation used in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
weight tying
Tied input and output embeddings.
parameters: null
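As a rough sketch of how a bigram hash embedding like the one above could work (the hashing scheme and mixing constants here are assumptions, not the PR's actual code): each (previous, current) token pair is hashed into one of hash_size=2048 buckets, each holding a learned dimensions=128 vector that is added to the token embedding.

```python
HASH_SIZE = 2048   # "hash_size" parameter above
DIM = 128          # "dimensions" parameter above

# Toy embedding table; in training this would be a learned parameter.
table = [[0.0] * DIM for _ in range(HASH_SIZE)]

def bigram_bucket(prev_token: int, cur_token: int) -> int:
    """Hash a (prev, cur) token pair into one of HASH_SIZE buckets.
    The multiplier and xor-shift are illustrative, not from the PR."""
    h = (prev_token * 1000003 + cur_token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % HASH_SIZE

def bigram_embedding(prev_token: int, cur_token: int) -> list:
    """Look up the (learned) vector for this bigram bucket."""
    return table[bigram_bucket(prev_token, cur_token)]
```

Collisions are expected and tolerated: with 2048 buckets many bigrams share a vector, which is the usual size/quality trade-off for hash embeddings.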
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embeddings_lr":0.035,"scalars_lr":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every_n_steps":50}
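EMA weight averaging with decay=0.997 keeps a shadow copy of the weights that is updated after every optimizer step (SWA additionally averages snapshots taken every 50 steps; only the EMA update is sketched here):

```python
def ema_update(shadow, weights, decay=0.997):
    """Update the EMA shadow parameters in place:
    shadow <- decay * shadow + (1 - decay) * weights."""
    for i, w in enumerate(weights):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * w
    return shadow
```

Evaluation then uses the shadow weights, which smooths out step-to-step noise in the final checkpoint.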
Quantization
QAT
bits: 2
scope: MLP up
QAT
bits: 3
scope: attn/MLP down
QAT
bits: 4
scope: embeddings
STE QAT
bits: null
scope: all
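Quantization-aware training with a straight-through estimator (STE) typically fake-quantizes weights in the forward pass (snap to the b-bit grid, then dequantize) while the backward pass treats the operation as the identity. A minimal sketch of the fake-quant step, assuming symmetric per-tensor scaling (the PR's exact scheme is not shown):

```python
def fake_quantize(weights, bits):
    """Quantize-dequantize a list of floats onto a symmetric b-bit grid.
    During QAT the backward pass passes gradients straight through (STE)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 3 for 3-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0                             # all-zero tensor edge case
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))            # clamp to representable range
        out.append(q * scale)
    return out
```

The per-scope bit widths above (2-bit MLP up, 3-bit attn/MLP down, 4-bit embeddings) would each get their own grid.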
Compression
lzma
level: 6
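The 15.35 MB artifact size suggests the packed quantized weights are then lzma-compressed at preset level 6; with Python's standard lzma module that step might look like:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # xz/LZMA at preset 6 (the "level: 6" above); higher presets trade
    # compression time for a smaller artifact.
    return lzma.compress(raw, preset=6)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```

Low-bit quantization helps here twice: the packed stream is smaller to begin with, and its reduced symbol alphabet compresses better.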
Evaluation
sliding window eval
parameters: {"stride":64}
full-rescore n-gram cache
parameters: {"order_min":2,"order_max":12,"entropy_adaptive_alpha":true}
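The full-rescore n-gram cache blends n-gram statistics (orders 2 through 12) into the model's next-token distribution. The exact blending rule is not given; one plausible reading of "entropy_adaptive_alpha" is to weight the n-gram cache more when its prediction is low-entropy (confident). A hypothetical sketch:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def blend(model_probs, ngram_probs, alpha_max=0.5):
    """Mix n-gram and model distributions with a weight that shrinks as
    the n-gram distribution's entropy grows. alpha_max and the linear
    entropy normalization are illustrative choices, not from the PR."""
    h = entropy(ngram_probs)
    h_max = math.log(len(ngram_probs))      # entropy of the uniform distribution
    alpha = alpha_max * (1.0 - h / h_max) if h_max > 0 else 0.0
    return [(1 - alpha) * m + alpha * n
            for m, n in zip(model_probs, ngram_probs)]
```

When the n-gram distribution is uniform (maximally uncertain), alpha collapses to 0 and the model's prediction is used unchanged.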
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
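A warmdown schedule holds the learning rate constant and then decays it over the final warmdown_steps=3500 steps. A sketch, assuming linear decay to zero (a common choice for this schedule):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```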
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • TurboQuant rotation-based Lloyd-Max codebook quantization for weight compression
  • Progressive QAT warmdown from 4-bit to 3-bit to 2-bit
  • Two-pass full-rescore n-gram cache evaluation with entropy-adaptive alpha blending
  • Combining higher-parameter TurboQuant models with full-rescore n-gram cache to recover validation performance
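TurboQuant's rotation-based preprocessing is not detailed in this summary, but the Lloyd-Max step at its core is a 1-D k-means over the weight values: alternate between assigning each weight to its nearest codeword and moving each codeword to the mean of its assigned weights. A bare-bones sketch with the rotation omitted:

```python
def lloyd_max(values, bits, iters=20):
    """Fit a 2**bits codebook minimizing squared reconstruction error."""
    k = 2 ** bits
    lo, hi = min(values), max(values)
    # Initialize codewords evenly across the value range.
    codebook = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest codeword.
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: (v - codebook[i]) ** 2)
            buckets[j].append(v)
        # Update step: move each codeword to its bucket's mean.
        codebook = [sum(b) / len(b) if b else c
                    for b, c in zip(buckets, codebook)]
    return codebook
```

Unlike the uniform grids typical of plain QAT, a Lloyd-Max codebook places its levels where the weight distribution has mass, which is why it can afford fewer bits at the same reconstruction error.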