PR #2089 (open)

[Non-Record 16MB] Int8 QAT 7L d512 + Sliding Window + N-gram Backoff + TTT

by AlirezaAlampour
val_bpb: 1.2093
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"negative_slope":0.5}
weight tying
The input embedding and output projection share one weight matrix.
parameters: null
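
Weight tying in the usual decoder-only sense: the token embedding matrix doubles as the output projection. A minimal sketch (vocab size is illustrative):

    import torch.nn as nn

    d_model, vocab = 512, 50304          # vocab size is illustrative
    emb = nn.Embedding(vocab, d_model)
    lm_head = nn.Linear(d_model, vocab, bias=False)
    lm_head.weight = emb.weight          # one shared parameter tensor
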
U-Net skip connections
U-Net style encoder/decoder skip connections with learned skip weights.
parameters: null
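
One plausible wiring for a 7-layer stack (the entry doesn't spell it out): layers 0-2 act as the encoder, layer 3 as the bottleneck, and layers 4-6 as the decoder, with each encoder output added into its mirrored decoder layer through a learned scalar:

    import torch
    import torch.nn as nn

    class UNetStack(nn.Module):
        # Encoder output i feeds decoder layer (n - 1 - i) through a
        # learned scalar skip weight.
        def __init__(self, blocks: nn.ModuleList):   # 7 transformer blocks
            super().__init__()
            self.blocks = blocks
            self.half = len(blocks) // 2              # 3 skips for 7 layers
            self.skip_w = nn.Parameter(torch.ones(self.half))

        def forward(self, x):
            cache, n = [], len(self.blocks)
            for i, blk in enumerate(self.blocks):
                if i >= n - self.half:                # decoder layers 4..6
                    j = n - 1 - i                     # mirrored encoder index
                    x = x + self.skip_w[j] * cache[j]
                x = blk(x)
                if i < self.half:                     # encoder layers 0..2
                    cache.append(x)
            return x
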
Quantization
QAT
bits: 8
scope: per-row for weight matrices; per-tensor for vectors and scalars
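
A sketch of the int8 fake-quantization used during QAT for the per-row case (weight matrices), with a straight-through estimator so gradients reach the full-precision master weights; the per-tensor path for vectors and scalars is analogous:

    import torch

    def fake_quant_per_row(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
        # Symmetric per-row int8 fake quantization for a 2-D weight matrix.
        qmax = 2 ** (bits - 1) - 1                                # 127
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
        wq = (w / scale).round().clamp(-qmax, qmax) * scale
        # Straight-through estimator: forward sees wq, backward sees w.
        return w + (wq - w).detach()
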
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: null
momentum: 0.9382982028913158
other_params: null
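
For reference, a condensed sketch of the standard Muon update (momentum buffer, then approximate orthogonalization via Newton-Schulz iterations); Nesterov momentum and the shape-dependent LR scaling of the reference implementation are omitted here:

    import torch

    def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately orthogonalize G (coefficients from the public Muon code).
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + 1e-7)
        if G.size(0) > G.size(1):
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X
        if G.size(0) > G.size(1):
            X = X.T
        return X

    @torch.no_grad()
    def muon_step(p, buf, lr: float, momentum: float = 0.9382982028913158):
        buf.mul_(momentum).add_(p.grad)                 # momentum buffer
        update = newton_schulz(buf.bfloat16()).to(p.dtype)
        p.add_(update, alpha=-lr)
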
Compression
zlib
level: 9
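
Artifact packing as implied by the entry: serialize the quantized weights and zlib-compress at level 9 to land under the 16MB budget. The file name and serialization format are illustrative:

    import io
    import zlib
    import torch

    def pack(state_dict, path="artifact.bin.zlib"):
        buf = io.BytesIO()
        torch.save(state_dict, buf)
        blob = zlib.compress(buf.getvalue(), level=9)
        with open(path, "wb") as f:
            f.write(blob)
        return len(blob) / 2**20        # compressed size in MiB
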
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Token-level order-5 n-gram backoff cache, mixed with the neural model's probabilities at evaluation time.
parameters: {"max_order":5,"alpha":0.2}
Test-Time Training
score-first TTT
parameters: {"steps":1,"target":"last block + ln_f per chunk"}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
logit softcap
parameters: {"softcap":30}

Novel Contributions

  • Int8 per-row QAT with zlib-compressed artifact under the 16MB limit
  • Sliding window evaluation with stride 64
  • Order-5 n-gram backoff cache mixed into the neural distribution at evaluation time
  • Score-first test-time training on the last block and ln_f per chunk
  • GQA-based 7-layer decoder-only transformer with U-Net skip connections