PR #836

Status: open

Full-Training QAT: 1.1219 bpb

by autocode-rayes
val_bpb: 1.1219
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: (not reported)

Training Techniques

Quantization
QAT
bits: 6
scope: all
Architecture
LeakyReLU_LegalTTT_ParallelMuon
Existing SOTA Transformer architecture with LeakyReLU, LegalTTT, Parallel Muon, and related custom components.
parameters: null
XSA
Cross/self-attention variant used in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":"16/64"}
SmearGate
Custom gating mechanism included in the architecture.
parameters: null
BigramHash
Bigram hashing with bucketed representation.
parameters: {"buckets":2048}
MLP3x
MLP with 3x expansion and LeakyReLU(0.5)^2.
parameters: {"expansion":3}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"during":"warmdown"}
Evaluation
sliding-window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"epochs":3,"chunk_size":"32K"}
Compression
LZMA
level: null
LR Schedule
warmdown
parameters: null
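For context on the sliding-window eval entry above (stride 64): a minimal sketch of how stride-based window scheduling might work. The function name and the rule "score only the tokens not covered by the previous window" are illustrative assumptions, not taken from the submission's code:

```python
def sliding_window_spans(seq_len, max_len, stride):
    """Return (begin, end, n_scored) spans for sliding-window eval.

    Each window sees up to `max_len` tokens of context, but only the
    tokens past the previous window's end (at most `stride` of them)
    contribute to the loss, so every token is scored exactly once
    with near-full left context. Assumes stride <= max_len.
    """
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans
```

Summing the per-token negative log-likelihoods over the scored portion of each span and dividing by the byte count would then give a bits-per-byte figure comparable to val_bpb.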

Novel Contributions

  • Full-training QAT with int6 fake quantization enabled from step 1
  • Removing the mismatch between full-precision training and late-stage quantization noise
  • Using QAT_ENABLED=1 with LATE_QAT_THRESHOLD=1.0 to activate quantization immediately
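A minimal sketch of what int6 fake quantization enabled from step 1 could look like. This is plain Python with per-tensor symmetric scaling; the function name and scaling rule are illustrative assumptions, and a real QAT training loop would apply the rounding inside autograd with a straight-through estimator so gradients bypass it:

```python
def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap each weight to a
    signed `bits`-bit grid, then dequantize, so the forward pass sees
    quantization noise from the very first training step."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0    # 1.0 guards the all-zero case
    # Round to the integer grid, clipping to the signed range [-32, 31]
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```

With LATE_QAT_THRESHOLD=1.0, per the bullet above, this noise is active for the entire run rather than only a late fraction of training, so the weights the model converges to are already compatible with the int6 grid used at artifact-export time.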