val_bpb: 1.1890
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11MB
Training Techniques

Quantization: INT6 QAT
- bits: 6
- scope: all
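A minimal sketch of what the forward pass of INT6 quantization-aware training can look like: weights are snapped onto a signed 6-bit grid in the forward pass, while the backward pass would treat the rounding as identity (straight-through estimator). The symmetric per-tensor scaling used here is an assumption, not the recorded implementation.

```python
def fake_quant_int6(weights, bits=6):
    """Fake-quantize a list of floats onto a symmetric signed 6-bit grid.

    QAT sketch: the forward pass sees these snapped values; the backward
    pass would pass gradients through the rounding unchanged
    (straight-through). Symmetric per-tensor scaling is an assumption.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 positive levels for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    # snap each weight to the nearest representable 6-bit level
    return [round(w / scale) * scale for w in weights]
```

The round-trip error per weight is bounded by half a quantization step, i.e. `scale / 2`.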
Architecture
- BigramHash: bigram hashing with 4096 buckets and a 128-dimensional embedding (buckets: 4096, embedding_dim: 128)
- SmearGate: SmearGate mechanism applied
- U-Net skip connections: U-Net-style skip connections in the transformer
- Tied embeddings: input and output embeddings are tied
- MLP3x: MLP with 3x expansion (expansion_factor: 3)
- LeakyReLU(0.5)^2 activation: LeakyReLU with negative slope 0.5, squared, preserving gradient flow for negative inputs (negative_slope: 0.5)
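Two of the components above are simple enough to sketch directly. The pair-hash function for bigram bucketing is an assumption (any cheap hash of the token pair works); the activation follows its name literally: LeakyReLU with slope 0.5, then squared, so negative inputs still receive gradient.

```python
def bigram_bucket(prev_tok, cur_tok, buckets=4096):
    # Hash the (previous, current) token pair into one of 4096 buckets;
    # the bucket indexes a learned 128-dim embedding added to the stream.
    # The multiplicative hash constant here is an assumption.
    return (prev_tok * 1_000_003 + cur_tok) % buckets


def leaky_relu_squared(x, negative_slope=0.5):
    # f(x) = x^2 for x >= 0, (0.5*x)^2 for x < 0.
    # d/dx = 0.5*x for x < 0, so negative inputs keep a nonzero gradient
    # (unlike plain ReLU^2, whose gradient is exactly 0 there).
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```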
Optimizer: Muon + Adam
- Muon scope: matrices
- Adam scope: scalars
- weight_decay, momentum: not specified
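The Muon/Adam split can be expressed as a parameter-routing rule. The concrete rule below, routing by tensor rank (2-D and higher to Muon, everything else to Adam), is an assumption consistent with "matrices" vs. "scalars":

```python
def split_param_groups(named_shapes):
    """Route parameters to Muon or Adam by tensor rank.

    Assumption: "matrices" means 2-D+ tensors (linear/attention weights),
    "scalars" means 0-D/1-D tensors (gains, biases, norm scales).
    """
    muon_names, adam_names = [], []
    for name, shape in named_shapes.items():
        (muon_names if len(shape) >= 2 else adam_names).append(name)
    return muon_names, adam_names
```

The two name lists would then seed the two optimizers' parameter groups.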
Weight Averaging: EMA (decay: 0.997)
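EMA weight averaging keeps a shadow copy of the weights that smoothly lags the training trajectory; evaluation uses the shadow copy. A minimal sketch of the update with decay 0.997:

```python
def ema_update(shadow, weights, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * weights, elementwise
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

With decay 0.997 the effective averaging window is roughly 1 / (1 - 0.997) ≈ 333 steps.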
LR Schedule: auto warmdown (warmdown_fraction: 0.15)
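A warmdown schedule holds the learning rate constant and then decays it over the final fraction of training (0.15 here). The linear-to-zero shape below is an assumption; "auto" may refer to how the fraction is chosen.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_fraction=0.15):
    # Constant LR until the final `warmdown_fraction` of steps, then a
    # linear decay to zero (the linear shape is an assumption).
    warmdown_start = round(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * remaining
```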
Test-Time Training: LoRA TTT
- rank: 8
- target: attention Q and V projections
- epochs per document: 10
- learning rate: cosine decay from 0.01 to 0.0001
- mode: backward-looking (score-first)
- per-document reset: true
- last chunk excluded from training: true
- documents shorter than 512 tokens: no TTT
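The recorded parameters pin down the TTT control flow: reset the LoRA adapter per document, skip short documents, and score each chunk before training on it, so the adapter only ever sees past context. A sketch under those constraints; `score`, `train_step`, and `reset_lora` are hypothetical callables standing in for the model.

```python
import math


def cosine_lr(epoch, epochs, lr_max=0.01, lr_min=0.0001):
    # Cosine decay from 0.01 to 0.0001 across the per-document epochs.
    t = epoch / max(epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))


def score_first_ttt(chunks, score, train_step, reset_lora,
                    min_tokens=512, epochs=10):
    """Backward-looking (score-first) per-document LoRA TTT.

    Only the control flow comes from the recorded parameters; the three
    callables are hypothetical stand-ins for the model's scoring, LoRA
    training step, and adapter reset.
    """
    reset_lora()  # per-document reset: fresh adapter, no cross-contamination
    if sum(len(c) for c in chunks) < min_tokens:
        return sum(score(c) for c in chunks)  # short document: no TTT
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score(chunk)          # score BEFORE training on the chunk
        if i < len(chunks) - 1:        # the last chunk is never trained on
            for e in range(epochs):
                train_step(chunk, cosine_lr(e, epochs))
    return total
```

Scoring first guarantees the score for a chunk never benefits from training on that same chunk, which is what makes the adaptation backward-looking.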
Novel Contributions
- Backward-looking (score-first) per-document LoRA test-time training
- LoRA with rank 8 to constrain the adaptation subspace and prevent overfitting on quantized models
- Independent per-document LoRA adapters, reset between documents to avoid cross-document contamination
- INT6 quantization-aware training (QAT) applied uniformly to all weights
- Muon optimizer for matrix parameters combined with Adam for scalars
- LeakyReLU(0.5)^2 activation, preserving gradient flow for negative inputs
- U-Net-style skip connections in the transformer architecture