PR #550

open

11L INT6 + Backward-Looking Per-Document LoRA TTT

by haimianbaobao007
val_bpb
1.1890
Architecture
Transformer
Optimizer
Muon
Artifact Size
11MB

Training Techniques

Quantization
INT6 QAT
bits: 6
scope: all
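A minimal sketch of the fake-quantization step at the core of INT6 QAT. The function name and the symmetric per-tensor scaling scheme are assumptions, not taken from the PR; in actual QAT the backward pass would route gradients straight through the rounding.

```python
QMIN, QMAX = -32, 31  # signed 6-bit integer range

def fake_quant_int6(weights):
    # Symmetric per-tensor scale: the largest |w| maps to the top of the range.
    m = max((abs(w) for w in weights), default=0.0) or 1.0
    scale = m / QMAX
    # Quantize-dequantize: the forward pass only sees INT6-representable
    # values; a straight-through estimator (not shown) handles the backward.
    return [max(QMIN, min(QMAX, round(w / scale))) * scale for w in weights]
```

With `scope: all`, every weight tensor would pass through this round trip during training, so the model learns weights that survive 6-bit rounding.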
Architecture
BigramHash
Bigram hashing with 4096 buckets and a 128-dimensional embedding
parameters: {"buckets":4096,"embedding_dim":128}
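A sketch of how bigram hashing could map token pairs to the 4096 buckets; the hash mixing constants and the choice of 0 as the "previous token" at position 0 are assumptions for illustration.

```python
BUCKETS = 4096  # from the PR's parameters
EMB_DIM = 128   # each bucket would index a learned 128-dim embedding row

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Mix the (previous, current) token pair into one bucket
    # (multiplicative hash; constants are illustrative, not from the PR).
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h >> 16) % BUCKETS

def bigram_buckets(tokens):
    prev = 0  # assumed padding value for the first position
    out = []
    for t in tokens:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```

Hashing collapses the full vocab-squared bigram table into 4096 rows, so collisions are accepted in exchange for a tiny parameter footprint.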
SmearGate
SmearGate mechanism applied
parameters: null
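The PR lists no parameters for SmearGate, so the following is one plausible reading: each position's embedding is blended with its predecessor's through a learned sigmoid gate. The function name, the scalar gate, and the additive blend are all assumptions.

```python
import math

def smear_gate(xs, gate_logit=0.0):
    # Hypothetical SmearGate: "smear" each position toward the previous one
    # by a learned gate g in (0, 1). gate_logit would be a trained parameter.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [xs[0]]  # the first position has no predecessor to smear in
    for prev, cur in zip(xs, xs[1:]):
        out.append(cur + g * prev)
    return out
```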
U-Net skip connections
U-Net style skip connections in the transformer
parameters: null
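U-Net style skips across a transformer stack typically mean the first half of the blocks push their activations onto a stack and the second half pop and re-add them. A control-flow sketch, with the 0.5 mixing weight and the symmetric half/half split assumed:

```python
def unet_forward(x, layers, mix=0.5):
    # First half of the blocks: run and remember each output.
    # Second half: pop the matching early activation and add it back,
    # scaled by an (assumed) mixing weight before running the block.
    half = len(layers) // 2
    stack = []
    for i, layer in enumerate(layers):
        if i < half:
            x = layer(x)
            stack.append(x)
        else:
            x = layer(x + mix * stack.pop())
    return x
```

The long skips give late blocks direct access to early, less-processed representations, which tends to ease optimization in deep stacks.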
tied embeddings
Input and output embeddings are tied
parameters: null
MLP3x
MLP with 3x expansion
parameters: {"expansion_factor":3}
LeakyReLU(0.5)^2 activation
LeakyReLU with negative slope 0.5, then squared, preserving gradient flow for negative inputs
parameters: {"negative_slope":0.5}
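The activation itself is simple enough to state exactly: LeakyReLU with slope 0.5, then squared. Unlike plain ReLU^2, its derivative is nonzero for negative pre-activations, so those units keep receiving gradient.

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring. For x < 0 the output is
    # (0.5 * x)^2, whose derivative 0.5 * x is nonzero, so gradients
    # still flow where ReLU^2 would be flat at zero.
    y = x if x >= 0 else slope * x
    return y * y
```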
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"Muon_scope":"matrices","Adam_scope":"scalars"}
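The stated scopes (Muon for matrices, Adam for scalars) suggest routing parameters by tensor rank. A sketch of that split; the rank-based rule and the example parameter names are assumptions:

```python
def split_param_groups(named_shapes):
    # Route 2-D+ weight matrices to Muon; everything else (scalars and
    # vectors such as gains/biases) to Adam. Muon's orthogonalized update
    # is only defined for matrices, hence the split.
    muon, adam = [], []
    for name, shape in named_shapes.items():
        (muon if len(shape) >= 2 else adam).append(name)
    return muon, adam
```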
Weight Averaging
EMA
parameters: {"decay":0.997}
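The EMA update with decay 0.997 is a one-liner; evaluation would use the averaged copy while training keeps updating the raw weights. Only the decay value comes from the PR.

```python
def ema_update(avg, params, decay=0.997):
    # Exponential moving average of weights: avg <- d*avg + (1-d)*params.
    # With decay 0.997, each step blends in 0.3% of the current weights.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```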
LR Schedule
auto warmdown
parameters: {"warmdown_fraction":0.15}
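Only the warmdown fraction (0.15) is given; a common shape is constant LR followed by a linear ramp to zero over the final 15% of steps, which the sketch below assumes.

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.15):
    # Constant LR, then warm down to zero over the last 15% of training.
    # The linear decay shape is an assumption; only the fraction is stated.
    start = int(total_steps * (1.0 - warmdown_fraction))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```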
Test-Time Training
LoRA TTT
parameters: {
  "rank": 8,
  "target": "attention Q and V projections",
  "epochs_per_document": 10,
  "learning_rate_decay": "cosine decay from 0.01 to 0.0001",
  "mode": "backward-looking (score-first)",
  "per_document_reset": true,
  "last_chunk_no_train": true,
  "documents_less_than_512_tokens_no_TTT": true
}
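A control-flow sketch of the backward-looking (score-first) evaluation loop implied by these parameters: every chunk is scored before the LoRA adapters train on it, adapters reset between documents, the last chunk is never trained on, and short documents skip TTT entirely. The hook names (`score_fn`, `train_fn`, `reset_fn`) and the 512-token chunk size (inferred from the short-document cutoff) are assumptions.

```python
def evaluate_with_ttt(docs, score_fn, train_fn, reset_fn, chunk_size=512):
    # Backward-looking TTT: each chunk is scored with the CURRENT adapters
    # BEFORE the model trains on it, so no chunk's score ever benefits from
    # having seen itself.
    total_loss, total_tokens = 0.0, 0
    for doc in docs:
        reset_fn()  # fresh LoRA state per document: no cross-contamination
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
        skip_ttt = len(doc) < 512  # short documents get no TTT at all
        for i, chunk in enumerate(chunks):
            total_loss += score_fn(chunk)   # score first...
            total_tokens += len(chunk)
            last = (i == len(chunks) - 1)
            if not skip_ttt and not last:
                train_fn(chunk)             # ...then adapt on the scored chunk
    return total_loss / total_tokens
```

Training on the final chunk is skipped because nothing after it will be scored, so that gradient work could never pay off.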

Novel Contributions

  • Backward-looking (score-first) per-document LoRA test-time training
  • Use of LoRA with rank 8 to constrain adaptation subspace and prevent overfitting on quantized models
  • Per-document independent LoRA with reset between documents to avoid cross-contamination
  • INT6 quantization-aware training (QAT) applied uniformly
  • Combination of Muon optimizer for matrices and Adam for scalars
  • LeakyReLU(0.5)^2 activation to preserve gradient flow for negative inputs
  • U-Net style skip connections in transformer architecture