PR #525 (open)

Non-record: 10L + Batched LoRA TTT (val_bpb=1.1160)

by hypery11
val_bpb: 1.1160
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.75 MB

Training Techniques

  • Quantization: mixed int5/int6
  • Architecture: MLP3x (3x MLP expansion with improved activations)
  • Architecture: GQA (8 attention heads / 4 KV heads)
  • Architecture: tied embeddings
  • Architecture: U-Net style skip connections
  • Optimizer: Muon + AdamW
  • Weight Averaging: EMA
  • Test-Time Training: LoRA TTT (rank 8, learning rate 0.01, scope Q/V/LM-head on all layers, batch size 64, per-document reset, 256-token chunks, 3 epochs, scored on final epoch)
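The per-document LoRA TTT loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's implementation: the helper name `ttt_per_document`, the toy least-squares loss, and the initialization scales are hypothetical. Only the rank-8 adapters, learning rate 0.01, 3 epochs, and per-document reset come from the listing; the real run applies this to the Q/V/LM-head projections of a transformer, with 64 documents batched in parallel over 256-token chunks.

```python
import numpy as np

def ttt_per_document(W, docs, rank=8, lr=0.01, epochs=3, seed=0):
    """Minimal per-document LoRA test-time-training sketch (hypothetical
    helper, not the PR's code). The LoRA factors A, B are freshly
    initialized for every document ("per-document reset"), adapted for
    `epochs` passes, and the final epoch's weights are what scoring
    would use. A toy least-squares loss stands in for the LM loss."""
    rng = np.random.default_rng(seed)
    adapted = []
    for x, y in docs:                                  # one (input, target) per document
        A = rng.normal(0.0, 0.01, (rank, W.shape[1]))  # reset: fresh A each document
        B = np.zeros((W.shape[0], rank))               # standard LoRA init: B = 0
        for _ in range(epochs):
            err = (W + B @ A) @ x - y                  # residual under adapted weights
            # gradient steps on 0.5*||err||^2 w.r.t. the low-rank factors
            B -= lr * np.outer(err, A @ x)
            A -= lr * np.outer(B.T @ err, x)
        adapted.append(W + B @ A)                      # score with final-epoch weights
    return adapted
```

"Score on final epoch" corresponds to returning the weights only after the last adaptation pass, as done here; the base weights W are never modified, so the reset between documents is free.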

Novel Contributions

  • Batched per-document LoRA test-time training (TTT) with rank-8 adapters on Q/V/LM-head across all layers
  • Mixed int5/int6 quantization combined with zstd-22 compression
  • Use of Muon optimizer combined with AdamW and EMA weight averaging
  • Architecture modifications: 10 layers, model dimension 512, 8/4 GQA heads, 3x MLP expansion, U-Net skip connections, and tied embeddings
  • Per-document reset during LoRA TTT with 64 documents batched in parallel and 256-token chunks
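The quantize-then-compress artifact path can be sketched as follows. The helper names are hypothetical, and stdlib zlib (level 9) stands in for the zstd-22 stage, since zstd has no Python stdlib binding; the real packer would also bit-pack the 5- and 6-bit codes tightly rather than storing one code per byte as done here.

```python
import zlib
import numpy as np

def quantize_symmetric(w, bits):
    # symmetric uniform quantization to signed `bits`-bit codes
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.round(w / scale).astype(np.int8)  # codes fall in [-qmax, qmax]
    return q, scale

def dequantize(q, scale):
    # reconstruct float weights; error is at most scale / 2 per entry
    return q.astype(np.float32) * scale

def compress_artifact(tensors, bits_per_tensor):
    """Mixed-precision artifact sketch (hypothetical helpers, not the
    PR's packer). Each tensor is quantized at its own width (5 or 6
    bits in the PR), serialized, and entropy-coded."""
    payload = b"".join(
        quantize_symmetric(w, b)[0].tobytes()
        for w, b in zip(tensors, bits_per_tensor)
    )
    return zlib.compress(payload, 9)
```

Mixing widths per tensor lets sensitive tensors (e.g. embeddings) keep 6 bits while others drop to 5; the final general-purpose compression pass then squeezes out the remaining redundancy in the code stream.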
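The listing gives no parameters for the EMA weight averaging, so the decay below is a placeholder assumption. A weight EMA is typically updated once after every optimizer step:

```python
def ema_update(avg, params, decay=0.999):
    # exponential moving average of the weights, applied after each
    # optimizer step:  avg <- decay * avg + (1 - decay) * params
    # (decay 0.999 is an assumed placeholder; the PR does not state one)
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

At evaluation time the averaged weights replace the raw training weights; the listing does not say which decay or schedule the PR uses.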