PR #525 (open)

Non-record: 10L + Batched LoRA TTT (val_bpb=1.1160)

by hypery11
val_bpb: 1.1160
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.75 MB

Training Techniques

  • Quantization: mixed int5/int6
  • Architecture: MLP3x (3x MLP expansion with improved activations)
  • Architecture: GQA (8 attention heads / 4 KV heads)
  • Architecture: tied embeddings
  • Architecture: U-Net style skip connections
  • Optimizer: Muon + AdamW
  • Weight Averaging: EMA
  • Test-Time Training: LoRA TTT (rank 8, learning rate 0.01, scope Q/V/LM-head on all layers, batch size 64, per-document reset, 256-token chunks, 3 epochs, scored on final epoch)
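The per-document LoRA TTT loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's implementation: the helper name `ttt_per_document`, the toy least-squares loss, and the initialization scales are hypothetical. Only the rank-8 adapters, learning rate 0.01, 3 epochs, and per-document reset come from the listing; the real run applies this to the Q/V/LM-head projections of a transformer, with 64 documents batched in parallel over 256-token chunks.

```python
import numpy as np

def ttt_per_document(W, docs, rank=8, lr=0.01, epochs=3, seed=0):
    """Minimal per-document LoRA test-time-training sketch (hypothetical
    helper, not the PR's code). The LoRA factors A, B are freshly
    initialized for every document ("per-document reset"), adapted for
    `epochs` passes, and the final epoch's weights are what scoring
    would use. A toy least-squares loss stands in for the LM loss."""
    rng = np.random.default_rng(seed)
    adapted = []
    for x, y in docs:                                  # one (input, target) per document
        A = rng.normal(0.0, 0.01, (rank, W.shape[1]))  # reset: fresh A each document
        B = np.zeros((W.shape[0], rank))               # standard LoRA init: B = 0
        for _ in range(epochs):
            err = (W + B @ A) @ x - y                  # residual under adapted weights
            # gradient steps on 0.5*||err||^2 w.r.t. the low-rank factors
            B -= lr * np.outer(err, A @ x)
            A -= lr * np.outer(B.T @ err, x)
        adapted.append(W + B @ A)                      # score with final-epoch weights
    return adapted
```

"Score on final epoch" corresponds to returning the weights only after the last adaptation pass, as done here; the base weights W are never modified, so the reset between documents is free.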

Novel Contributions

  • Batched per-document LoRA test-time training (TTT) with rank-8 adapters on Q/V/LM-head across all layers
  • Mixed int5/int6 quantization combined with zstd-22 compression
  • Use of Muon optimizer combined with AdamW and EMA weight averaging
  • Architecture modifications: 10 layers, model dimension 512, 8/4 GQA heads, 3x MLP expansion, U-Net skip connections, and tied embeddings
  • Per-document reset during LoRA TTT with 64 documents batched in parallel and 256-token chunks
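The quantize-then-compress artifact path can be sketched as follows. The helper names are hypothetical, and stdlib zlib (level 9) stands in for the zstd-22 stage, since zstd has no Python stdlib binding; the real packer would also bit-pack the 5- and 6-bit codes tightly rather than storing one code per byte as done here.

```python
import zlib
import numpy as np

def quantize_symmetric(w, bits):
    # symmetric uniform quantization to signed `bits`-bit codes
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.round(w / scale).astype(np.int8)  # codes fall in [-qmax, qmax]
    return q, scale

def dequantize(q, scale):
    # reconstruct float weights; error is at most scale / 2 per entry
    return q.astype(np.float32) * scale

def compress_artifact(tensors, bits_per_tensor):
    """Mixed-precision artifact sketch (hypothetical helpers, not the
    PR's packer). Each tensor is quantized at its own width (5 or 6
    bits in the PR), serialized, and entropy-coded."""
    payload = b"".join(
        quantize_symmetric(w, b)[0].tobytes()
        for w, b in zip(tensors, bits_per_tensor)
    )
    return zlib.compress(payload, 9)
```

Mixing widths per tensor lets sensitive tensors (e.g. embeddings) keep 6 bits while others drop to 5; the final general-purpose compression pass then squeezes out the remaining redundancy in the code stream.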
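The listing gives no parameters for the EMA weight averaging, so the decay below is a placeholder assumption. A weight EMA is typically updated once after every optimizer step:

```python
def ema_update(avg, params, decay=0.999):
    # exponential moving average of the weights, applied after each
    # optimizer step:  avg <- decay * avg + (1 - decay) * params
    # (decay 0.999 is an assumed placeholder; the PR does not state one)
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

At evaluation time the averaged weights replace the raw training weights; the listing does not say which decay or schedule the PR uses.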