PR #467 (open)

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)

val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: ~14.3MB

Training Techniques

Quantization
Mixed int5/int6: MLP weights int5, attention weights int6, embeddings fp16
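A minimal sketch of symmetric per-tensor quantization at an arbitrary bit width, matching the int5 (MLP) / int6 (attention) split described above; function names and the rounding scheme are illustrative assumptions, not the submission's code:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax             # assumes w is not all-zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# MLP weights would use bits=5, attention weights bits=6 under this scheme.
w = np.array([0.9, -0.31, 0.05, -0.62], dtype=np.float32)
q5, s5 = quantize_symmetric(w, bits=5)
w_hat = dequantize(q5, s5)
```

The reconstruction error per weight is bounded by half the quantization step, which is why the higher-sensitivity attention weights get the extra bit.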
Architecture
SmearGate: learned gate blending the embedding of token t with that of token t-1
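The SmearGate idea, a learned gate mixing each token's embedding with the previous token's, can be sketched as follows; the sigmoid gating form and all names are assumptions, not the submission's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(emb: np.ndarray, w_gate: np.ndarray, b_gate: float) -> np.ndarray:
    """Blend token t with token t-1 via a learned scalar gate per position.

    emb: (seq_len, dim) token embeddings; the first position has no
    predecessor and is left unchanged.
    """
    prev = np.roll(emb, 1, axis=0)
    prev[0] = emb[0]                          # no smear for the first token
    g = sigmoid(emb @ w_gate + b_gate)        # (seq_len,) gate in (0, 1)
    return (1.0 - g[:, None]) * emb + g[:, None] * prev

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8)).astype(np.float32)
out = smear_gate(emb, rng.standard_normal(8).astype(np.float32), 0.0)
```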
BigramHash: hashes consecutive token pairs into learned embeddings projected to the model dimension (vocab_size 10240, dim 128)
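A hedged sketch of the BigramHash lookup: each consecutive token-id pair is hashed into a 10240-entry embedding table of width 128, then projected up to the model dimension. The hash mixing constants, the projection, and the model dimension of 512 are illustrative assumptions:

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM, MODEL_DIM = 10240, 128, 512

def bigram_ids(tokens, table_size=BIGRAM_VOCAB):
    """Hash each (t-1, t) token pair to a bucket index; a multiplicative
    mix keeps the mapping cheap and deterministic."""
    ids = []
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        h = (prev * 1000003 + cur) * 2654435761 % (2 ** 32)
        ids.append(h % table_size)
    return np.array(ids, dtype=np.int64)

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BIGRAM_VOCAB, BIGRAM_DIM)) * 0.02
proj = rng.standard_normal((BIGRAM_DIM, MODEL_DIM)) * 0.02

tokens = [17, 4, 4, 920, 17, 4]
feats = bigram_table[bigram_ids(tokens)] @ proj   # (len(tokens) - 1, MODEL_DIM)
```

Identical bigrams always land in the same bucket, so frequent pairs get a stable learned feature regardless of position.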
RoPE: rotary positional embeddings with QK-Norm and q_gain
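For reference, a minimal rotary-embedding application with QK-Norm (RMS-normalizing queries and keys before the rotation); q_gain would be a learned scalar on the normalized query, assumed here as a constant:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def rope(x, pos, base=10000.0):
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    ang = pos[:, None] * freqs[None, :]                # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).standard_normal((6, 16))
q_gain = 1.0                                           # learned scalar in the real model
q_rot = rope(rms_norm(q) * q_gain, np.arange(6))
```

Because the rotation is orthogonal it preserves vector norms, so QK-Norm's scale control survives the positional encoding.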
MLP3x: transformer MLP with 3x hidden width (multiplier 3)
Tied embeddings: input and output embeddings are tied
Grouped-query attention: 8 query heads, 4 KV heads
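With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. The sharing can be sketched as a simple repeat of the K tensor before the score computation (head dimension of 32 is an assumption):

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv_heads=4):
    """q: (n_heads, seq, hd), k: (n_kv_heads, seq, hd).
    Each group of n_heads // n_kv_heads query heads reuses one KV head."""
    group = n_heads // n_kv_heads
    k_rep = np.repeat(k, group, axis=0)                # (n_heads, seq, hd)
    hd = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd)  # (n_heads, seq, seq)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 32))
k = rng.standard_normal((4, 5, 32))
scores = gqa_scores(q, k)
```

Halving the KV heads halves KV storage, which matters under a 16MB artifact budget.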
U-Net skip connections: skip connections between the encoder and decoder halves of the layer stack
Optimizer
Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02
AdamW: weight_decay 0.04
Weight Averaging
SWA: start_frac 0.35, every_steps 50
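SWA with start_frac 0.35 and every_steps 50 averages checkpoints from 35% of training onward, snapshotting every 50 steps. A running-average sketch (class and variable names assumed):

```python
import numpy as np

class SWA:
    """Running equal-weight average of parameter snapshots."""
    def __init__(self, start_frac=0.35, every_steps=50, total_steps=10000):
        self.start_step = int(start_frac * total_steps)
        self.every_steps = every_steps
        self.avg, self.count = None, 0

    def maybe_update(self, step, params):
        if step < self.start_step or step % self.every_steps != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:                                  # incremental mean update
            self.avg += (params - self.avg) / self.count

swa = SWA(total_steps=1000)                    # averaging starts at step 350
p = np.zeros(3)
for step in range(1000):
    p = p + 1.0                                # stand-in for an optimizer step
    swa.maybe_update(step, p)
```

The incremental mean keeps memory at one extra copy of the parameters rather than one per snapshot.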
Compression
zstd: level 22
Test-Time Training
LoRA TTT: rank 8, 50 epochs, learning_rate 0.001, targets Q and V projections across all 10 layers, score-first evaluation, adapters reset between documents
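The score-first LoRA TTT loop, reduced to its essentials: for each chunk, score it with the current adapters before taking a gradient step on it, so no chunk's loss ever reflects training on that same chunk, and adapters are re-initialized per document. A toy numpy sketch on a single linear projection with a squared-error stand-in for the LM loss (rank 8 and the zero-init of B follow standard LoRA practice; all names and dimensions are illustrative):

```python
import numpy as np

RANK, LR = 8, 1e-3

def lora_ttt_document(chunks, W, epochs=3, rng=None):
    """Score-first TTT on one document: fresh rank-RANK adapters A, B,
    with each chunk scored *before* the update on that chunk."""
    rng = rng or np.random.default_rng(0)
    d_out, d_in = W.shape
    A = rng.standard_normal((RANK, d_in)) * 0.01   # fresh per document
    B = np.zeros((d_out, RANK))                    # zero-init: starts at the base model
    losses = []
    for _ in range(epochs):
        for x, y in chunks:                        # x: (n, d_in), y: (n, d_out)
            pred = x @ (W + B @ A).T
            err = pred - y
            losses.append(float((err ** 2).mean()))   # score first ...
            gBA = 2.0 * err.T @ x / len(x)            # ... then adapt
            gB, gA = gBA @ A.T, B.T @ gBA
            B -= LR * gB
            A -= LR * gA
    return losses

rng = np.random.default_rng(1)
W_true = rng.standard_normal((4, 16))
W = W_true + 0.1 * rng.standard_normal((4, 16))    # slightly-off base weights
xs = [rng.standard_normal((32, 16)) for _ in range(4)]
chunks = [(x, x @ W_true.T) for x in xs]
losses = lora_ttt_document(chunks, W, epochs=3)
```

Because B starts at zero, the first recorded loss is exactly the unadapted base model's loss, which is the score-first guarantee the protocol relies on.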
Evaluation
Score-first per-chunk evaluation: chunk_size 256, context_length 2048, batch_size 32
LR Schedule
Warmdown + cosine decay: warmdown_iters 3500, ttt_cosine_epochs 50
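The TTT side of the schedule is cosine decay over the 50 adaptation epochs, starting from the base rate of 1e-3; the decay-to-zero floor is an assumption:

```python
import math

def ttt_cosine_lr(epoch: int, base_lr: float = 1e-3, total_epochs: int = 50) -> float:
    """Cosine decay from base_lr at epoch 0 toward 0 at total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

schedule = [ttt_cosine_lr(e) for e in range(50)]
```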
Initialization
OrthoInit
Orthogonal initialization for large weight matrices
Regularization
Weight decay: Muon 0.04, AdamW 0.04; grad_clip_norm 0.3; 3% magnitude pruning
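The 3% magnitude pruning listed above zeroes the smallest-magnitude 3% of weights, which also helps the zstd stage compress the artifact. A per-tensor sketch (the per-tensor granularity is an assumption):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, frac: float = 0.03) -> np.ndarray:
    """Zero out the `frac` smallest-magnitude entries of w (per tensor)."""
    k = int(frac * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

w = np.random.default_rng(0).standard_normal(1000)
pruned = magnitude_prune(w, 0.03)
```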

Novel Contributions

  • 50-epoch cosine-scheduled LoRA test-time training applied at evaluation time
  • Document-isolated LoRA adaptation with fresh adapter initialization and reset between documents
  • Score-first per-chunk protocol within each TTT epoch to avoid leakage
  • Combining multi-epoch LoRA TTT with the SOTA 10-layer Int5/Int6 BigramHash + SWA training stack
  • Using rank-8 LoRA adapters on Q and V projections across all 10 attention layers