PR #467 (open)

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)

val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: ~14.3MB

Training Techniques

Quantization
Mixed int5/int6: MLP weights int5, attention weights int6, embeddings fp16
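A minimal sketch of symmetric per-tensor quantization at an arbitrary bit width, matching the int5 (MLP) / int6 (attention) split described above; function names and the rounding scheme are illustrative assumptions, not the submission's code:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax             # assumes w is not all-zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# MLP weights would use bits=5, attention weights bits=6 under this scheme.
w = np.array([0.9, -0.31, 0.05, -0.62], dtype=np.float32)
q5, s5 = quantize_symmetric(w, bits=5)
w_hat = dequantize(q5, s5)
```

The reconstruction error per weight is bounded by half the quantization step, which is why the higher-sensitivity attention weights get the extra bit.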
Architecture
SmearGate: learned gate blending the embedding of token t with that of token t-1
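The SmearGate idea, a learned gate mixing each token's embedding with the previous token's, can be sketched as follows; the sigmoid gating form and all names are assumptions, not the submission's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(emb: np.ndarray, w_gate: np.ndarray, b_gate: float) -> np.ndarray:
    """Blend token t with token t-1 via a learned scalar gate per position.

    emb: (seq_len, dim) token embeddings; the first position has no
    predecessor and is left unchanged.
    """
    prev = np.roll(emb, 1, axis=0)
    prev[0] = emb[0]                          # no smear for the first token
    g = sigmoid(emb @ w_gate + b_gate)        # (seq_len,) gate in (0, 1)
    return (1.0 - g[:, None]) * emb + g[:, None] * prev

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8)).astype(np.float32)
out = smear_gate(emb, rng.standard_normal(8).astype(np.float32), 0.0)
```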
BigramHash: hashes consecutive token pairs into learned embeddings projected to the model dimension (vocab_size 10240, dim 128)
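A hedged sketch of the BigramHash lookup: each consecutive token-id pair is hashed into a 10240-entry embedding table of width 128, then projected up to the model dimension. The hash mixing constants, the projection, and the model dimension of 512 are illustrative assumptions:

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM, MODEL_DIM = 10240, 128, 512

def bigram_ids(tokens, table_size=BIGRAM_VOCAB):
    """Hash each (t-1, t) token pair to a bucket index; a multiplicative
    mix keeps the mapping cheap and deterministic."""
    ids = []
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        h = (prev * 1000003 + cur) * 2654435761 % (2 ** 32)
        ids.append(h % table_size)
    return np.array(ids, dtype=np.int64)

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BIGRAM_VOCAB, BIGRAM_DIM)) * 0.02
proj = rng.standard_normal((BIGRAM_DIM, MODEL_DIM)) * 0.02

tokens = [17, 4, 4, 920, 17, 4]
feats = bigram_table[bigram_ids(tokens)] @ proj   # (len(tokens) - 1, MODEL_DIM)
```

Identical bigrams always land in the same bucket, so frequent pairs get a stable learned feature regardless of position.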
RoPE: rotary positional embeddings with QK-Norm and q_gain
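For reference, a minimal rotary-embedding application with QK-Norm (RMS-normalizing queries and keys before the rotation); q_gain would be a learned scalar on the normalized query, assumed here as a constant:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def rope(x, pos, base=10000.0):
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    ang = pos[:, None] * freqs[None, :]                # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).standard_normal((6, 16))
q_gain = 1.0                                           # learned scalar in the real model
q_rot = rope(rms_norm(q) * q_gain, np.arange(6))
```

Because the rotation is orthogonal it preserves vector norms, so QK-Norm's scale control survives the positional encoding.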
MLP3x: transformer MLP with 3x hidden width (multiplier 3)
Tied embeddings: input and output embeddings are tied
Grouped-query attention: 8 query heads, 4 KV heads
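With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. The sharing can be sketched as a simple repeat of the K tensor before the score computation (head dimension of 32 is an assumption):

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv_heads=4):
    """q: (n_heads, seq, hd), k: (n_kv_heads, seq, hd).
    Each group of n_heads // n_kv_heads query heads reuses one KV head."""
    group = n_heads // n_kv_heads
    k_rep = np.repeat(k, group, axis=0)                # (n_heads, seq, hd)
    hd = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd)  # (n_heads, seq, seq)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 32))
k = rng.standard_normal((4, 5, 32))
scores = gqa_scores(q, k)
```

Halving the KV heads halves KV storage, which matters under a 16MB artifact budget.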
U-Net skip connections: skip connections between the encoder and decoder halves of the layer stack
Optimizer
Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02
AdamW: weight_decay 0.04
Weight Averaging
SWA: start_frac 0.35, every_steps 50
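SWA with start_frac 0.35 and every_steps 50 averages checkpoints from 35% of training onward, snapshotting every 50 steps. A running-average sketch (class and variable names assumed):

```python
import numpy as np

class SWA:
    """Running equal-weight average of parameter snapshots."""
    def __init__(self, start_frac=0.35, every_steps=50, total_steps=10000):
        self.start_step = int(start_frac * total_steps)
        self.every_steps = every_steps
        self.avg, self.count = None, 0

    def maybe_update(self, step, params):
        if step < self.start_step or step % self.every_steps != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:                                  # incremental mean update
            self.avg += (params - self.avg) / self.count

swa = SWA(total_steps=1000)                    # averaging starts at step 350
p = np.zeros(3)
for step in range(1000):
    p = p + 1.0                                # stand-in for an optimizer step
    swa.maybe_update(step, p)
```

The incremental mean keeps memory at one extra copy of the parameters rather than one per snapshot.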
Compression
zstd: level 22
Test-Time Training
LoRA TTT: rank 8, 50 epochs, learning_rate 0.001, targets Q and V projections across all 10 layers, score-first evaluation, adapters reset between documents
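The score-first LoRA TTT loop, reduced to its essentials: for each chunk, score it with the current adapters before taking a gradient step on it, so no chunk's loss ever reflects training on that same chunk, and adapters are re-initialized per document. A toy numpy sketch on a single linear projection with a squared-error stand-in for the LM loss (rank 8 and the zero-init of B follow standard LoRA practice; all names and dimensions are illustrative):

```python
import numpy as np

RANK, LR = 8, 1e-3

def lora_ttt_document(chunks, W, epochs=3, rng=None):
    """Score-first TTT on one document: fresh rank-RANK adapters A, B,
    with each chunk scored *before* the update on that chunk."""
    rng = rng or np.random.default_rng(0)
    d_out, d_in = W.shape
    A = rng.standard_normal((RANK, d_in)) * 0.01   # fresh per document
    B = np.zeros((d_out, RANK))                    # zero-init: starts at the base model
    losses = []
    for _ in range(epochs):
        for x, y in chunks:                        # x: (n, d_in), y: (n, d_out)
            pred = x @ (W + B @ A).T
            err = pred - y
            losses.append(float((err ** 2).mean()))   # score first ...
            gBA = 2.0 * err.T @ x / len(x)            # ... then adapt
            gB, gA = gBA @ A.T, B.T @ gBA
            B -= LR * gB
            A -= LR * gA
    return losses

rng = np.random.default_rng(1)
W_true = rng.standard_normal((4, 16))
W = W_true + 0.1 * rng.standard_normal((4, 16))    # slightly-off base weights
xs = [rng.standard_normal((32, 16)) for _ in range(4)]
chunks = [(x, x @ W_true.T) for x in xs]
losses = lora_ttt_document(chunks, W, epochs=3)
```

Because B starts at zero, the first recorded loss is exactly the unadapted base model's loss, which is the score-first guarantee the protocol relies on.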
Evaluation
Score-first per-chunk evaluation: chunk_size 256, context_length 2048, batch_size 32
LR Schedule
Warmdown + cosine decay: warmdown_iters 3500, ttt_cosine_epochs 50
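The TTT side of the schedule is cosine decay over the 50 adaptation epochs, starting from the base rate of 1e-3; the decay-to-zero floor is an assumption:

```python
import math

def ttt_cosine_lr(epoch: int, base_lr: float = 1e-3, total_epochs: int = 50) -> float:
    """Cosine decay from base_lr at epoch 0 toward 0 at total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

schedule = [ttt_cosine_lr(e) for e in range(50)]
```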
Initialization
OrthoInit
Orthogonal initialization for large weight matrices
Regularization
Weight decay: Muon 0.04, AdamW 0.04; grad_clip_norm 0.3; 3% magnitude pruning
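The 3% magnitude pruning listed above zeroes the smallest-magnitude 3% of weights, which also helps the zstd stage compress the artifact. A per-tensor sketch (the per-tensor granularity is an assumption):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, frac: float = 0.03) -> np.ndarray:
    """Zero out the `frac` smallest-magnitude entries of w (per tensor)."""
    k = int(frac * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

w = np.random.default_rng(0).standard_normal(1000)
pruned = magnitude_prune(w, 0.03)
```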

Novel Contributions

  • 50-epoch cosine-scheduled LoRA test-time training applied at evaluation time
  • Document-isolated LoRA adaptation with fresh adapter initialization and reset between documents
  • Score-first per-chunk protocol within each TTT epoch to avoid leakage
  • Combining multi-epoch LoRA TTT with the SOTA 10-layer Int5/Int6 BigramHash + SWA training stack
  • Using rank-8 LoRA adapters on Q and V projections across all 10 attention layers