| Metric | Value |
| --- | --- |
| val_bpb | 1.1160 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact Size | 15.75 MB |
## Training Techniques

### Quantization
**Mixed int5/int6**: MLP weights quantized to int5, attention weights to int6.
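The mixed-precision scheme can be illustrated with a minimal symmetric per-tensor quantizer. This is a sketch only: the function names, rounding, and clamping details are assumptions, not the submission's actual quantization code.

```python
# Sketch of symmetric n-bit quantization (hypothetical helper, not the
# submission's implementation). Signed range: [-2**(bits-1), 2**(bits-1) - 1].

def quantize(weights, bits):
    """Map floats to signed integers at the given bit width, plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]

# MLP weights at int5, attention weights at int6, as in the recipe above.
mlp_q, mlp_scale = quantize([0.31, -0.8, 0.05, 0.47], bits=5)
attn_q, attn_scale = quantize([0.31, -0.8, 0.05, 0.47], bits=6)
```

Lower bit widths shrink the artifact at the cost of coarser weight resolution, which is presumably why the less sensitive MLP weights get the narrower int5 format.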
### Architecture
**BigramHash**: hashes token bigrams into 10240 buckets of 128-dimensional embeddings (`buckets: 10240`, `embedding_dim: 128`).
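The bucket-lookup step can be sketched as follows; the specific hash function and mixing constant are illustrative assumptions, since only the bucket count and embedding dimension are given.

```python
# Hypothetical sketch of BigramHash bucket lookup: each (previous, current)
# token pair is hashed into one of 10240 rows of a 10240 x 128 embedding table.

BUCKETS = 10240
EMBEDDING_DIM = 128

def bigram_bucket(prev_token: int, token: int) -> int:
    """Deterministically map a token bigram to an embedding-table row index."""
    # Simple multiplicative mix; the real hash may differ.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    return h % BUCKETS

tokens = [17, 4, 99, 4]
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

The hashing trick keeps the bigram table at a fixed size (here 10240 x 128 parameters) regardless of vocabulary size, at the cost of occasional bucket collisions.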
**SmearGate**: gating mechanism applied in the model (parameters not specified).

**Value residual**: residual connection on the value vectors.

**Gated attention**: attention mechanism with gating.
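Since these gating techniques are listed without parameters, only the generic pattern can be shown: a learned sigmoid gate modulating an output elementwise. Everything below is an assumption about the general shape, not the submission's design.

```python
import math

# Generic gating pattern (hypothetical): out = sigmoid(gate_logits) * value.
# The actual SmearGate / gated-attention formulations are not specified.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_output(attn_out, gate_logits):
    """Apply an elementwise sigmoid gate to an attention output vector."""
    return [o * sigmoid(g) for o, g in zip(attn_out, gate_logits)]

out = gated_output([0.5, -1.0, 2.0], [10.0, -10.0, 0.0])
# Large positive logit passes the value through; large negative suppresses it.
```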
**MLP3x with LeakyReLU(0.5)^2**: three-layer MLP with a squared LeakyReLU activation (negative slope 0.5).
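One plausible reading of "LeakyReLU(0.5)^2" is the LeakyReLU output squared, analogous to the squared-ReLU activations used in fast-training transformer recipes; the sketch below assumes that reading.

```python
# Assumed interpretation of the LeakyReLU(0.5)^2 activation:
# square the output of a LeakyReLU with negative slope 0.5.

def leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    return x if x >= 0 else negative_slope * x

def act(x: float) -> float:
    """Squared LeakyReLU(0.5): y = LeakyReLU_0.5(x) ** 2."""
    y = leaky_relu(x)
    return y * y
```

Unlike plain ReLU^2, the leaky variant keeps a nonzero gradient for negative inputs before squaring.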
**Weight tying**: tied input/output embeddings.

**U-Net skip connections**: skip connections inspired by the U-Net architecture.
### Optimizer
**Muon**: learning rate 0.02, momentum 0.99, weight decay 0.04.

**AdamW**: hyperparameters not specified.
### Weight Averaging
**EMA**: decay 0.995.
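The EMA update with decay 0.995 is standard; this sketch uses a plain dict as a stand-in for the model's weight tensors.

```python
# Exponential moving average of weights with decay 0.995, as listed above.
# A dict of floats stands in for the model's parameter tensors.

DECAY = 0.995

def ema_update(ema_weights, weights, decay=DECAY):
    """ema <- decay * ema + (1 - decay) * current, elementwise."""
    return {k: decay * ema_weights[k] + (1 - decay) * weights[k]
            for k in weights}

ema = {"w": 0.0}
for step_weights in ({"w": 1.0}, {"w": 1.0}):
    ema = ema_update(ema, step_weights)
```

The averaged weights, not the raw training weights, are typically what gets evaluated and shipped, since the EMA smooths out late-training noise.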
### Compression
**zstd**: level 22 (the maximum compression level).
### Test-Time Training
**LoRA TTT**: rank 8, learning rate 0.01, applied to the Q and V projections plus the LM head across all layers. 64 documents are batched in parallel with a per-document reset; the optimizer is Adam with betas (0.9, 0.95); chunks of 256 tokens; 3 epochs, with scoring on the final epoch only; documents are split at BOS boundaries.
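The per-document adapter mechanics can be sketched in a few lines: a rank-r update B @ A is added to a frozen weight W and re-initialized (along with the optimizer state) for every document. The helper names and the tiny dimensions below are illustrative; the recipe itself uses rank 8 with Adam across 64 parallel documents.

```python
import random

# Hypothetical, simplified sketch of per-document LoRA test-time training.
# W stays frozen; only the low-rank factors A and B would be trained.

RANK = 2   # rank 8 in the actual recipe; 2 keeps this demo tiny
DIM = 4

def init_lora(dim, rank):
    """Fresh adapter per document: A ~ small random, B = 0 (so delta = 0)."""
    a = [[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(dim)]
    return a, b

def effective_weight(w, a, b):
    """W_eff = W + B @ A (plain-Python matrix multiply)."""
    delta = [[sum(b[i][k] * a[k][j] for k in range(len(a)))
              for j in range(len(a[0]))] for i in range(len(b))]
    return [[w[i][j] + delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
for _document in range(3):
    a, b = init_lora(DIM, RANK)        # per-document adapter (and optimizer) reset
    w_eff = effective_weight(w, a, b)  # equals W at init, since B = 0
```

Initializing B to zero makes the adapted model exactly match the base model at the start of each document, so test-time training can only move away from the base predictions as evidence accumulates.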
## Novel Contributions
- Batched per-document LoRA test-time training: rank-8 LoRA on the Q/V projections and LM head across all layers, with 64 documents processed in parallel and fresh adapter initialization plus an optimizer reset for each document
- Mixed int5 (MLP) and int6 (attention) quantization combined with zstd level-22 compression of the artifact
- Architecture modifications: BigramHash, SmearGate, value residual, gated attention, U-Net skip connections, and a three-layer MLP with squared LeakyReLU(0.5) activation
- EMA weight averaging with decay 0.995
- Efficient training with the Muon optimizer combined with AdamW