PR #713 (open)

Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)

by hypery11
val_bpb: 1.1180
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB

Training Techniques

Architecture
MLP3x
10-layer transformer with 3x MLP blocks using LeakyReLU(0.5)^2 activation.
parameters: {"layers":10,"dim":512}
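A minimal sketch of the MLP3x activation, assuming "3x" refers to the MLP hidden-width expansion factor and "LeakyReLU(0.5)^2" means the LeakyReLU output (negative slope 0.5) is squared; all names here are illustrative, not the record's actual code:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: (x if x > 0 else slope * x) ** 2."""
    y = x if x > 0 else slope * x
    return y * y

def mlp3x_hidden_width(dim: int = 512, expansion: int = 3) -> int:
    """Hidden width of the MLP block under an assumed 3x expansion (1536 for dim=512)."""
    return expansion * dim
```

Note the squared activation is non-negative but still asymmetric: negative inputs are damped by slope² = 0.25 before squaring.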
BigramHash
Added a BigramHash component with bucketed hashing and learned projection.
parameters: {"buckets":10240,"dim":128}
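A hedged sketch of a BigramHash lookup under the listed sizes (10240 buckets, dim 128): hash the (previous, current) token pair into a bucket, then look up a learned 128-dim vector. The mixing constant and table layout are assumptions for illustration:

```python
import random

BUCKETS, DIM = 10240, 128
random.seed(0)
# Stand-in for the learned projection table; trained jointly in the real model.
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    """Deterministic bucket index for a token bigram (mixing constant assumed)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % buckets

def bigram_embed(prev_tok: int, cur_tok: int):
    """Learned vector for the hashed bigram; collisions share a slot by design."""
    return table[bigram_bucket(prev_tok, cur_tok)]
```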
SmearGate
Uses SmearGate and value residual connections with per-head gated attention.
parameters: null
Weight tying
Tied input and output embeddings with a 1024-token vocabulary.
parameters: {"vocab_size":1024}
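Weight tying means the input embedding matrix and the LM head share storage, so logits are just dot products against the embedding rows. A minimal sketch with the record's 1024-token vocab (dim 512 assumed from the architecture entry):

```python
import random

VOCAB, DIM = 1024, 512
random.seed(0)
# Single shared matrix: used both to embed tokens and as the output projection.
E = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token: int):
    return E[token]

def logits(hidden):
    # Tied LM head: reuse E rather than a separate VOCAB x DIM matrix,
    # saving VOCAB * DIM parameters in the artifact.
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]
```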
Quantization
mixed int5/int6
scope: MLP int5, attention int6
Compression
zstd
level: 22
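A hedged sketch of the quantize-then-compress pipeline at the record's bit widths (int5 for MLP, int6 for attention), assuming standard symmetric per-tensor quantization. The record uses zstd at level 22, which is not in the Python stdlib, so `zlib` stands in purely for illustration:

```python
import struct
import zlib

def quantize(weights, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -0.25, 0.1, -0.031]
q, s = quantize(w, bits=5)
w_hat = dequantize(q, s)
# Pack the small integers into bytes, then entropy-code (zstd -22 in the record).
blob = zlib.compress(struct.pack(f"{len(q)}b", *q), level=9)
```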
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"learning_rate":0.02}
AdamW (hyperparameters not reported)
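A sketch of a Muon-style update with the record's settings (momentum 0.99, lr 0.02, weight decay 0.04): accumulate the gradient into a momentum buffer, approximately orthogonalize it with a quintic Newton-Schulz iteration, then apply the step with decoupled weight decay. The Newton-Schulz coefficients are the commonly published ones; the tiny pure-Python linear algebra is illustration only, not the record's implementation:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G via X <- aX + b(XX^T)X + c(XX^T)^2 X."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = [[b * v + c * w for v, w in zip(r1, r2)]
             for r1, r2 in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, wd=0.04):
    """One Muon-style step: momentum, orthogonalize, decoupled weight decay."""
    for i, row in enumerate(grad):
        for j, g in enumerate(row):
            buf[i][j] = momentum * buf[i][j] + g
    O = newton_schulz5(buf)
    return [[(1 - lr * wd) * w - lr * o for w, o in zip(rw, ro)]
            for rw, ro in zip(W, O)]
```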
Weight Averaging
EMA
parameters: {"decay":0.995}
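EMA weight averaging at the record's decay 0.995 is a one-liner per step: the shadow copy tracks the weights, and the shadow (not the raw weights) is what gets evaluated. A minimal sketch:

```python
def ema_update(ema, weights, decay=0.995):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for step in range(1000):
    weights = [1.0, 2.0]          # pretend training has converged here
    ema = ema_update(ema, weights)
```

With decay 0.995 the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 steps.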
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"batch_size_docs":64,"chunk_length":256,"epochs":3,"targets":["Q","V","LM head"]}
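A hedged sketch of the rank-8 LoRA adapter used for TTT on Q, V, and the LM head: the frozen base weight W gets a low-rank additive update B @ A. The init (A random, B zero) follows standard LoRA so the adapter starts as a no-op; shapes and scaling here are illustrative assumptions:

```python
import random

def lora_init(d_out, d_in, rank=8, seed=0):
    """Standard LoRA init: A small random, B zero, so the delta starts at 0."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_apply(W, A, B, x):
    """y = W x + B (A x); the rank-r path never materializes a full matrix."""
    base = [sum(w * v for w, v in zip(row, x)) for row in W]
    ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    delta = [sum(b * v for b, v in zip(row, ax)) for row in B]
    return [u + v for u, v in zip(base, delta)]
```

Only A and B receive gradients during TTT, so each document's adaptation touches 2 * rank * dim parameters per target matrix rather than the full weights.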
Regularization
weight decay
parameters: {"value":0.04}
Other
Per-document batched test-time training with fresh adapter initialization and optimizer reset for each document; documents shorter than 512 tokens are scored without TTT.
parameters: {"short_doc_threshold":512}
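The per-document protocol above can be sketched as a control-flow skeleton, assuming the listed hyperparameters (batch of 64 docs, 256-token chunks, 3 epochs, 512-token threshold); `score_document` is a placeholder for the real bits-per-byte evaluation:

```python
SHORT_DOC_THRESHOLD = 512
BATCH_SIZE_DOCS, CHUNK_LENGTH, EPOCHS = 64, 256, 3

def score_document(doc, adapter=None):
    """Placeholder bits-per-byte score; real code runs the model."""
    return 0.9 if adapter is not None else 1.0

def ttt_score(docs):
    scores = []
    for start in range(0, len(docs), BATCH_SIZE_DOCS):
        batch = docs[start:start + BATCH_SIZE_DOCS]
        for doc in batch:                    # in practice: one batched pass
            if len(doc) < SHORT_DOC_THRESHOLD:
                scores.append(score_document(doc))       # short doc: no TTT
                continue
            adapter = ["fresh-lora"]         # fresh init + optimizer reset
            for _ in range(EPOCHS):
                pass                         # adapt on CHUNK_LENGTH-token chunks
            scores.append(score_document(doc, adapter))  # final epoch only
    return scores
```

The key properties the record describes are visible here: no state leaks between documents, short documents bypass adaptation, and only the final-epoch adapter is scored.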

Novel Contributions

  • 10-layer transformer with custom architectural additions (MLP3x blocks, BigramHash, SmearGate, tied embeddings)
  • Per-document batched LoRA test-time training
  • 64 documents processed in parallel during TTT
  • Mixed int5/int6 quantization with zstd-22 compression
  • EMA weight averaging and Muon optimizer training
  • Validation scored on the final TTT epoch only