PR #713 (open)

Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)

by hypery11
val_bpb: 1.1180
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB

Training Techniques

Architecture
MLP3x
10-layer transformer with 3x MLP blocks using LeakyReLU(0.5)^2 activation.
parameters: {"layers":10,"dim":512}
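A minimal sketch of the MLP3x activation, assuming "3x" refers to the MLP hidden-width expansion factor and "LeakyReLU(0.5)^2" means the LeakyReLU output (negative slope 0.5) is squared; all names here are illustrative, not the record's actual code:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: (x if x > 0 else slope * x) ** 2."""
    y = x if x > 0 else slope * x
    return y * y

def mlp3x_hidden_width(dim: int = 512, expansion: int = 3) -> int:
    """Hidden width of the MLP block under an assumed 3x expansion (1536 for dim=512)."""
    return expansion * dim
```

Note the squared activation is non-negative but still asymmetric: negative inputs are damped by slope² = 0.25 before squaring.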
BigramHash
Added a BigramHash component with bucketed hashing and learned projection.
parameters: {"buckets":10240,"dim":128}
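A hedged sketch of a BigramHash lookup under the listed sizes (10240 buckets, dim 128): hash the (previous, current) token pair into a bucket, then look up a learned 128-dim vector. The mixing constant and table layout are assumptions for illustration:

```python
import random

BUCKETS, DIM = 10240, 128
random.seed(0)
# Stand-in for the learned projection table; trained jointly in the real model.
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    """Deterministic bucket index for a token bigram (mixing constant assumed)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % buckets

def bigram_embed(prev_tok: int, cur_tok: int):
    """Learned vector for the hashed bigram; collisions share a slot by design."""
    return table[bigram_bucket(prev_tok, cur_tok)]
```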
SmearGate
Uses SmearGate and value residual connections with per-head gated attention.
parameters: null
Weight tying
Tied input and output embeddings with a 1024-token vocabulary.
parameters: {"vocab_size":1024}
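Weight tying means the input embedding matrix and the LM head share storage, so logits are just dot products against the embedding rows. A minimal sketch with the record's 1024-token vocab (dim 512 assumed from the architecture entry):

```python
import random

VOCAB, DIM = 1024, 512
random.seed(0)
# Single shared matrix: used both to embed tokens and as the output projection.
E = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token: int):
    return E[token]

def logits(hidden):
    # Tied LM head: reuse E rather than a separate VOCAB x DIM matrix,
    # saving VOCAB * DIM parameters in the artifact.
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]
```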
Quantization
mixed int5/int6
scope: MLP int5, attention int6
Compression
zstd
level: 22
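A hedged sketch of the quantize-then-compress pipeline at the record's bit widths (int5 for MLP, int6 for attention), assuming standard symmetric per-tensor quantization. The record uses zstd at level 22, which is not in the Python stdlib, so `zlib` stands in purely for illustration:

```python
import struct
import zlib

def quantize(weights, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -0.25, 0.1, -0.031]
q, s = quantize(w, bits=5)
w_hat = dequantize(q, s)
# Pack the small integers into bytes, then entropy-code (zstd -22 in the record).
blob = zlib.compress(struct.pack(f"{len(q)}b", *q), level=9)
```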
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"learning_rate":0.02}
AdamW (hyperparameters not reported)
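A sketch of a Muon-style update with the record's settings (momentum 0.99, lr 0.02, weight decay 0.04): accumulate the gradient into a momentum buffer, approximately orthogonalize it with a quintic Newton-Schulz iteration, then apply the step with decoupled weight decay. The Newton-Schulz coefficients are the commonly published ones; the tiny pure-Python linear algebra is illustration only, not the record's implementation:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G via X <- aX + b(XX^T)X + c(XX^T)^2 X."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = [[b * v + c * w for v, w in zip(r1, r2)]
             for r1, r2 in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, wd=0.04):
    """One Muon-style step: momentum, orthogonalize, decoupled weight decay."""
    for i, row in enumerate(grad):
        for j, g in enumerate(row):
            buf[i][j] = momentum * buf[i][j] + g
    O = newton_schulz5(buf)
    return [[(1 - lr * wd) * w - lr * o for w, o in zip(rw, ro)]
            for rw, ro in zip(W, O)]
```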
Weight Averaging
EMA
parameters: {"decay":0.995}
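EMA weight averaging at the record's decay 0.995 is a one-liner per step: the shadow copy tracks the weights, and the shadow (not the raw weights) is what gets evaluated. A minimal sketch:

```python
def ema_update(ema, weights, decay=0.995):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for step in range(1000):
    weights = [1.0, 2.0]          # pretend training has converged here
    ema = ema_update(ema, weights)
```

With decay 0.995 the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 steps.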
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"batch_size_docs":64,"chunk_length":256,"epochs":3,"targets":["Q","V","LM head"]}
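A hedged sketch of the rank-8 LoRA adapter used for TTT on Q, V, and the LM head: the frozen base weight W gets a low-rank additive update B @ A. The init (A random, B zero) follows standard LoRA so the adapter starts as a no-op; shapes and scaling here are illustrative assumptions:

```python
import random

def lora_init(d_out, d_in, rank=8, seed=0):
    """Standard LoRA init: A small random, B zero, so the delta starts at 0."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_apply(W, A, B, x):
    """y = W x + B (A x); the rank-r path never materializes a full matrix."""
    base = [sum(w * v for w, v in zip(row, x)) for row in W]
    ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    delta = [sum(b * v for b, v in zip(row, ax)) for row in B]
    return [u + v for u, v in zip(base, delta)]
```

Only A and B receive gradients during TTT, so each document's adaptation touches 2 * rank * dim parameters per target matrix rather than the full weights.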
Regularization
weight decay
parameters: {"value":0.04}
Other
Per-document batched test-time training with fresh adapter initialization and optimizer reset for each document; documents shorter than 512 tokens are scored without TTT.
parameters: {"short_doc_threshold":512}
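The per-document protocol above can be sketched as a control-flow skeleton, assuming the listed hyperparameters (batch of 64 docs, 256-token chunks, 3 epochs, 512-token threshold); `score_document` is a placeholder for the real bits-per-byte evaluation:

```python
SHORT_DOC_THRESHOLD = 512
BATCH_SIZE_DOCS, CHUNK_LENGTH, EPOCHS = 64, 256, 3

def score_document(doc, adapter=None):
    """Placeholder bits-per-byte score; real code runs the model."""
    return 0.9 if adapter is not None else 1.0

def ttt_score(docs):
    scores = []
    for start in range(0, len(docs), BATCH_SIZE_DOCS):
        batch = docs[start:start + BATCH_SIZE_DOCS]
        for doc in batch:                    # in practice: one batched pass
            if len(doc) < SHORT_DOC_THRESHOLD:
                scores.append(score_document(doc))       # short doc: no TTT
                continue
            adapter = ["fresh-lora"]         # fresh init + optimizer reset
            for _ in range(EPOCHS):
                pass                         # adapt on CHUNK_LENGTH-token chunks
            scores.append(score_document(doc, adapter))  # final epoch only
    return scores
```

The key properties the record describes are visible here: no state leaks between documents, short documents bypass adaptation, and only the final-epoch adapter is scored.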

Novel Contributions

  • 10-layer transformer with custom architectural additions (MLP3x blocks, BigramHash, SmearGate, tied embeddings)
  • Per-document batched LoRA test-time training
  • 64 documents processed in parallel during TTT
  • Mixed int5/int6 quantization with zstd-22 compression
  • EMA weight averaging and Muon optimizer training
  • Validation scored on the final TTT epoch only