| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1787 | Transformer | Muon | 15.56 MB |
Training Techniques
Architecture
Transformer depth / tied embeddings / KV head count
10-layer transformer with 512-dimensional hidden size, 8 attention heads, 4 KV heads, and tied embeddings.
parameters: {"layers":10,"dimensions":512,"heads":8,"kv_heads":4}
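For concreteness, the parameter count implied by this configuration can be estimated in a few lines. The vocabulary size and the 4x MLP expansion below are assumptions for illustration only; the card specifies just layers, dimensions, heads, and kv_heads.

```python
# Rough parameter-count estimate for the card's architecture.
# vocab size and 4x MLP expansion are ASSUMPTIONS, not stated in the card.
cfg = {"layers": 10, "dimensions": 512, "heads": 8, "kv_heads": 4}
vocab = 32768                          # assumed for illustration
d = cfg["dimensions"]
head_dim = d // cfg["heads"]           # 64
kv_dim = cfg["kv_heads"] * head_dim    # 256 (grouped-query attention)

attn = d * d + 2 * d * kv_dim + d * d  # Q, K, V, O projections
mlp = 2 * d * (4 * d)                  # up and down projections, 4x expansion
per_layer = attn + mlp
total = cfg["layers"] * per_layer + vocab * d  # tied embeddings counted once

print(per_layer, total)
```

With these assumptions the attention block contributes 786,432 parameters per layer; actual totals depend on the real vocabulary and MLP width.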
weight tying
Input and output embeddings are tied; the shared matrix is kept in FP16 so that int8 quantization error does not compound through both the embedding lookup and the LM head.
parameters: null
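A minimal sketch of what weight tying means operationally: one matrix serves as both the input embedding table and (transposed) the LM head. Sizes here are illustrative, not the card's.

```python
import numpy as np

# Weight tying: the same matrix embeds tokens on the way in and produces
# logits on the way out. It is stored in FP16 here, mirroring the card's
# choice to keep the twice-used matrix out of int8 quantization.
vocab, d = 1000, 512                    # illustrative sizes
emb = (np.random.randn(vocab, d) * 0.02).astype(np.float16)

tokens = np.array([1, 5, 9])
h = emb[tokens].astype(np.float32)      # input embedding lookup
logits = h @ emb.T.astype(np.float32)   # LM head reuses the same matrix
print(logits.shape)                     # (3, 1000)
```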
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: 0.98
other_params: {"matrix_lr":0.03,"scalar_lr":0.03}
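A minimal NumPy sketch of a Muon-style step, assuming the published Muon recipe: momentum SGD on matrix parameters, with each update orthogonalized by a quintic Newton-Schulz iteration before being applied. The learning rate (matrix_lr = 0.03) and momentum (0.98) come from the card; the coefficients are those of the public Muon implementation, and everything else is illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize g (push all singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # Frobenius-normalize first
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # work with the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.03, momentum=0.98):
    """One Muon update on a matrix parameter: momentum accumulation,
    then an orthogonalized step. lr and momentum match the card."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orthogonalize(buf)
    return w, buf
```

The separate scalar_lr = 0.03 would apply to non-matrix parameters, which Muon hands to a conventional optimizer.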
LR Schedule
warmdown
parameters: {"warmdown_steps":15000,"always_decaying":true}
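One plausible reading of `always_decaying: true` is that instead of holding the learning rate constant and only decaying over the final warmdown_steps, the rate decays linearly over the whole run. A sketch under that assumption, with both behaviors for contrast:

```python
def warmdown_lr(step, total_steps, base_lr=0.03,
                warmdown_steps=15000, always_decaying=True):
    """Linear warmdown schedule. With always_decaying=True (our reading of
    the card's setting), the LR decays over the entire run, steadily
    shrinking update sizes; otherwise it is constant until the final
    warmdown_steps, then decays linearly to zero."""
    if always_decaying:
        return base_lr * (1 - step / total_steps)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```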
Regularization
gradient clipping
parameters: {"grad_clip_norm":1}
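With grad_clip_norm = 1, the global L2 norm of all gradients is capped at 1 before each step. A minimal sketch:

```python
import numpy as np

def clip_global_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm
    (max_norm = 1 matches the card's grad_clip_norm)."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total
```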
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q projections","V projections","LM head"],"chunk_size":256}
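A sketch of a rank-8 LoRA adapter on a single frozen projection, as used in test-time training here: the effective weight is W plus a low-rank update, and only the two small factors would be trained on each 256-token chunk at inference. The rank and target matrices come from the card; shapes and the alpha scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16  # alpha is an assumed scaling

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base projection
A = rng.standard_normal((r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, zero init

def lora_forward(x):
    """y = x W^T plus the low-rank correction. With B initialized to zero,
    the adapter is a no-op until test-time training updates it."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((256, d_in))            # one chunk_size=256 chunk
y = lora_forward(x)
print(y.shape)
```

At TTT time, a few gradient steps on A and B per chunk adapt the frozen Q/V projections and LM head to the local context.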
Initialization
spectral init / residual mixing
Overtone spectral embedding initialization with phase-transition residual mixing.
Quantization
int8
bits: 8
scope: per-row weights
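Per-row int8 quantization gives each weight row its own scale, so the error is bounded by that row's dynamic range rather than the whole tensor's. A minimal sketch:

```python
import numpy as np

def quantize_per_row(w):
    """Symmetric int8 quantization with one max-abs scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)    # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_per_row(w)
err = np.abs(dequantize(q, s) - w).max()        # bounded by scale / 2 per row
```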
Compression
zlib
level: null
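The quantized int8 bytes are then compressed with zlib (level unspecified in the card; zlib's default is 6). Since quantized weights have low byte-level entropy, the entropy-coding stage typically shaves off a further fraction. A roundtrip sketch with synthetic int8 data:

```python
import zlib
import numpy as np

# Synthetic int8 "weights" with a Gaussian-ish distribution, standing in
# for quantized model weights; sizes and sigma are illustrative.
rng = np.random.default_rng(0)
q = np.clip(rng.normal(0, 20, size=1 << 16), -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)                       # default compression level
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
print(len(raw), len(packed))                      # packed is smaller
```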
Novel Contributions
- 10-layer transformer with tuned hyperparameters for the 10-minute budget
- Sequence length increased to 2048 for richer context
- Always-decaying warmdown schedule that keeps shrinking weight magnitudes throughout training, reducing the quantization penalty
- Test-time training with batched LoRA adapters on Q, V projections and LM head
- Overtone spectral embedding initialization with phase-transition residual mixing
- Int8 per-row quantization combined with zlib compression
- FP16 tied embeddings to reduce quantization error compounding