PR #600
Non-record: TTT-LoRA Base — HumanAI Convention (val_bpb=1.2364)
Status: open
by humanaiconvention
val_bpb: 1.2364
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15.7 MB
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":128,"learning_rate":null,"chunk_size":64,"adam_steps_per_chunk":4,"batch_size":64,"eval_cap_seconds":480}
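A rough sketch of the per-document adaptation schedule implied by the listed parameters (chunk_size=64, adam_steps_per_chunk=4). The function name and structure are hypothetical; the PR's actual training loop is not shown here.

```python
# Hypothetical per-document TTT schedule: before predicting each chunk,
# the fresh LoRA adapter takes a fixed number of Adam steps on all
# preceding chunks of the same document.

CHUNK_SIZE = 64            # from the listed parameters
ADAM_STEPS_PER_CHUNK = 4   # from the listed parameters

def ttt_schedule(doc_len):
    """Yield (chunk_to_predict, context_chunks, optimizer_steps)."""
    n_chunks = (doc_len + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(n_chunks):
        context = list(range(i))                        # chunks seen so far
        steps = ADAM_STEPS_PER_CHUNK if context else 0  # chunk 0 has no context
        yield i, context, steps
```

For a 200-token document this yields four chunks, with the adapter untrained before chunk 0 and trained on chunks 0–2 before chunk 3.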
Architecture
SmearGate
A learnable residual-mixing gate in each transformer block that interpolates between the full residual stream and the full hidden state
parameters: null
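A minimal sketch of what such a gate could look like, assuming a scalar sigmoid-squashed gate per block; the PR does not specify the parameterisation, so `logit` and the update form are assumptions.

```python
import math

# Hypothetical SmearGate: a learnable scalar logit, squashed through a
# sigmoid, mixes the block's residual stream with its hidden state.
# logit -> +inf gives the full hidden state, logit -> -inf the full residual.

def smear_gate(residual, hidden, logit):
    g = 1.0 / (1.0 + math.exp(-logit))  # gate value in (0, 1)
    return [g * h + (1.0 - g) * r for r, h in zip(residual, hidden)]
```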
Orthogonal initialisation
All matrix parameters initialised orthogonally to improve gradient flow and training stability
parameters: null
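One standard way to initialise a matrix orthogonally (the PR does not show its exact method): QR-decompose a random Gaussian matrix and keep Q, sign-corrected so the result is uniformly distributed over orthogonal matrices.

```python
import numpy as np

def orthogonal_init(rows, cols, rng=None):
    """Return a (rows, cols) matrix with orthonormal columns (rows >= cols)."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)            # reduced QR: q is (rows, cols)
    q *= np.sign(np.diag(r))          # fix column signs for uniformity
    return q
```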
Bigram hash embeddings
A 2048-bucket bigram hash table whose embeddings are added to the token embeddings, providing cheap local context without extra counted parameters
parameters: {"buckets":2048}
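A sketch of the lookup this implies, with the bucket count taken from the listed parameters; the hash function itself is an assumption, and the real implementation may differ.

```python
# Hypothetical bigram hash embedding: each (prev, cur) token pair hashes
# into one of 2048 buckets, and that bucket's embedding is added to the
# current token's embedding.

BUCKETS = 2048  # from the listed parameters

def bigram_bucket(prev_tok, cur_tok):
    # Simple multiplicative hash; an assumption, not the PR's scheme.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def add_bigram_embeddings(token_ids, tok_emb, bigram_emb):
    out = []
    for i, t in enumerate(token_ids):
        e = list(tok_emb[t])
        if i > 0:  # the first token has no bigram context
            b = bigram_bucket(token_ids[i - 1], t)
            e = [x + y for x, y in zip(e, bigram_emb[b])]
        out.append(e)
    return out
```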
GQA (Grouped-Query Attention)
8 query heads and 4 KV heads, reducing the KV cache and allowing higher batch throughput during TTT evaluation
parameters: {"query_heads":8,"kv_heads":4}
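A minimal grouped-query attention sketch with the listed head counts: each KV head is shared by two query heads, so the KV cache is half that of standard multi-head attention. Masking and projections are omitted; this is an illustration, not the PR's code.

```python
import numpy as np

Q_HEADS, KV_HEADS = 8, 4          # from the listed parameters
GROUP = Q_HEADS // KV_HEADS       # query heads per KV head

def gqa(q, k, v):
    """q: (Q_HEADS, T, d); k, v: (KV_HEADS, T, d) -> (Q_HEADS, T, d)."""
    k = np.repeat(k, GROUP, axis=0)   # share each KV head across its group
    v = np.repeat(v, GROUP, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```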
Weight Averaging
SWA
parameters: {"steps":5065,"decay":0.4}
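The listed decay suggests an exponential-moving-average variant of weight averaging over the ~5k steps rather than a plain running mean; the exact update rule below is an assumption.

```python
# Hypothetical EMA-style weight averaging with the listed decay.

def ema_update(avg, weights, decay=0.4):
    """avg <- decay * avg + (1 - decay) * weights, elementwise."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```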
Quantization
QAT int6
bits: 6
scope: all
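The fake-quantisation step typical of QAT, specialised to 6 bits as listed. The PR's actual scheme (e.g. per-channel scales, symmetric vs. asymmetric) is not shown; this illustrates only the forward-pass snapping to 2^6 levels.

```python
BITS = 6
QMIN, QMAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1  # -32 .. 31

def fake_quant(x, scale):
    """Quantise-dequantise: snap each value to one of 2**BITS levels."""
    q = [min(max(round(v / scale), QMIN), QMAX) for v in x]
    return [qi * scale for qi in q]
```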
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Adam
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"embed_lr":0.05,"muon_weight_decay":0.04}
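The other_params suggest parameter-group-specific learning rates (and, given muon_weight_decay, possibly a Muon-style optimiser for matrix parameters). A hypothetical grouping by parameter name; how the PR actually partitions parameters is not shown.

```python
# Hypothetical optimiser parameter groups built from the listed values.

PARAM_GROUPS = [
    {"match": "embed",  "lr": 0.05, "weight_decay": 0.04},  # embeddings
    {"match": "matrix", "lr": 0.04, "weight_decay": 0.04},  # 2-D weights
    {"match": "scalar", "lr": 0.04, "weight_decay": 0.04},  # gains, gates
]

def lr_for(param_name, base_lr=0.04):
    for g in PARAM_GROUPS:
        if g["match"] in param_name:
            return g["lr"]
    return base_lr
```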
Evaluation
stride-based eval
parameters: {"stride":512}
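A sketch of stride-based evaluation with the listed stride of 512 and an assumed window of 1024 tokens (eval_length is not given above, so the window size matches train_length here). Each window after the first scores only the tokens not yet covered, so every token is scored exactly once with at least window − stride tokens of context.

```python
WINDOW, STRIDE = 1024, 512  # WINDOW is an assumption; STRIDE is listed

def eval_windows(n_tokens):
    """Yield (ctx_start, end, score_from): score tokens [score_from, end)
    using context from ctx_start."""
    scored = 0
    while scored < n_tokens:
        step = WINDOW if scored == 0 else STRIDE
        end = min(scored + step, n_tokens)
        start = max(0, end - WINDOW)
        yield start, end, scored
        scored = end
```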
Novel Contributions
- Per-document Test-Time Training (TTT) via LoRA adapters trained at evaluation time
- A fresh rank-128 LoRA adapter per validation document, trained on the preceding chunks before each next-chunk prediction
- Exploits the separate evaluation-time budget for adaptation, an approach orthogonal to all current leaderboard entries