PR #600
Non-record: TTT-LoRA Base — HumanAI Convention (val_bpb=1.2364)
Status: open
by humanaiconvention
val_bpb: 1.2364
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15.7 MB
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":128,"learning_rate":null,"chunk_size":64,"adam_steps_per_chunk":4,"batch_size":64,"eval_cap_seconds":480}
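A rough sketch of the per-document adaptation schedule implied by the listed parameters (chunk_size=64, adam_steps_per_chunk=4). The function name and structure are hypothetical; the PR's actual training loop is not shown here.

```python
# Hypothetical per-document TTT schedule: before predicting each chunk,
# the fresh LoRA adapter takes a fixed number of Adam steps on all
# preceding chunks of the same document.

CHUNK_SIZE = 64            # from the listed parameters
ADAM_STEPS_PER_CHUNK = 4   # from the listed parameters

def ttt_schedule(doc_len):
    """Yield (chunk_to_predict, context_chunks, optimizer_steps)."""
    n_chunks = (doc_len + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(n_chunks):
        context = list(range(i))                        # chunks seen so far
        steps = ADAM_STEPS_PER_CHUNK if context else 0  # chunk 0 has no context
        yield i, context, steps
```

For a 200-token document this yields four chunks, with the adapter untrained before chunk 0 and trained on chunks 0–2 before chunk 3.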
Architecture
SmearGate
A learnable residual-mixing gate in each transformer block that interpolates between the full residual stream and the full hidden state
parameters: null
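A minimal sketch of what such a gate could look like, assuming a scalar sigmoid-squashed gate per block; the PR does not specify the parameterisation, so `logit` and the update form are assumptions.

```python
import math

# Hypothetical SmearGate: a learnable scalar logit, squashed through a
# sigmoid, mixes the block's residual stream with its hidden state.
# logit -> +inf gives the full hidden state, logit -> -inf the full residual.

def smear_gate(residual, hidden, logit):
    g = 1.0 / (1.0 + math.exp(-logit))  # gate value in (0, 1)
    return [g * h + (1.0 - g) * r for r, h in zip(residual, hidden)]
```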
Orthogonal initialisation
All matrix parameters initialised orthogonally to improve gradient flow and training stability
parameters: null
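One standard way to initialise a matrix orthogonally (the PR does not show its exact method): QR-decompose a random Gaussian matrix and keep Q, sign-corrected so the result is uniformly distributed over orthogonal matrices.

```python
import numpy as np

def orthogonal_init(rows, cols, rng=None):
    """Return a (rows, cols) matrix with orthonormal columns (rows >= cols)."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)            # reduced QR: q is (rows, cols)
    q *= np.sign(np.diag(r))          # fix column signs for uniformity
    return q
```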
Bigram hash embeddings
A 2048-bucket bigram hash table whose embeddings are added to the token embeddings, providing cheap local context without extra counted parameters
parameters: {"buckets":2048}
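A sketch of the lookup this implies, with the bucket count taken from the listed parameters; the hash function itself is an assumption, and the real implementation may differ.

```python
# Hypothetical bigram hash embedding: each (prev, cur) token pair hashes
# into one of 2048 buckets, and that bucket's embedding is added to the
# current token's embedding.

BUCKETS = 2048  # from the listed parameters

def bigram_bucket(prev_tok, cur_tok):
    # Simple multiplicative hash; an assumption, not the PR's scheme.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def add_bigram_embeddings(token_ids, tok_emb, bigram_emb):
    out = []
    for i, t in enumerate(token_ids):
        e = list(tok_emb[t])
        if i > 0:  # the first token has no bigram context
            b = bigram_bucket(token_ids[i - 1], t)
            e = [x + y for x, y in zip(e, bigram_emb[b])]
        out.append(e)
    return out
```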
GQA (Grouped-Query Attention)
8 query heads and 4 KV heads, reducing the KV cache and allowing higher batch throughput during TTT evaluation
parameters: {"query_heads":8,"kv_heads":4}
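A minimal grouped-query attention sketch with the listed head counts: each KV head is shared by two query heads, so the KV cache is half that of standard multi-head attention. Masking and projections are omitted; this is an illustration, not the PR's code.

```python
import numpy as np

Q_HEADS, KV_HEADS = 8, 4          # from the listed parameters
GROUP = Q_HEADS // KV_HEADS       # query heads per KV head

def gqa(q, k, v):
    """q: (Q_HEADS, T, d); k, v: (KV_HEADS, T, d) -> (Q_HEADS, T, d)."""
    k = np.repeat(k, GROUP, axis=0)   # share each KV head across its group
    v = np.repeat(v, GROUP, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```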
Weight Averaging
SWA
parameters: {"steps":5065,"decay":0.4}
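The listed decay suggests an exponential-moving-average variant of weight averaging over the ~5k steps rather than a plain running mean; the exact update rule below is an assumption.

```python
# Hypothetical EMA-style weight averaging with the listed decay.

def ema_update(avg, weights, decay=0.4):
    """avg <- decay * avg + (1 - decay) * weights, elementwise."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```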
Quantization
QAT int6
bits: 6
scope: all
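The fake-quantisation step typical of QAT, specialised to 6 bits as listed. The PR's actual scheme (e.g. per-channel scales, symmetric vs. asymmetric) is not shown; this illustrates only the forward-pass snapping to 2^6 levels.

```python
BITS = 6
QMIN, QMAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1  # -32 .. 31

def fake_quant(x, scale):
    """Quantise-dequantise: snap each value to one of 2**BITS levels."""
    q = [min(max(round(v / scale), QMIN), QMAX) for v in x]
    return [qi * scale for qi in q]
```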
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Adam
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"embed_lr":0.05,"muon_weight_decay":0.04}
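The other_params suggest parameter-group-specific learning rates (and, given muon_weight_decay, possibly a Muon-style optimiser for matrix parameters). A hypothetical grouping by parameter name; how the PR actually partitions parameters is not shown.

```python
# Hypothetical optimiser parameter groups built from the listed values.

PARAM_GROUPS = [
    {"match": "embed",  "lr": 0.05, "weight_decay": 0.04},  # embeddings
    {"match": "matrix", "lr": 0.04, "weight_decay": 0.04},  # 2-D weights
    {"match": "scalar", "lr": 0.04, "weight_decay": 0.04},  # gains, gates
]

def lr_for(param_name, base_lr=0.04):
    for g in PARAM_GROUPS:
        if g["match"] in param_name:
            return g["lr"]
    return base_lr
```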
Evaluation
stride-based eval
parameters: {"stride":512}
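A sketch of stride-based evaluation with the listed stride of 512 and an assumed window of 1024 tokens (eval_length is not given above, so the window size matches train_length here). Each window after the first scores only the tokens not yet covered, so every token is scored exactly once with at least window − stride tokens of context.

```python
WINDOW, STRIDE = 1024, 512  # WINDOW is an assumption; STRIDE is listed

def eval_windows(n_tokens):
    """Yield (ctx_start, end, score_from): score tokens [score_from, end)
    using context from ctx_start."""
    scored = 0
    while scored < n_tokens:
        step = WINDOW if scored == 0 else STRIDE
        end = min(scored + step, n_tokens)
        start = max(0, end - WINDOW)
        yield start, end, scored
        scored = end
```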
Novel Contributions
- Per-document Test-Time Training (TTT) via LoRA adapters trained at evaluation time
- A fresh rank-128 LoRA adapter per validation document, trained on the preceding chunks before each next-chunk prediction
- Exploits the separate evaluation-time budget for adaptation, an approach orthogonal to all current leaderboard entries