PR #568 (closed)

Record: PROTEUS v8 — 11L INT6 + LoRA TTT 5ep cosine (mean val_bpb=0.7853, 4 seeds)

by MatoTeziTanka
val_bpb: 0.7853
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4-16.2 MB

Training Techniques

Quantization
  • int6: 6 bits, applied to all weight matrices
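A minimal sketch of what the int6 step could look like, assuming symmetric per-tensor quantization (the record specifies only the bit width and scope; the scale granularity and rounding mode are my assumptions):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit integers in [-31, 31]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

Symmetric quantization keeps zero exactly representable, which matters once magnitude pruning (below) has zeroed a slice of the weights.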
Architecture
  • SmearGate: custom gating component in the transformer architecture
  • BigramHash: bigram-hash feature module (size: 2048, dim: 128)
  • MLP3x: MLP with 3x expansion (hidden_size: 1536)
  • RoPE: rotary positional embeddings with NTK-aware eval scaling (base: 50000)
  • Tied embeddings: input and output embeddings are shared
  • KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4)
  • Depth-scaled residual: residual scaled by the inverse square root of layer index plus one
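The depth-scaled residual rule is simple enough to state directly; a sketch of the scale factor, assuming the residual branch of 0-indexed layer l is multiplied by 1/sqrt(l + 1):

```python
import math

def residual_scale(layer_index: int) -> float:
    # Residual branch of layer l is scaled by 1 / sqrt(l + 1),
    # damping the contribution of deeper layers to the residual stream.
    return 1.0 / math.sqrt(layer_index + 1)
```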
Optimizer
  • Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02
  • AdamW: weight_decay 0.04, used for embeddings and scalars
Weight Averaging
  • SWA: averages 11 checkpoints taken during the last 20% of warmdown
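The averaging itself is a uniform mean over saved parameter states; a minimal sketch with numpy arrays standing in for parameter tensors (the checkpoint selection logic is not shown):

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Uniform (SWA-style) average of parameter tensors across checkpoints."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}
```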
Compression
  • zstd: level 22
Test-Time Training
  • LoRA TTT: rank 8, learning rate 0.01, 5 epochs, cosine decay schedule, applied to Q, V, and the LM head
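A sketch of the adapter shape implied by rank 8 over a frozen base weight; the zero init of B (so the adapter starts as a no-op) is a standard LoRA choice assumed here, not stated in the record:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable rank-r update B @ A.

    During TTT only A and B are updated; W stays fixed.
    """
    def __init__(self, W, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))              # trainable, zero init

    def __call__(self, x):
        # y = x W^T + x A^T B^T  (low-rank correction to the base layer)
        return x @ self.W.T + (x @ self.A.T) @ self.B.T
```

With B initialized to zero, the adapted layer reproduces the base model exactly before the first TTT step.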
Evaluation
  • Score every epoch: sequential chunk evaluation, last epoch's score kept
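The score-before-train loop can be sketched as below; `score_fn` and `train_fn` are hypothetical stand-ins for the model's bits accounting and its adaptation step:

```python
def ttt_epoch(chunks, score_fn, train_fn):
    """One TTT epoch: each chunk is scored under the current weights
    *before* the model trains on it, so no token's score benefits from
    having already been trained on within the same pass."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_fn(chunk)   # bits under current weights
        train_fn(chunk)                 # then adapt on the same chunk
        total_tokens += len(chunk)
    return total_bits / total_tokens    # mean bits per token this epoch
```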
LR Schedule
  • Cosine decay: start_lr 0.01, end_lr 0.001, over 5 epochs
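Assuming the standard cosine parameterization over the 5 TTT epochs (the exact endpoint handling is my guess):

```python
import math

def cosine_lr(epoch, total_epochs=5, start_lr=0.01, end_lr=0.001):
    """Cosine decay from start_lr (epoch 0) to end_lr (final epoch)."""
    t = epoch / (total_epochs - 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```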
Regularization
  • Weight decay: 0.04
  • Gradient clipping: 0.3
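Clipping at 0.3 presumably refers to the global L2 norm of all gradients; a sketch under that assumption:

```python
import math
import numpy as np

def clip_global_norm(grads, max_norm=0.3):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```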
Other
  • Magnitude pruning (3% and 5%) used to fit the artifact size constraint
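Magnitude pruning at 3-5% could look like the following; per-tensor thresholding is my assumption, as the record does not say whether the threshold is global or per-tensor:

```python
import numpy as np

def magnitude_prune(w, prune_percent):
    """Zero the smallest prune_percent% of entries by absolute value."""
    threshold = np.percentile(np.abs(w), prune_percent)
    return np.where(np.abs(w) < threshold, 0.0, w)
```

Zeroed weights compress well under zstd, which is presumably how pruning helps meet the 16 MB artifact limit.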

Novel Contributions

  • Improved the TTT evaluation strategy: every token is scored before being trained on, in every epoch
  • Extended TTT from 3 epochs to 5
  • Switched the TTT learning rate schedule from flat to cosine decay
  • Applied LoRA-based test-time training to Q, V, and the LM head
  • Ran multiple seeds, plus a rerun with higher pruning to satisfy the 16 MB artifact limit