PR #512

closed

Record: PROTEUS v7 — 11L INT6 + LoRA TTT (mean val_bpb=0.9512, 3 seeds)

by MatoTeziTanka
val_bpb: 0.9512
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.4 MB

Training Techniques

Quantization
int6
bits: 6
scope: all weight matrices
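The INT6 scheme above can be sketched as symmetric uniform quantization. This is a minimal sketch under assumptions: the record states bits=6 over all weight matrices, but not the scaling granularity or rounding mode, so per-tensor absmax scaling with round-to-nearest is assumed here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor uniform quantization to 6 bits (assumed scheme)."""
    qmax = 31  # signed 6-bit symmetric range: -31..31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int6(w)
gap = np.abs(w - dequantize_int6(q, s)).max()  # worst-case error is s / 2
```

With absmax scaling the maximum weight maps exactly onto the largest level, so the per-weight error is bounded by half a quantization step.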
Architecture
SmearGate
Added SmearGate as part of the model architecture.
parameters: null
BigramHash
Uses BigramHash features in the model.
parameters: {"dimensions":128,"hash_size":2048}
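A hashed-bigram feature maps each adjacent token pair into a fixed bucket table, whose buckets then index a learned embedding. Hypothetical sketch: the record gives hash_size=2048 and 128 dimensions, but the mixing function and the choice of initial "previous" token below are assumptions.

```python
def bigram_hash_ids(tokens, hash_size=2048):
    """Hash each (previous, current) token pair into one of hash_size buckets."""
    ids, prev = [], 0  # prev=0 for the first position is an assumption
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % hash_size)
        prev = t
    return ids

ids = bigram_hash_ids([5, 17, 17, 3])
```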
MLP3x
Uses a 3x expansion MLP with relu² activation.
parameters: {"hidden_size":1536}
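The MLP block expands by 3x and uses relu² (ReLU followed by squaring). A numpy sketch, assuming hidden_size=1536 corresponds to a 512-dim model width (3x expansion implies this, but the width is not stated directly):

```python
import numpy as np

def relu2(x):
    # relu² activation: ReLU followed by squaring
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: d -> 3d -> d with relu² in the middle."""
    return relu2(x @ w_in) @ w_out

d, h = 512, 1536
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
y = mlp3x(x, rng.standard_normal((d, h)) / np.sqrt(d),
          rng.standard_normal((h, d)) / np.sqrt(h))
```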
RoPE
Uses RoPE with NTK-aware evaluation scaling.
parameters: {"base":50000}
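NTK-aware evaluation scaling raises the rotary base so low-frequency rotations stretch when evaluating past the training context, without retraining. The exponent head_dim / (head_dim - 2) below follows the standard NTK-aware formula; the head dimension itself is not stated in the record, so 64 here is an illustrative assumption.

```python
def ntk_scaled_base(base, context_scale, head_dim):
    """NTK-aware RoPE base adjustment for longer evaluation contexts."""
    return base * context_scale ** (head_dim / (head_dim - 2))

def rope_freqs(base, head_dim):
    # Per-pair rotation frequencies theta_i = base^(-2i / head_dim)
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

scaled = ntk_scaled_base(50000, 2.0, 64)  # base for a 2x-longer context
```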
Tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
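With 8 query heads sharing 4 KV heads, each KV head serves 2 query heads, halving the KV cache. A minimal sketch of the head-sharing step (the surrounding attention computation is omitted):

```python
import numpy as np

def expand_kv(kv, n_heads=8):
    """Repeat each KV head so n_heads query heads share kv.shape[0] KV heads."""
    n_kv = kv.shape[0]
    return np.repeat(kv, n_heads // n_kv, axis=0)

kv = np.arange(12, dtype=np.float32).reshape(4, 3)  # (kv_heads=4, head_dim=3)
full = expand_kv(kv)  # (8, 3): query heads 0,1 share KV head 0, etc.
```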
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
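Muon applies the matrix learning rate to an orthogonalized momentum update, computed with a Newton-Schulz iteration, while AdamW handles embeddings and scalars. The iteration can be sketched as below; the quintic coefficients follow the public Muon implementation, and this is a sketch, not the record's exact optimizer code.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration driving the singular values of g
    toward ~1, approximating an orthogonalized version of the matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

o = newton_schulz_orthogonalize(np.eye(4))
```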
Weight Averaging
SWA
parameters: {"checkpoints":11,"last_fraction":0.2}
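SWA here averages 11 checkpoints drawn from the last 20% of training. A minimal sketch of the averaging step, assuming a uniform average over checkpoint state dicts:

```python
def swa_average(checkpoints):
    """Uniform average of checkpoint state dicts sharing the same keys."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = swa_average([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
```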
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"batch_size":64,"epochs":3,"chunk_size":256,"min_doc_len":512,"scope":"Q + V projections + LM head","per_document":true,"multi_epoch":true,"backward_looking":true}
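Per the parameters above, rank-8 LoRA adapters on the Q/V projections and LM head are fine-tuned on each document's earlier chunks (backward-looking, per document) before scoring the final pass. A minimal numpy sketch of the adapter math only; the TTT training loop is omitted, and alpha plus the zero-init of B are standard LoRA conventions assumed here rather than stated in the record.

```python
import numpy as np

def lora_forward(x, w, a, b, alpha=1.0):
    """LoRA: y = x @ W + alpha * (x @ A) @ B, with A: (d_in, r), B: (r, d_out)."""
    return x @ w + alpha * (x @ a) @ b

d, r = 16, 8  # rank r=8 per the record; d=16 is illustrative
rng = np.random.default_rng(0)
w = rng.standard_normal((d, d))
a = rng.standard_normal((d, r)) * 0.01
b = np.zeros((r, d))  # zero-init B: the adapter starts as an exact no-op
x = rng.standard_normal((2, d))
y = lora_forward(x, w, a, b)
```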
Initialization
OrthoInit
Orthogonal initialization used for model components.
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Regularization
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"clip_norm":0.3}
Other
other
Depth-scaled residual connections with attenuation 1/sqrt(layer_idx + 1) for stability.
parameters: {"layers":11}
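The attenuation above can be sketched directly; each block's residual branch is scaled by 1/sqrt(layer_idx + 1), with layers indexed 0..10 for the 11-layer model:

```python
import math

def residual_scale(layer_idx):
    """Depth-scaled residual attenuation: x = x + residual_scale(i) * block(x)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [residual_scale(i) for i in range(11)]  # 1.0 down to 1/sqrt(11)
```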
other
Fresh model copy used for TTT evaluation to avoid torch.compile graph caching.
parameters: null

Novel Contributions

  • INT6 uniform quantization for all weight matrices with a low quantization gap
  • Depth-scaled residual connections for 11-layer stability
  • Backward-looking LoRA test-time training with per-document adaptation
  • Fresh model copy during TTT evaluation to avoid torch.compile graph caching
  • Multi-epoch TTT with scoring on the final pass
  • Skipping TTT for documents shorter than 512 tokens