PR #633
openPROTEUS v9 — 11L INT6 + single-epoch LoRA TTT (mean val_bpb=1.1526, 3 seeds)
by MatoTeziTanka
- val_bpb: 1.1526
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 15.4 MB
## Training Techniques

### Quantization
- INT6 GPTQ-lite (bits: 6, scope: all)
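The quantizer itself is not shown in the PR; the contributions list below mentions trying 5 clip percentiles per row and keeping the lowest-MSE choice. A minimal sketch of that per-row search, assuming symmetric INT6 quantization and a guessed percentile grid:

```python
import numpy as np

def quantize_row_int6(row, clip_percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Symmetric INT6 quantization of one weight row.

    Tries several clip percentiles for the scale and keeps the one with
    the lowest reconstruction MSE. (The percentile grid here is a guess;
    the PR only says "5 clip percentiles per row".)
    """
    qmax = 31  # symmetric signed 6-bit range [-31, 31]
    best = None
    for p in clip_percentiles:
        clip = np.percentile(np.abs(row), p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        mse = float(np.mean((q * scale - row) ** 2))
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best  # (mse, 6-bit codes stored in int8, per-row scale)

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
mse, q, scale = quantize_row_int6(w)
```

Dequantization is just `q * scale`, so only the int8 codes and one scale per row need to be stored in the artifact.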
### Architecture
- XSA: cross self-attention on the last 4 layers (parameters: {"layers": 4})
- SmearGate: custom gating mechanism
- BigramHash: bigram hashing with 2048 buckets and 128-dimensional embeddings (parameters: {"buckets": 2048, "dimension": 128})
- RoPE: rotary positional embeddings with base 50,000 and NTK-aware eval scaling (parameters: {"base": 50000})
- Depth-scaled residual: residual scaled by 1/sqrt(layer_idx + 1) per block
- Weight tying: tied input/output embeddings
- MLP3x: MLP with 3x expansion and relu² activation (parameters: {"hidden_dim": 1536})
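The depth-scaled residual entry has a concrete formula; a minimal sketch of how that scaling would sit in a block loop (`block` is a stand-in callable; the PR does not show the forward pass):

```python
import math

def residual_scale(layer_idx: int) -> float:
    # Residual scaling by 1/sqrt(layer_idx + 1) per block, as listed above:
    # deeper blocks contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer_idx + 1)

def forward(x, blocks):
    # `blocks` is a list of callables standing in for attention+MLP blocks;
    # the real block internals are not part of this PR summary.
    for i, block in enumerate(blocks):
        x = x + residual_scale(i) * block(x)
    return x

# Toy usage: two identity blocks on a scalar input.
out = forward(1.0, [lambda v: v, lambda v: v])
```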
### Optimizer
- Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.025
- AdamW: weight_decay 0.04, applied to embeddings/scalars
### Weight Averaging
- EMA: decay 0.997, updated every step
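The EMA entry is simple enough to state exactly; a sketch of the every-step update with decay 0.997 (the parameter-dict layout is illustrative):

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    # EMA of the weights, updated every optimizer step (decay 0.997 per
    # the parameters above). Typically the averaged copy, not the raw
    # weights, is what gets evaluated and exported, though the PR does
    # not spell that out.
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p

params = {"w": np.ones(4)}
ema = {"w": np.zeros(4)}
ema_update(ema, params)  # one step: 0.997 * 0 + 0.003 * 1
```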
### Compression
- zstd: level 22
### Test-Time Training
- LoRA TTT: rank 8, learning_rate 0.01, betas (0.9, 0.95), batch_size 64, min_document_length 512, single epoch; targets: Q projections, V projections, LM head
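A toy sketch of the LoRA parameterization (rank 8 and lr 0.01 from the parameters above) and the score-then-train control flow. A least-squares loss and plain SGD stand in for the real LM loss and Adam optimizer, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr = 16, 8, 0.01   # rank 8 and lr 0.01 from the PR; d is a toy width

W = rng.normal(size=(d, d)) / np.sqrt(d)        # frozen base weight (stand-in for a Q/V projection)
A = np.zeros((d, rank))                         # LoRA factors: W_eff = W + A @ B
B = rng.normal(size=(rank, d)) / np.sqrt(rank)  # A starts at zero, so W_eff starts at W

def loss(X, Y):
    E = X @ (W + A @ B) - Y     # toy least-squares loss in place of the LM loss
    return float(np.mean(E ** 2))

def lora_step(X, Y):
    # One SGD step on A and B only; W stays frozen. (The PR uses Adam
    # with betas (0.9, 0.95) and batch size 64; SGD keeps the sketch short.)
    global A, B
    E = X @ (W + A @ B) - Y
    G = 2.0 * X.T @ E / E.size  # gradient w.r.t. the effective weight
    gA, gB = G @ B.T, A.T @ G
    A -= lr * gA
    B -= lr * gB

def score_then_train(X, Y, min_len=4):
    # Score-then-train: score the document with the CURRENT weights first,
    # then take a single-epoch adaptation step on it, so no tokens are
    # trained on before they are scored. (min_document_length is 512 in
    # the PR; 4 rows here to keep the toy small.)
    s = loss(X, Y)
    if len(X) >= min_len:
        lora_step(X, Y)
    return s

X = rng.normal(size=(8, d))
Y = X @ (W + 0.1 * rng.normal(size=(d, d)))  # "document" whose target differs from W
s_before = score_then_train(X, Y)
s_after = loss(X, Y)
```

The single adaptation step lowers the loss on the document, but the reported score is always computed before that step.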
### LR Schedule
- Warmdown: 3000 warmdown iterations, wallclock-based
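The schedule card only gives the iteration count and the wallclock trigger. A minimal step-based sketch, assuming a linear warmdown shape (the actual shape and the wallclock bookkeeping are not shown in the PR):

```python
def warmdown_lr(step, base_lr, total_steps, warmdown_iters=3000):
    # Hold the full LR until the final `warmdown_iters` steps, then decay
    # linearly to zero. (The PR says the trigger is wallclock-based; a
    # step-based version is shown for simplicity, and the linear shape
    # is an assumption.)
    if step < total_steps - warmdown_iters:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_iters
```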
### Regularization
- Weight decay: 0.04
- Gradient clipping: clip value 0.3
- Magnitude pruning: 3%
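A minimal sketch of the 3% magnitude pruning; the PR does not say whether it is applied globally or per-tensor, or at what point in training, so a one-shot per-tensor version is shown:

```python
import numpy as np

def magnitude_prune(w, percentage=3.0):
    # Zero the smallest-magnitude `percentage` of entries (3% per the
    # parameters above). Runs of exact zeros also compress better under
    # zstd, which plausibly helps with the artifact size budget.
    threshold = np.percentile(np.abs(w), percentage)
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
pruned = magnitude_prune(w)
```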
### Other
- Score-then-train single-epoch TTT: each document is scored with the current weights before the single training pass on it, so no evaluation tokens are trained on before they are scored
## Novel Contributions
- Single-epoch test-time training (TTT) with a score-then-train pattern, complying with rules against multi-epoch TTT
- INT6 GPTQ-lite quantization that tries 5 clip percentiles per row and keeps the one with the lowest MSE
- LoRA TTT targeting the Q projections, V projections, and LM head, combined with single-epoch scoring
- Architecture modifications: SmearGate, BigramHash, RoPE with NTK-aware scaling, depth-scaled residuals, and U-Net skip connections
- Muon optimizer with a separate matrix_lr, plus AdamW for embeddings/scalars
- zstd-22 artifact compression, bringing the artifact to ~15.4 MB within the 16 MB budget