PR #512

closed

Record: PROTEUS v7 — 11L INT6 + LoRA TTT (mean val_bpb=0.9512, 3 seeds)

by MatoTeziTanka
val_bpb: 0.9512
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.4 MB

Training Techniques

Quantization
int6
bits: 6
scope: all weight matrices
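The INT6 scheme above can be sketched as symmetric uniform quantization. This is a minimal sketch under assumptions: the record states bits=6 over all weight matrices, but not the scaling granularity or rounding mode, so per-tensor absmax scaling with round-to-nearest is assumed here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor uniform quantization to 6 bits (assumed scheme)."""
    qmax = 31  # signed 6-bit symmetric range: -31..31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int6(w)
gap = np.abs(w - dequantize_int6(q, s)).max()  # worst-case error is s / 2
```

With absmax scaling the maximum weight maps exactly onto the largest level, so the per-weight error is bounded by half a quantization step.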
Architecture
SmearGate
Added SmearGate as part of the model architecture.
parameters: null
BigramHash
Uses BigramHash features in the model.
parameters: {"dimensions":128,"hash_size":2048}
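A hashed-bigram feature maps each adjacent token pair into a fixed bucket table, whose buckets then index a learned embedding. Hypothetical sketch: the record gives hash_size=2048 and 128 dimensions, but the mixing function and the choice of initial "previous" token below are assumptions.

```python
def bigram_hash_ids(tokens, hash_size=2048):
    """Hash each (previous, current) token pair into one of hash_size buckets."""
    ids, prev = [], 0  # prev=0 for the first position is an assumption
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % hash_size)
        prev = t
    return ids

ids = bigram_hash_ids([5, 17, 17, 3])
```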
MLP3x
Uses a 3x expansion MLP with relu² activation.
parameters: {"hidden_size":1536}
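The MLP block expands by 3x and uses relu² (ReLU followed by squaring). A numpy sketch, assuming hidden_size=1536 corresponds to a 512-dim model width (3x expansion implies this, but the width is not stated directly):

```python
import numpy as np

def relu2(x):
    # relu² activation: ReLU followed by squaring
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: d -> 3d -> d with relu² in the middle."""
    return relu2(x @ w_in) @ w_out

d, h = 512, 1536
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
y = mlp3x(x, rng.standard_normal((d, h)) / np.sqrt(d),
          rng.standard_normal((h, d)) / np.sqrt(h))
```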
RoPE
Uses RoPE with NTK-aware evaluation scaling.
parameters: {"base":50000}
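NTK-aware evaluation scaling raises the rotary base so low-frequency rotations stretch when evaluating past the training context, without retraining. The exponent head_dim / (head_dim - 2) below follows the standard NTK-aware formula; the head dimension itself is not stated in the record, so 64 here is an illustrative assumption.

```python
def ntk_scaled_base(base, context_scale, head_dim):
    """NTK-aware RoPE base adjustment for longer evaluation contexts."""
    return base * context_scale ** (head_dim / (head_dim - 2))

def rope_freqs(base, head_dim):
    # Per-pair rotation frequencies theta_i = base^(-2i / head_dim)
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

scaled = ntk_scaled_base(50000, 2.0, 64)  # base for a 2x-longer context
```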
Tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
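With 8 query heads sharing 4 KV heads, each KV head serves 2 query heads, halving the KV cache. A minimal sketch of the head-sharing step (the surrounding attention computation is omitted):

```python
import numpy as np

def expand_kv(kv, n_heads=8):
    """Repeat each KV head so n_heads query heads share kv.shape[0] KV heads."""
    n_kv = kv.shape[0]
    return np.repeat(kv, n_heads // n_kv, axis=0)

kv = np.arange(12, dtype=np.float32).reshape(4, 3)  # (kv_heads=4, head_dim=3)
full = expand_kv(kv)  # (8, 3): query heads 0,1 share KV head 0, etc.
```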
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
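Muon applies the matrix learning rate to an orthogonalized momentum update, computed with a Newton-Schulz iteration, while AdamW handles embeddings and scalars. The iteration can be sketched as below; the quintic coefficients follow the public Muon implementation, and this is a sketch, not the record's exact optimizer code.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration driving the singular values of g
    toward ~1, approximating an orthogonalized version of the matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

o = newton_schulz_orthogonalize(np.eye(4))
```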
Weight Averaging
SWA
parameters: {"checkpoints":11,"last_fraction":0.2}
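SWA here averages 11 checkpoints drawn from the last 20% of training. A minimal sketch of the averaging step, assuming a uniform average over checkpoint state dicts:

```python
def swa_average(checkpoints):
    """Uniform average of checkpoint state dicts sharing the same keys."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = swa_average([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
```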
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"batch_size":64,"epochs":3,"chunk_size":256,"min_doc_len":512,"scope":"Q + V projections + LM head","per_document":true,"multi_epoch":true,"backward_looking":true}
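Per the parameters above, rank-8 LoRA adapters on the Q/V projections and LM head are fine-tuned on each document's earlier chunks (backward-looking, per document) before scoring the final pass. A minimal numpy sketch of the adapter math only; the TTT training loop is omitted, and alpha plus the zero-init of B are standard LoRA conventions assumed here rather than stated in the record.

```python
import numpy as np

def lora_forward(x, w, a, b, alpha=1.0):
    """LoRA: y = x @ W + alpha * (x @ A) @ B, with A: (d_in, r), B: (r, d_out)."""
    return x @ w + alpha * (x @ a) @ b

d, r = 16, 8  # rank r=8 per the record; d=16 is illustrative
rng = np.random.default_rng(0)
w = rng.standard_normal((d, d))
a = rng.standard_normal((d, r)) * 0.01
b = np.zeros((r, d))  # zero-init B: the adapter starts as an exact no-op
x = rng.standard_normal((2, d))
y = lora_forward(x, w, a, b)
```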
Initialization
OrthoInit
Orthogonal initialization used for model components.
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Regularization
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"clip_norm":0.3}
Other
other
Depth-scaled residual connections with attenuation 1/sqrt(layer_idx + 1) for stability.
parameters: {"layers":11}
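The attenuation above can be sketched directly; each block's residual branch is scaled by 1/sqrt(layer_idx + 1), with layers indexed 0..10 for the 11-layer model:

```python
import math

def residual_scale(layer_idx):
    """Depth-scaled residual attenuation: x = x + residual_scale(i) * block(x)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [residual_scale(i) for i in range(11)]  # 1.0 down to 1/sqrt(11)
```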
other
Fresh model copy used for TTT evaluation to avoid torch.compile graph caching.
parameters: null

Novel Contributions

  • INT6 uniform quantization for all weight matrices with a low quantization gap
  • Depth-scaled residual connections for 11-layer stability
  • Backward-looking LoRA test-time training with per-document adaptation
  • Fresh model copy during TTT evaluation to avoid torch.compile graph caching
  • Multi-epoch TTT with scoring on the final pass
  • Skipping TTT for documents shorter than 512 tokens