PR #633
openPROTEUS v9 — 11L INT6 + single-epoch LoRA TTT (mean val_bpb=1.1526, 3 seeds)
by MatoTeziTanka
- val_bpb: 1.1526
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 15.4 MB
## Training Techniques

### Quantization
- INT6 GPTQ-lite (bits: 6, scope: all)
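The quantizer itself is not shown in the PR; the contributions list below mentions trying 5 clip percentiles per row and keeping the lowest-MSE choice. A minimal sketch of that per-row search, assuming symmetric INT6 quantization and a guessed percentile grid:

```python
import numpy as np

def quantize_row_int6(row, clip_percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Symmetric INT6 quantization of one weight row.

    Tries several clip percentiles for the scale and keeps the one with
    the lowest reconstruction MSE. (The percentile grid here is a guess;
    the PR only says "5 clip percentiles per row".)
    """
    qmax = 31  # symmetric signed 6-bit range [-31, 31]
    best = None
    for p in clip_percentiles:
        clip = np.percentile(np.abs(row), p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        mse = float(np.mean((q * scale - row) ** 2))
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best  # (mse, 6-bit codes stored in int8, per-row scale)

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
mse, q, scale = quantize_row_int6(w)
```

Dequantization is just `q * scale`, so only the int8 codes and one scale per row need to be stored in the artifact.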
### Architecture
- XSA: cross self-attention on the last 4 layers (parameters: {"layers": 4})
- SmearGate: custom gating mechanism
- BigramHash: bigram hashing with 2048 buckets and 128-dimensional embeddings (parameters: {"buckets": 2048, "dimension": 128})
- RoPE: rotary positional embeddings with base 50,000 and NTK-aware eval scaling (parameters: {"base": 50000})
- Depth-scaled residual: residual scaled by 1/sqrt(layer_idx + 1) per block
- Weight tying: tied input/output embeddings
- MLP3x: MLP with 3x expansion and relu² activation (parameters: {"hidden_dim": 1536})
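The depth-scaled residual entry has a concrete formula; a minimal sketch of how that scaling would sit in a block loop (`block` is a stand-in callable; the PR does not show the forward pass):

```python
import math

def residual_scale(layer_idx: int) -> float:
    # Residual scaling by 1/sqrt(layer_idx + 1) per block, as listed above:
    # deeper blocks contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer_idx + 1)

def forward(x, blocks):
    # `blocks` is a list of callables standing in for attention+MLP blocks;
    # the real block internals are not part of this PR summary.
    for i, block in enumerate(blocks):
        x = x + residual_scale(i) * block(x)
    return x

# Toy usage: two identity blocks on a scalar input.
out = forward(1.0, [lambda v: v, lambda v: v])
```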
### Optimizer
- Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.025
- AdamW: weight_decay 0.04, applied to embeddings/scalars
### Weight Averaging
- EMA: decay 0.997, updated every step
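The EMA entry is simple enough to state exactly; a sketch of the every-step update with decay 0.997 (the parameter-dict layout is illustrative):

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    # EMA of the weights, updated every optimizer step (decay 0.997 per
    # the parameters above). Typically the averaged copy, not the raw
    # weights, is what gets evaluated and exported, though the PR does
    # not spell that out.
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p

params = {"w": np.ones(4)}
ema = {"w": np.zeros(4)}
ema_update(ema, params)  # one step: 0.997 * 0 + 0.003 * 1
```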
### Compression
- zstd: level 22
### Test-Time Training
- LoRA TTT: rank 8, learning_rate 0.01, betas (0.9, 0.95), batch_size 64, min_document_length 512, single epoch; targets: Q projections, V projections, LM head
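A toy sketch of the LoRA parameterization (rank 8 and lr 0.01 from the parameters above) and the score-then-train control flow. A least-squares loss and plain SGD stand in for the real LM loss and Adam optimizer, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr = 16, 8, 0.01   # rank 8 and lr 0.01 from the PR; d is a toy width

W = rng.normal(size=(d, d)) / np.sqrt(d)        # frozen base weight (stand-in for a Q/V projection)
A = np.zeros((d, rank))                         # LoRA factors: W_eff = W + A @ B
B = rng.normal(size=(rank, d)) / np.sqrt(rank)  # A starts at zero, so W_eff starts at W

def loss(X, Y):
    E = X @ (W + A @ B) - Y     # toy least-squares loss in place of the LM loss
    return float(np.mean(E ** 2))

def lora_step(X, Y):
    # One SGD step on A and B only; W stays frozen. (The PR uses Adam
    # with betas (0.9, 0.95) and batch size 64; SGD keeps the sketch short.)
    global A, B
    E = X @ (W + A @ B) - Y
    G = 2.0 * X.T @ E / E.size  # gradient w.r.t. the effective weight
    gA, gB = G @ B.T, A.T @ G
    A -= lr * gA
    B -= lr * gB

def score_then_train(X, Y, min_len=4):
    # Score-then-train: score the document with the CURRENT weights first,
    # then take a single-epoch adaptation step on it, so no tokens are
    # trained on before they are scored. (min_document_length is 512 in
    # the PR; 4 rows here to keep the toy small.)
    s = loss(X, Y)
    if len(X) >= min_len:
        lora_step(X, Y)
    return s

X = rng.normal(size=(8, d))
Y = X @ (W + 0.1 * rng.normal(size=(d, d)))  # "document" whose target differs from W
s_before = score_then_train(X, Y)
s_after = loss(X, Y)
```

The single adaptation step lowers the loss on the document, but the reported score is always computed before that step.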
### LR Schedule
- Warmdown: 3000 warmdown iterations, wallclock-based
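The schedule card only gives the iteration count and the wallclock trigger. A minimal step-based sketch, assuming a linear warmdown shape (the actual shape and the wallclock bookkeeping are not shown in the PR):

```python
def warmdown_lr(step, base_lr, total_steps, warmdown_iters=3000):
    # Hold the full LR until the final `warmdown_iters` steps, then decay
    # linearly to zero. (The PR says the trigger is wallclock-based; a
    # step-based version is shown for simplicity, and the linear shape
    # is an assumption.)
    if step < total_steps - warmdown_iters:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_iters
```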
### Regularization
- Weight decay: 0.04
- Gradient clipping: clip value 0.3
- Magnitude pruning: 3%
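A minimal sketch of the 3% magnitude pruning; the PR does not say whether it is applied globally or per-tensor, or at what point in training, so a one-shot per-tensor version is shown:

```python
import numpy as np

def magnitude_prune(w, percentage=3.0):
    # Zero the smallest-magnitude `percentage` of entries (3% per the
    # parameters above). Runs of exact zeros also compress better under
    # zstd, which plausibly helps with the artifact size budget.
    threshold = np.percentile(np.abs(w), percentage)
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
pruned = magnitude_prune(w)
```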
### Other
- Score-then-train single-epoch TTT: each document is scored with the current weights before the single training pass on it, so no evaluation tokens are trained on before they are scored
## Novel Contributions
- Single-epoch test-time training (TTT) with a score-then-train pattern, complying with rules against multi-epoch TTT
- INT6 GPTQ-lite quantization that tries 5 clip percentiles per row and keeps the one with the lowest MSE
- LoRA TTT targeting the Q projections, V projections, and LM head, combined with single-epoch scoring
- Architecture modifications: SmearGate, BigramHash, RoPE with NTK-aware scaling, depth-scaled residuals, and U-Net skip connections
- Muon optimizer with a separate matrix_lr, plus AdamW for embeddings/scalars
- zstd-22 artifact compression, bringing the artifact to ~15.4 MB within the 16 MB budget