PR #568 (closed)

Record: PROTEUS v8 — 11L INT6 + LoRA TTT 5ep cosine (mean val_bpb=0.7853, 4 seeds)

by MatoTeziTanka
val_bpb: 0.7853
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4-16.2 MB

Training Techniques

Quantization
  • int6: 6 bits, applied to all weight matrices
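A minimal sketch of what the int6 step could look like, assuming symmetric per-tensor quantization (the record specifies only the bit width and scope; the scale granularity and rounding mode are my assumptions):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit integers in [-31, 31]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

Symmetric quantization keeps zero exactly representable, which matters once magnitude pruning (below) has zeroed a slice of the weights.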
Architecture
  • SmearGate: custom gating component in the transformer architecture
  • BigramHash: bigram-hash feature module (size: 2048, dim: 128)
  • MLP3x: MLP with 3x expansion (hidden_size: 1536)
  • RoPE: rotary positional embeddings with NTK-aware eval scaling (base: 50000)
  • Tied embeddings: input and output embeddings are shared
  • KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4)
  • Depth-scaled residual: residual scaled by the inverse square root of layer index plus one
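The depth-scaled residual rule is simple enough to state directly; a sketch of the scale factor, assuming the residual branch of 0-indexed layer l is multiplied by 1/sqrt(l + 1):

```python
import math

def residual_scale(layer_index: int) -> float:
    # Residual branch of layer l is scaled by 1 / sqrt(l + 1),
    # damping the contribution of deeper layers to the residual stream.
    return 1.0 / math.sqrt(layer_index + 1)
```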
Optimizer
  • Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02
  • AdamW: weight_decay 0.04, used for embeddings and scalars
Weight Averaging
  • SWA: averages 11 checkpoints taken during the last 20% of warmdown
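The averaging itself is a uniform mean over saved parameter states; a minimal sketch with numpy arrays standing in for parameter tensors (the checkpoint selection logic is not shown):

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Uniform (SWA-style) average of parameter tensors across checkpoints."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}
```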
Compression
  • zstd: level 22
Test-Time Training
  • LoRA TTT: rank 8, learning rate 0.01, 5 epochs, cosine decay schedule, applied to Q, V, and the LM head
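A sketch of the adapter shape implied by rank 8 over a frozen base weight; the zero init of B (so the adapter starts as a no-op) is a standard LoRA choice assumed here, not stated in the record:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable rank-r update B @ A.

    During TTT only A and B are updated; W stays fixed.
    """
    def __init__(self, W, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))              # trainable, zero init

    def __call__(self, x):
        # y = x W^T + x A^T B^T  (low-rank correction to the base layer)
        return x @ self.W.T + (x @ self.A.T) @ self.B.T
```

With B initialized to zero, the adapted layer reproduces the base model exactly before the first TTT step.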
Evaluation
  • Score every epoch: sequential chunk evaluation, last epoch's score kept
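The score-before-train loop can be sketched as below; `score_fn` and `train_fn` are hypothetical stand-ins for the model's bits accounting and its adaptation step:

```python
def ttt_epoch(chunks, score_fn, train_fn):
    """One TTT epoch: each chunk is scored under the current weights
    *before* the model trains on it, so no token's score benefits from
    having already been trained on within the same pass."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_fn(chunk)   # bits under current weights
        train_fn(chunk)                 # then adapt on the same chunk
        total_tokens += len(chunk)
    return total_bits / total_tokens    # mean bits per token this epoch
```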
LR Schedule
  • Cosine decay: start_lr 0.01, end_lr 0.001, over 5 epochs
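Assuming the standard cosine parameterization over the 5 TTT epochs (the exact endpoint handling is my guess):

```python
import math

def cosine_lr(epoch, total_epochs=5, start_lr=0.01, end_lr=0.001):
    """Cosine decay from start_lr (epoch 0) to end_lr (final epoch)."""
    t = epoch / (total_epochs - 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```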
Regularization
  • Weight decay: 0.04
  • Gradient clipping: 0.3
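Clipping at 0.3 presumably refers to the global L2 norm of all gradients; a sketch under that assumption:

```python
import math
import numpy as np

def clip_global_norm(grads, max_norm=0.3):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```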
Other
  • Magnitude pruning (3% and 5%) used to fit the artifact size constraint
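Magnitude pruning at 3-5% could look like the following; per-tensor thresholding is my assumption, as the record does not say whether the threshold is global or per-tensor:

```python
import numpy as np

def magnitude_prune(w, prune_percent):
    """Zero the smallest prune_percent% of entries by absolute value."""
    threshold = np.percentile(np.abs(w), prune_percent)
    return np.where(np.abs(w) < threshold, 0.0, w)
```

Zeroed weights compress well under zstd, which is presumably how pruning helps meet the 16 MB artifact limit.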

Novel Contributions

  • Improved the TTT evaluation strategy: every token is scored before being trained on, in every epoch
  • Extended TTT from 3 epochs to 5
  • Switched the TTT learning rate schedule from flat to cosine decay
  • Applied LoRA-based test-time training to Q, V, and the LM head
  • Ran multiple seeds, plus a rerun with higher pruning to satisfy the 16 MB artifact limit