PR #668
Non-record: 11L GEPA + 30k Steps + Pure Int6 + Legal TTT (val_bpb=1.0920)
by Christopher-Lee-McClendon
val_bpb: 1.0920
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.40 MB
Training Techniques
Architecture
GEPA
11-layer transformer architecture with GEPA-related modifications
parameters: {"layers":11}
BigramHash
BigramHash embeddings used in the model
parameters: {"size":2048,"dim":128}
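The exact hashing scheme isn't spelled out in the submission; a minimal sketch, assuming each (previous token, current token) pair is hashed into one of the 2048 buckets, each of which would index a learned 128-dim embedding added to the token embedding (the sentinel previous token of -1 at position 0 and the use of blake2b as a stable hash are my assumptions):

```python
import hashlib

BIGRAM_BUCKETS = 2048   # "size" from the submission
BIGRAM_DIM = 128        # "dim" from the submission (embedding width, unused below)

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = BIGRAM_BUCKETS) -> int:
    """Hash a (previous, current) token pair into a fixed bucket.

    A stable hash (not Python's per-process randomized hash()) keeps the
    mapping identical across runs, which a learned embedding table requires.
    """
    key = f"{prev_tok},{tok}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % n_buckets

def bigram_features(token_ids):
    """One bucket id per position; position 0 pairs with a sentinel token -1."""
    prev = -1
    buckets = []
    for t in token_ids:
        buckets.append(bigram_bucket(prev, t))
        prev = t
    return buckets
```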
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"dimensions":16}
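With `dimensions: 16`, only the first 16 channels of each head get the rotary treatment and the rest pass through untouched. A sketch of that partial application on a single head vector (the standard RoPE base of 10000 is an assumption):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` channels of a
    head vector `x`, rotating consecutive pairs by a position-dependent angle;
    channels beyond `rot_dims` are returned unchanged."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each 2-d rotation is norm-preserving, the rotated pairs keep their length while encoding relative position.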
SmearGate
SmearGate activation/gating mechanism
parameters: null
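No parameters are given for SmearGate, so this is only a guess at the mechanism: a common "smearing" formulation in speedrun-style models mixes each position's embedding with the previous position's through a learned sigmoid gate. The scalar gate and the exact mixing form below are assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear(embeddings, gate_logit=0.0):
    """Hypothetical smear gate: out[t] = x[t] + g * x[t-1], g = sigmoid(gate_logit).
    The real gate may be per-channel or per-layer; this shows only the shape
    of the idea, not the submission's actual mechanism."""
    g = sigmoid(gate_logit)
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        out.append([a + g * b for a, b in zip(embeddings[t], embeddings[t - 1])])
    return out
```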
weight tying
Tied input-embedding / decoder weight sharing, implied by the tied-embed LR setting in the optimizer config
parameters: null
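For concreteness, weight tying means a single (vocab, dim) matrix serves as both the input embedding and the output decoder, which is why a `lr_tied_embed` and a `decoder_lr_multiplier` can both act on the same tensor. A minimal sketch (the class and names are illustrative, not from the submission):

```python
class TiedLM:
    """Minimal weight tying: one shared matrix W of shape (vocab, dim) is used
    for both the input embedding (row lookup) and the output decoder
    (dot product of each vocab row against the hidden state)."""

    def __init__(self, W):
        self.W = W  # shared by embed and decode

    def embed(self, token_id):
        return self.W[token_id]

    def logits(self, h):
        # Decoder side: score every vocabulary row against hidden state h.
        return [sum(wi * hi for wi, hi in zip(row, h)) for row in self.W]
```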
Quantization
GPTQ-lite
bits: 6
scope: all weights including embeddings
int6
bits: 6
scope: per-row, including embeddings
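A sketch of per-row int6 quantization with a small clip search, which is my reading of "GPTQ-lite clip search": try a few clipping fractions of each row's max-abs value and keep whichever scale minimizes reconstruction MSE. The symmetric code range and the candidate ratios are assumptions:

```python
def quantize_row_int6(row, clip_ratios=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Per-row symmetric int6 quantization with a clip search.

    For each candidate clip ratio r, the scale maps r * max|row| to the top
    code; values are rounded, clamped to [-31, 31], dequantized, and the
    ratio with the lowest mean-squared reconstruction error wins.
    """
    qmax = 31  # symmetric int6 codes
    max_abs = max(abs(v) for v in row) or 1.0
    best = None
    for r in clip_ratios:
        scale = (r * max_abs) / qmax
        q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
        deq = [qi * scale for qi in q]
        mse = sum((a - b) ** 2 for a, b in zip(row, deq)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    _, q, scale = best
    return q, scale
```

Storing one scale per row plus 6-bit codes for every weight, embeddings included, is what gets the artifact down to the 13.40 MB figure above.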
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_warmup_end":0.99,"momentum_warmup_steps":1500,"lr_matrix":0.025,"lr_tied_embed":0.035,"decoder_lr_multiplier":2}
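The config ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. The endpoints and step count come from the parameters above; the linear shape of the ramp is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp linearly from `start` to `end` over
    `warmup_steps`, then hold at `end` for the rest of training."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```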
Weight Averaging
EMA
parameters: {"decay":0.997}
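EMA weight averaging keeps a slow-moving shadow copy of the weights; with decay 0.997 each step moves the shadow 0.3% toward the current weights. A one-step sketch (evaluating or quantizing the EMA copy rather than the raw weights is the usual practice, not something the card states):

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step, elementwise: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```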
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs":10,"tokens_per_chunk":32768,"freeze_first_blocks":2}
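A sketch of the test-time training loop implied by those parameters: plain SGD with momentum over the evaluation chunk, with the first blocks excluded from updates. The toy parameter indexing stands in for real tensors, and reading "score-first" as "the chunk is scored before any update touches the weights" is my interpretation:

```python
def ttt_sgd(params, grad_fn, lr=0.002, momentum=0.9, epochs=10, frozen=frozenset()):
    """Test-time SGD with momentum. `grad_fn(params)` returns one gradient per
    parameter; indices in `frozen` (here standing in for the first 2 frozen
    transformer blocks) receive no test-time updates."""
    velocity = [0.0] * len(params)
    p = list(params)
    for _ in range(epochs):
        g = grad_fn(p)
        for i in range(len(p)):
            if i in frozen:
                continue  # frozen blocks are left at their trained values
            velocity[i] = momentum * velocity[i] + g[i]
            p[i] -= lr * velocity[i]
    return p
```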
LR Schedule
warmdown
parameters: {"warmdown_steps":18000,"warmdown_ratio":0.6,"peak_lr_steps":12000,"warmup_steps":20}
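Those parameters describe a 30k-step run: a 20-step warmup, peak LR through step 12000, then an 18000-step warmdown, i.e. the 60% warmdown ratio cited below. A multiplier function under those numbers (linear warmup/warmdown shapes and a final LR of 0 are assumptions):

```python
def lr_multiplier(step, warmup_steps=20, peak_lr_steps=12000, warmdown_steps=18000):
    """LR schedule multiplier: linear warmup, hold at peak, linear warmdown.
    With these defaults the warmdown spans 18000 of 30000 total steps (60%)."""
    total = peak_lr_steps + warmdown_steps
    if step < warmup_steps:
        return step / warmup_steps
    if step < peak_lr_steps:
        return 1.0
    return max(0.0, (total - step) / warmdown_steps)
```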
Regularization
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"clip_norm":0.3}
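Reading `clip_norm` as global-norm clipping (the usual convention, though the card doesn't say): if the L2 norm over all gradients exceeds 0.3, every gradient is rescaled by the same factor so the total norm equals 0.3. A flat-vector sketch:

```python
import math

def clip_grad_norm(grads, clip_norm=0.3):
    """Global-norm clipping: rescale all gradients uniformly when their
    combined L2 norm exceeds `clip_norm`; otherwise leave them untouched."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= clip_norm:
        return grads
    scale = clip_norm / total
    return [g * scale for g in grads]
```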
Initialization
OrthoInit
Referenced as one of the prior techniques this submission builds on
Novel Contributions
- 11-layer GEPA architecture trained for 30k steps
- Pure int6 per-row quantization with GPTQ-lite clip search
- Legal score-first TTT using SGD with momentum
- 60% warmdown ratio to reduce quantization gap
- Smallest artifact in the author's series at 13.40 MB
- Includes model artifact for reproducibility