PR #644
Non-record: 11L GEPA + 25k Steps + Pure Int6 + Legal TTT (val_bpb=1.0944) - unlimited compute category
by Christopher-Lee-McClendon
val_bpb
1.0944
Architecture
GEPA (11-layer Transformer variant)
Optimizer
Muon (matrix LR), Adam (scalar LR), SGD (TTT)
Artifact Size
13.83 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite clip search
bits: 6
scope: all model tensors including embeddings
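The card specifies int6 per-row quantization with a 15-candidate GPTQ-lite clip search. A minimal sketch of what that search could look like, assuming a symmetric signed range and a linear grid of clip thresholds (both assumptions; the PR does not state the candidate grid):

```python
def quantize_int6_row(row, n_candidates=15):
    """Hypothetical per-row int6 quantization with a GPTQ-lite style clip
    search: try 15 candidate clip thresholds (fractions of the row's max-abs
    value) and keep the scale giving the lowest reconstruction MSE."""
    qmax = 31  # symmetric signed int6: quantized values clamped to [-31, 31]
    amax = max(abs(x) for x in row) or 1e-12
    best_mse, best_q, best_scale = float("inf"), None, None
    for i in range(n_candidates):
        clip = amax * (1.0 - 0.04 * i)  # illustrative candidate grid
        scale = clip / qmax
        q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
        mse = sum((x - v * scale) ** 2 for x, v in zip(row, q)) / len(row)
        if mse < best_mse:
            best_mse, best_q, best_scale = mse, q, scale
    return best_q, best_scale
```

Per-row scales mean each weight row stores one float scale plus 6-bit codes, which is what zstd-22 then compresses into the 13.83 MB artifact.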
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: {"layers":4}
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
BigramHash
2048 buckets with 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
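The bucket lookup for BigramHash could be as simple as the following sketch; the mixing constants are illustrative, not taken from the PR — only the 2048 buckets and 128-dim embeddings are stated:

```python
def bigram_bucket(prev_id, cur_id, n_buckets=2048):
    """Hypothetical BigramHash lookup: mix the previous and current token
    ids into one of 2048 buckets, each indexing a learned 128-dim
    embedding that is added to the token's input representation."""
    h = (prev_id * 1000003 + cur_id) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % n_buckets
```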
Partial RoPE
Rotary positional embeddings on 16/64 dims with YARN scaling
parameters: {"dims":"16/64","train_seq":2048}
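Partial RoPE here means rotating only 16 of the 64 dims per head and passing the rest through unrotated. A sketch of the rotation on one head vector; YARN scaling (which rescales per-frequency angles for longer contexts) is omitted for brevity, and the base of 10000 is an assumption:

```python
import math

def partial_rope(head_vec, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` dims of a 64-dim head vector
    (16/64 per the card); remaining dims pass through unchanged."""
    out = list(head_vec)
    for i in range(rope_dims // 2):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = head_vec[2 * i], head_vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Leaving 48 of 64 dims position-free keeps most of the head's capacity for content matching while still encoding relative position.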
MLP3x
MLP with 3× expansion and ReLU² activation
parameters: {"expansion_factor":3,"hidden_dim":1536,"activation":"ReLU²"}
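The MLP's ReLU² activation is just the ReLU output squared. A one-liner sketch (note hidden_dim 1536 with 3× expansion implies d_model 512, an inference from the card rather than a stated value):

```python
def relu_sq(v):
    """ReLU^2 activation used in the 3x-expansion MLP:
    zero out negatives, then square."""
    return [max(0.0, x) ** 2 for x in v]
```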
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
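The depth-dependent LayerNorm scale stated in the card is 1/sqrt(layer+1), which damps the contribution of deeper layers. A sketch, assuming 0-indexed layers:

```python
import math

def ln_depth_scale(layer_idx):
    """Depth-dependent LayerNorm scale from the card: 1/sqrt(layer + 1),
    assuming layer_idx is 0-indexed."""
    return 1.0 / math.sqrt(layer_idx + 1)
```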
Tied Embeddings
Input and output embeddings are tied
parameters: null
Optimizer
Muon and Adam for training; SGD with momentum for TTT
weight_decay: 0.04
momentum: 0.9
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.997}
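EMA weight averaging with the card's decay of 0.997 is the standard per-step shadow update; a minimal sketch over flat parameter lists:

```python
def ema_update(ema_params, params, decay=0.997):
    """Standard EMA weight averaging with the card's decay 0.997:
    shadow <- decay * shadow + (1 - decay) * param, applied each step.
    The EMA weights, not the raw weights, are what get quantized/saved."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```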
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size":32768,"stride":64,"frozen_blocks":2,"gradient_clip":1}
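"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before any SGD update on that chunk, so no token's loss benefits from training on itself. An outline of the control flow, where `score_fn` and `train_fn` are hypothetical stand-ins for the model's eval pass and its SGD-with-momentum adaptation (10 epochs per chunk, first 2 blocks frozen, per the card):

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_size=32768):
    """Legal score-first TTT loop: evaluate each chunk BEFORE adapting
    on it, then carry the adapted weights forward to the next chunk."""
    total, count = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total += score_fn(chunk) * len(chunk)  # score with current weights first
        count += len(chunk)
        train_fn(chunk)                        # then adapt weights on that chunk
    return total / count  # mean per-token score (e.g. BPB)
```

Later chunks thus benefit from adaptation on earlier ones, which is where the −0.014 BPB gain comes from without ever peeking at a chunk before scoring it.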
LR Schedule
cosine warmdown with linear warmup
parameters: {"warmup_steps":20,"peak_lr_steps":12000,"warmdown_steps":13000}
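The schedule parameters add up to the 25k total steps: 20 warmup steps, constant peak until step 12,000, then a 13,000-step cosine warmdown to zero. A sketch, using the matrix LR of 0.025 as the example peak (the same shape would apply per parameter group):

```python
import math

def lr_at(step, peak_lr=0.025, warmup_steps=20,
          peak_lr_steps=12000, warmdown_steps=13000):
    """Schedule implied by the card: linear warmup for 20 steps, hold at
    peak until step 12000, cosine decay to 0 by step 25000."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < peak_lr_steps:
        return peak_lr
    t = min(1.0, (step - peak_lr_steps) / warmdown_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```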
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay and layerwise LN scale
parameters: {"weight_decay":0.04,"LN_scale":"1/sqrt(layer+1)"}
Novel Contributions
- Extended training to 25,000 steps with a 13,000-step cosine warmdown phase, demonstrating that BPB improvement accelerates during warmdown.
- Confirmed a consistent scaling law: float-base BPB, TTT BPB, and artifact size all improve monotonically with training steps.
- Observed that the TTT gain shrinks as the float base improves, suggesting diminishing returns from test-time training on better-trained models.
- Applied pure int6 per-row quantization with 15-candidate GPTQ-lite clip search combined with zstd-22 compression to achieve the smallest artifact size in the series.
- Implemented legal score-first test-time training using SGD with momentum, with the first two blocks frozen, achieving a −0.014 BPB gain.
- Introduced architecture modifications including cross-sequence attention on the last 4 layers, a SmearGate token-mixing gate, BigramHash embeddings, partial RoPE with YARN scaling, and layerwise LN scaling.
- Demonstrated that fine-grained optimization at low learning rates during warmdown is disproportionately effective.