PR #612
Non-record: 11L GEPA + 12k Steps + Pure Int6 + Legal TTT (val_bpb=1.1079)
by Christopher-Lee-McClendon
val_bpb
1.1079
Architecture
GEPA
Optimizer
SGD
Artifact Size
14.79 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite
bits: 6
scope: all
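A minimal sketch of symmetric per-row int6 quantization with a clip-ratio search, in the spirit of the "GPTQ-lite" search described here. The 0.5–1.0 candidate grid and the plain-MSE objective are assumptions; the PR only states 6 bits, per-row scaling, and 15 clip candidates.

```python
def quantize_row_int6(row, n_candidates=15):
    """Symmetric per-row int6 quantization with a clip-ratio search
    (GPTQ-lite-style sketch; the candidate grid and MSE objective
    are assumptions, not the PR's exact recipe)."""
    qmax = 2 ** (6 - 1) - 1  # 31 for signed int6
    amax = max(abs(v) for v in row) or 1.0
    best = None
    # Try 15 clip ratios from 0.5 to 1.0 and keep the lowest-MSE one.
    for i in range(n_candidates):
        clip = 0.5 + 0.5 * i / (n_candidates - 1)
        scale = clip * amax / qmax
        q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
        err = sum((v - qi * scale) ** 2 for v, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

Per-row scaling means each weight-matrix row gets its own scale, so one outlier only degrades its own row.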
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: null
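The PR does not spell out the XSA mechanism. One plausible reading — queries in each sequence attending over keys/values pooled across all sequences in the batch — can be sketched as single-head attention; treat everything here as an assumption:

```python
import math

def cross_sequence_attention(qs, ks, vs):
    """Single-head attention where every query attends over keys/values
    pooled across ALL sequences in the batch (one reading of
    'cross-sequence attention'; the real mechanism may differ).
    qs, ks, vs: [n_seq][seq_len][dim] nested lists."""
    d = len(qs[0][0])
    flat_k = [k for seq in ks for k in seq]  # pool across sequences
    flat_v = [v for seq in vs for v in seq]
    out = []
    for seq in qs:
        seq_out = []
        for q in seq:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in flat_k]
            m = max(scores)                      # stabilize softmax
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            seq_out.append([sum(wi * v[j] for wi, v in zip(w, flat_v)) / z
                            for j in range(d)])
        out.append(seq_out)
    return out
```

Restricting this to the last 4 layers keeps most of the network purely within-sequence.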
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
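A hedged sketch of the smear gate: each input embedding is mixed with the previous token's embedding through a learned sigmoid gate. A single scalar gate is assumed here; the PR may use per-dimension gating.

```python
import math

def smear_gate(embeddings, gate_logit):
    """Mix each token's embedding with the previous token's embedding,
    weighted by a learned gate (scalar gate is an assumption).
    embeddings: [seq_len][dim] nested lists."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid of learned logit
    out = [embeddings[0][:]]                 # first token: no predecessor
    for t in range(1, len(embeddings)):
        out.append([x + g * p
                    for x, p in zip(embeddings[t], embeddings[t - 1])])
    return out
```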
BigramHash
2048 buckets, 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
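The bigram-hash feature maps each (previous token, current token) pair to one of 2048 buckets, each backed by a learned 128-dim embedding. A sketch of the bucket lookup, with illustrative mixing constants (the PR's hash function is not specified):

```python
def bigram_bucket(prev_tok, tok, n_buckets=2048):
    """Hash a (prev, current) token-id pair into one of 2048 buckets.
    The multiplicative constants are illustrative, not the PR's."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % n_buckets
```

At 2048 buckets × 128 dims this adds only ~262k parameters while giving the model a cheap bigram prior.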
Partial RoPE
Partial rotary positional embeddings with YARN scaling
parameters: {"dims":"16/64","train_seq":2048}
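Partial RoPE rotates only the first 16 of each head's 64 dimensions, leaving the rest position-free. A minimal sketch of that split; the YARN frequency rescaling (used when extrapolating past the 2048-token training length) is omitted here:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of
    a per-head vector, leaving the remainder untouched (the 16/64 split
    from the PR; YARN rescaling omitted in this sketch)."""
    out = x[:]
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s       # 2-D rotation of the pair
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```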
MLP3x
3× expansion with ReLU² activation
parameters: {"hidden_dim":1536}
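The feed-forward block expands 3× and squares the ReLU output. A toy-dimension sketch (the real hidden width is 1536, and bias terms are assumed absent):

```python
def mlp3x(x, w_in, w_out):
    """Feed-forward block with 3x expansion and ReLU^2 activation.
    w_in: 3d columns of length d; w_out: d columns of length 3d."""
    hidden = [sum(xi * w for xi, w in zip(x, col)) for col in w_in]  # d -> 3d
    hidden = [max(0.0, h) ** 2 for h in hidden]                      # ReLU squared
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in w_out]  # 3d -> d
```

ReLU² keeps the cheap sparsity of ReLU while giving a smoother, faster-growing positive branch.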
tied embeddings
Tied input and output embeddings
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"epochs_per_chunk":10,"gradient_clip":1,"freeze_first_blocks":2}
Weight Averaging
EMA
parameters: {"decay":0.997}
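The EMA with decay 0.997 maintains a shadow copy of the weights updated after every step, a sketch of which is:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over flattened weights: avg <- decay*avg + (1-decay)*params.
    The averaged copy (not the raw weights) is what gets evaluated."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```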
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size_tokens":32768,"stride_tokens":64,"frozen_blocks":2,"gradient_clip":1,"total_chunks":1893}
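"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before any gradient step uses that chunk, so no token is ever evaluated by weights trained on it. A sketch of the loop with the PR's SGD settings; `grad_fn`/`score_fn` are stand-ins for the real model, and the frozen-first-2-blocks masking is not shown:

```python
def score_first_ttt(chunks, params, grad_fn, score_fn,
                    lr=0.002, momentum=0.9, epochs=10, clip=1.0):
    """Score-first test-time training sketch: score each chunk BEFORE
    adapting on it (PR settings: SGD, lr=0.002, momentum=0.9,
    10 epochs/chunk, global-norm grad clip 1.0)."""
    velocity = [0.0] * len(params)
    total_score = 0.0
    for chunk in chunks:
        total_score += score_fn(params, chunk)   # score with current weights
        for _ in range(epochs):                  # then adapt on the chunk
            grad = grad_fn(params, chunk)
            norm = sum(g * g for g in grad) ** 0.5
            if norm > clip:                      # global-norm gradient clip
                grad = [g * clip / norm for g in grad]
            velocity = [momentum * v + g for v, g in zip(velocity, grad)]
            params = [p - lr * v for p, v in zip(params, velocity)]
    return total_score, params
```

The momentum buffer persists across chunks, which matters at 1893 chunks: later chunks start from an already-warm velocity.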
LR Schedule
cosine decay with linear warmup
parameters: {"warmup_steps":20,"warmdown_start_step":7000,"total_steps":12000}
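Reading the schedule parameters literally — 20-step linear warmup, flat until step 7000, then cosine warmdown over the final 5000 steps — gives this sketch (the flat middle is an inference from `warmdown_start_step`, not stated outright):

```python
import math

def lr_at(step, peak_lr=0.002, warmup=20, warmdown_start=7000, total=12000):
    """Linear warmup -> flat -> cosine warmdown to zero,
    shaped from the PR's schedule parameters."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup        # linear warmup
    if step < warmdown_start:
        return peak_lr                              # flat middle
    frac = (step - warmdown_start) / (total - warmdown_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))  # cosine to 0
```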
Regularization
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- 12k-step training schedule with a 5k-step warmdown, exploiting the unlimited-compute track
- Pure int6 per-row quantization with a 15-candidate GPTQ-lite clip search
- Legal, score-first test-time training (TTT) with SGD momentum and learning-rate warmup