PR #628
Non-record: 11L GEPA + 20k Steps + Pure Int6 + Legal TTT (val_bpb=1.0983): unlimited compute: 4×A100-40GB, ~2.8 hours
by Christopher-Lee-McClendon
val_bpb
1.0983
Architecture
11-layer GEPA Transformer variant
Optimizer
SGD with momentum
Artifact Size
14.29 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite clip search
bits: 6
scope: all model tensors including embeddings
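The card specifies per-row int6 quantization with a GPTQ-lite clip search (the contributions section below mentions 15 percentile candidates). A minimal sketch of one plausible reading, with symmetric per-row scales and a reconstruction-MSE search over percentile-based clip values; function and parameter names are assumptions:

```python
import numpy as np

def quantize_row_int6(row, clip):
    """Symmetric int6 quantization of one row against an absolute clip value."""
    levels = 2 ** (6 - 1) - 1                    # int6 -> integers in [-31, 31]
    scale = clip / levels if clip > 0 else 1.0
    q = np.clip(np.round(row / scale), -levels, levels)
    return q.astype(np.int8), scale

def clip_search_int6(weight, n_candidates=15):
    """Per-row clip search: try percentile-based clip candidates and keep
    the one minimizing reconstruction MSE (a GPTQ-lite-style proxy)."""
    out_q = np.empty_like(weight, dtype=np.int8)
    scales = np.empty(weight.shape[0])
    percentiles = np.linspace(99.0, 100.0, n_candidates)
    for i, row in enumerate(weight):
        best_err, best = np.inf, None
        for p in percentiles:
            clip = np.percentile(np.abs(row), p)
            q, s = quantize_row_int6(row, clip)
            err = np.mean((q * s - row) ** 2)
            if err < best_err:
                best_err, best = err, (q, s)
        out_q[i], scales[i] = best
    return out_q, scales
```

Storing only the int6 codes plus one float scale per row is what keeps the artifact small before zstd compression.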
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: null
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
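The card lists no parameters for SmearGate, so the exact form is not given. One common reading of a "learned token-mixing gate on input embeddings" is a sigmoid gate that mixes each token's embedding with the previous token's; the sketch below is that assumption, not the PR's confirmed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Assumed form: mix each token embedding with the previous token's
    embedding via a learned per-channel sigmoid gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))   # per-channel gate logits

    def forward(self, x):                            # x: (batch, seq, dim)
        # shift right by one token; position 0 has no predecessor
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
        g = torch.sigmoid(self.gate)                 # in (0, 1)
        return x + g * prev
```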
BigramHash
2048 buckets with 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
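The card gives only the bucket count and embedding width for BigramHash. A minimal sketch: hash each (previous token, current token) pair into one of 2048 buckets and look up a 128-dim embedding. The multiplicative hash mix is my assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramHash(nn.Module):
    """Hash each bigram (prev token, current token) into a fixed number of
    buckets and look up a learned embedding for the bucket."""
    def __init__(self, buckets=2048, embedding_dim=128):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, embedding_dim)

    def forward(self, tokens):                       # tokens: (batch, seq) int64
        prev = F.pad(tokens, (1, 0))[:, :-1]         # shift right; pad pos 0 with 0
        h = (prev * 1000003 + tokens) % self.buckets  # simple mixing hash (assumed)
        return self.emb(h)                           # (batch, seq, embedding_dim)
```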
RoPE
Partial RoPE with 16/64 dims and YARN scaling
parameters: {"partial_dims":"16/64","train_seq_length":2048}
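Partial RoPE with 16/64 dims means only the first 16 of each 64-dim head are rotated and the rest pass through unchanged. A sketch of that split (the YARN frequency rescaling is omitted here for brevity):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` of the head
    dimension; the remaining dims are passed through unrotated."""
    b, t, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()                  # (t, half) each
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rot, x[..., rot_dims:]], dim=-1)
```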
MLP
3× expansion with ReLU² activation
parameters: {"expansion_factor":3,"hidden_dim":1536,"activation":"ReLU²"}
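The MLP parameters (expansion 3, hidden width 1536) imply a model width of 512; that inference and the lack of bias handling details make this a sketch rather than the exact module:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """3x-expansion MLP with ReLU^2 activation; d_model=512 is inferred
    from hidden_dim=1536 in the card."""
    def __init__(self, d_model=512, expansion=3):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)
        self.down = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        h = torch.relu(self.up(x)) ** 2      # ReLU squared
        return self.down(h)
```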
Value Embeddings
128d on layers 9–10 with per-layer scale initialized at 0.1
parameters: {"dimension":128,"layers":[9,10],"init_scale":0.1}
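The card gives only dimension, target layers, and scale init for the value embeddings. One plausible form, used in some recent speedrun-style models, adds a token-indexed embedding into a layer's attention values through a learned per-layer scale; the mixing point here is an assumption:

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Token-indexed value embedding mixed into a layer's attention values
    via a learned per-layer scale initialised at 0.1 (mixing point assumed)."""
    def __init__(self, vocab_size, dim=128, init_scale=0.1):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, values, tokens):   # values: (b, t, dim); tokens: (b, t)
        return values + self.scale * self.emb(tokens)
```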
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
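One plausible reading of "LayerNorm scale with 1/sqrt(layer+1) depth scaling" is initialising each layer's LayerNorm gain to 1/sqrt(layer+1), damping deeper layers at the start of training:

```python
import torch
import torch.nn as nn

def depth_scaled_layernorm(dim, layer_index):
    """LayerNorm whose learnable scale is initialised to 1/sqrt(layer+1)
    (one plausible reading of the card's depth scaling)."""
    ln = nn.LayerNorm(dim)
    with torch.no_grad():
        ln.weight.fill_(1.0 / (layer_index + 1) ** 0.5)
    return ln
```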
U-Net skips
Residual connections across layer pairs
parameters: null
Tied Embeddings
Weight tying of embeddings
parameters: null
Optimizer
SGD
weight_decay: 0.04
momentum: 0.9
other_params: {"learning_rate":0.002,"lr_schedule":"cosine decay with 5% warmup"}
Weight Averaging
EMA
parameters: {"decay":0.997}
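The EMA update with decay 0.997 is the standard exponential moving average of weights; a minimal sketch:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step over paired tensors: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

The EMA copy, not the raw weights, is typically what gets quantized and shipped.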
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"chunk_size":32768,"epochs_per_chunk":10}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size":32768,"stride":64,"frozen_blocks":2,"gradient_clip":1,"lr_warmup_percent":5}
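Score-first TTT means each chunk is scored with the current weights before the model adapts on it, so the evaluation never sees weights fitted to the chunk being scored. A sketch of that loop using the listed hyperparameters; it assumes the model exposes a `blocks` list and that `loss_fn(model, chunk)` returns the scoring loss (both are my naming assumptions):

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=0.002, momentum=0.9,
                    epochs_per_chunk=10, frozen_blocks=2, grad_clip=1.0):
    """Score-first test-time training: score each chunk, then adapt on it."""
    # freeze the first `frozen_blocks` transformer blocks
    for block in model.blocks[:frozen_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)
    total_loss = 0.0
    for chunk in chunks:
        with torch.no_grad():                 # 1) score first, legally
            total_loss += loss_fn(model, chunk).item()
        for _ in range(epochs_per_chunk):     # 2) then adapt on the chunk
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            torch.nn.utils.clip_grad_norm_(trainable, grad_clip)
            opt.step()
    return total_loss / max(len(chunks), 1)
```

The 5% LR warmup listed in the parameters is omitted here to keep the sketch short.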
LR Schedule
warmdown
parameters: {"warmdown_start_step":12000,"warmdown_steps":8000,"type":"cosine anneal"}
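Combining the schedule parameters here with the 5% warmup listed under Optimizer gives a piecewise schedule: linear warmup, flat at peak LR until step 12000, then cosine anneal to zero over 8000 steps. A sketch:

```python
import math

def lr_with_warmdown(step, peak_lr=0.002, total_steps=20000,
                     warmup_frac=0.05, warmdown_start=12000, warmdown_steps=8000):
    """Linear warmup -> flat peak -> cosine-anneal warmdown to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup            # linear warmup
    if step < warmdown_start:
        return peak_lr                            # flat plateau at peak LR
    t = min((step - warmdown_start) / warmdown_steps, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))  # cosine anneal
```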
Regularization
weight decay
parameters: {"weight_decay":0.04}
freeze early layers during TTT
parameters: {"frozen_blocks":2,"total_blocks":11}
Novel Contributions
- Demonstrated that warmdown is a first-class training variable delivering the majority of gains after the peak-LR plateau, with an 8000-step warmdown driving the float (pre-quantization) base BPB from ~1.216 to 1.1153.
- Achieved smallest artifact size (14.29 MB) with pure int6 per-row quantization combined with GPTQ-lite clip search over 15 percentile candidates and zstd-22 compression.
- Showed that SGD with momentum outperforms AdamW for legal score-first test-time training (TTT), delivering 2.4× the TTT gain on the same base model.
- Identified freezing early layers during TTT as active regularization that improves adaptation, not merely a defense against catastrophic forgetting.
- Found that as base-model quality improves, the relative contribution of TTT to the final gain shrinks, arguing for investing in base-model training once the right TTT regime is chosen.