PR #598 (open)

Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT

by Christopher-Lee-McClendon
val_bpb
1.1334
Architecture
GEPA
Optimizer
Muon
Artifact Size
15.70 MB

Training Techniques

Quantization
mixed int6/int8
bits: null
scope: int6 per-row for attention projections and MLP weights; int8 per-tensor for layer norms, value embeddings, biases, embedding tables
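A minimal numpy sketch of the two quantizers described above: symmetric int6 with one scale per row for the large matrices, and symmetric int8 with a single scalar scale for the small sensitive tensors. The PR specifies the bit widths and granularity; the implementation details are assumptions (the GPTQ-lite clip search is covered separately under Late QAT).

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization (levels in [-31, 31])."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def quantize_int8_per_tensor(w):
    """Symmetric per-tensor int8 quantization with a single scalar scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-row scales let each attention/MLP row use its own dynamic range at 6 bits, while the cheap per-tensor scalar is enough for norms, biases, and embedding tables at 8 bits.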
Architecture
XSA
Cross-sequence attention on the last 4 layers, removing self-value bias via an orthogonal projection
parameters: {"layers":4}
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
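The PR gives no parameters for SmearGate; one plausible reading of "learned token-mixing gate on input embeddings" is a per-channel sigmoid gate that smears each token's embedding toward its predecessor's. The gate form and initialization below are assumptions, not the PR's code.

```python
import numpy as np

class SmearGate:
    """Hypothetical smear gate: mixes each input embedding with the previous
    token's embedding through a learned per-channel sigmoid gate."""
    def __init__(self, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        # negative init -> gate near 0, so the layer starts close to identity
        self.g = rng.normal(-2.0, 0.1, d_model).astype(np.float32)

    def __call__(self, x):                    # x: (seq, d_model)
        gate = 1.0 / (1.0 + np.exp(-self.g))  # sigmoid -> (0, 1)
        prev = np.roll(x, 1, axis=0)
        prev[0] = x[0]                        # first token has no predecessor
        return (1.0 - gate) * x + gate * prev
```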
BigramHash
Bigram hash embeddings with 2048 buckets and 128 dimensions for cheap bigram context
parameters: {"buckets":2048,"dimensions":128}
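The bucket count (2048) and dimension (128) are from the PR; the hash function itself is an assumed stand-in. A sketch of the idea, where each position looks up a learned vector keyed on its hashed preceding bigram:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # parameters from the PR

def bigram_hash(prev_tok, tok):
    # cheap multiplicative hash of the (previous, current) token pair;
    # the constant is an arbitrary choice for illustration
    return (prev_tok * 1000003 + tok) % BUCKETS

class BigramHashEmbedding:
    """Adds a learned embedding keyed on the hashed preceding bigram."""
    def __init__(self, rng=None):
        rng = rng or np.random.default_rng(0)
        self.table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM)).astype(np.float32)

    def __call__(self, tokens):
        ids = [bigram_hash(tokens[i - 1] if i > 0 else 0, tokens[i])
               for i in range(len(tokens))]
        return self.table[ids]        # (seq, DIM), added to the token embeddings
```

Hash collisions are the price of keeping the table at 2048 × 128 instead of vocab² entries, which is what makes the bigram context "cheap".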
Partial RoPE
Partial rotary positional embeddings on 16 of 64 dims with YARN scaling
parameters: {"dims":16,"total_dims":64,"train_seq_len":1024}
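A sketch of the partial rotation: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The split-half pairing convention is an assumption, and the YARN frequency rescaling for contexts beyond train_seq_len=1024 is omitted here.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotary embedding on the first rot_dims of the head dim; x: (seq, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

Leaving 48 of 64 dims unrotated gives the model position-free channels, while the rotated 16 carry relative-position information.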
MLP3x
3× expansion MLP with 1536 hidden units and ReLU² activation
parameters: {"hidden":1536,"activation":"ReLU²"}
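The hidden width (1536) and ReLU² activation are from the PR; a 3× expansion at 1536 implies d_model = 512, which is assumed below. A minimal sketch:

```python
import numpy as np

def relu2(x):
    """ReLU² activation: square of the positive part."""
    return np.maximum(x, 0.0) ** 2

class MLP3x:
    """3x-expansion MLP block: d_model -> 3*d_model -> d_model."""
    def __init__(self, d_model=512, hidden=1536, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_in = rng.normal(0, d_model ** -0.5, (d_model, hidden)).astype(np.float32)
        self.w_out = rng.normal(0, hidden ** -0.5, (hidden, d_model)).astype(np.float32)

    def __call__(self, x):                 # x: (..., d_model)
        return relu2(x @ self.w_in) @ self.w_out
```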
U-Net skip connections
Residual skip connections across layer pairs
parameters: null
LN depth scaling
LayerNorm scale adjusted by 1/sqrt(layer+1) for stable deep training
parameters: null
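The 1/sqrt(layer+1) factor is from the PR; whether it multiplies the LN gain or the LN output is not stated, so the sketch below assumes the latter. The effect is the same: deeper layers contribute progressively smaller residual updates.

```python
import numpy as np

def layernorm_depth_scaled(x, layer_index, eps=1e-5):
    """Plain LayerNorm whose output is damped by 1/sqrt(layer+1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mu) / np.sqrt(var + eps)
    return y / np.sqrt(layer_index + 1)   # layer 0 -> 1.0, layer 3 -> 0.5, ...
```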
Value embeddings
128-dimensional value embeddings on layers 9 and 10 with per-layer scale
parameters: {"layers":[9,10],"dimensions":128,"init_scale":0.1}
Late QAT
Quantization-aware training with GPTQ-lite clip search enabled at step 6476 when LR scale < 0.15
parameters: {"step_enabled":6476,"clip_candidates_per_row":5,"threshold":0.15}
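The clip-search half of Late QAT can be sketched as follows. The candidate count of 5 matches clip_candidates_per_row above; the candidate grid (0.7–1.0 of the row's absolute max) is an assumption about what "GPTQ-lite" shrinks over.

```python
import numpy as np

def best_clip_scale(row, bits=6, candidates=5):
    """Try a few clipped scales per row, keep the one with the lowest
    round-trip squared error. Clipping trades outlier precision for a
    finer grid on the bulk of the row."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    amax = max(float(np.abs(row).max()), 1e-12)
    best_scale, best_err = None, np.inf
    for c in np.linspace(0.7, 1.0, candidates):
        scale = c * amax / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        err = float(((q * scale - row) ** 2).sum())
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err
```

Since the full-range scale (c = 1.0) is always among the candidates, the search can never do worse than plain max-abs quantization.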
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Adam
weight_decay: 0.04
momentum: null
other_params: {"applied_to":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"frequency":"every step"}
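The decay (0.997) and every-step update are from the PR; the bookkeeping below is a generic EMA sketch.

```python
import numpy as np

class EMA:
    """Every-step exponential moving average of the weights."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v
```

The shadow weights, not the raw training weights, are what get quantized and shipped in the final artifact under this scheme.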
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"total_steps":7000,"type":"cosine anneal"}
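The step counts and anneal type are from the PR; the exact shape (flat at 1.0 until the warmdown starts, then cosine to zero) is an assumption. A sketch of the LR multiplier:

```python
import math

def lr_scale(step, warmdown_start=3500, total_steps=7000):
    """Flat LR for the first 3500 steps, then cosine-anneal to zero by step 7000."""
    if step < warmdown_start:
        return 1.0
    frac = min((step - warmdown_start) / (total_steps - warmdown_start), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * frac))
```

Note that Late QAT above keys off this same quantity, switching on once the scale falls below 0.15.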
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":10,"chunk_size_tokens":32768,"stride":64,"frozen_blocks":2,"trainable_params":22301260,"total_params":27030108}
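"Score-first" means each chunk is evaluated with the current weights before the model adapts on it, so no chunk's reported loss reflects updates computed from that chunk. A framework-agnostic sketch of the loop; the SGD(momentum=0.9, lr=0.002) step and the two frozen blocks are assumed to live inside train_step_fn:

```python
def score_first_ttt(chunks, score_fn, train_step_fn, epochs_per_chunk=10):
    """Score each chunk first, then run epochs_per_chunk adaptation passes on it.
    Returns the token-weighted mean of the pre-adaptation losses."""
    total_loss, total_tokens = 0.0, 0
    for tokens in chunks:                              # e.g. 32768-token chunks
        total_loss += score_fn(tokens) * len(tokens)   # evaluate first...
        total_tokens += len(tokens)
        for _ in range(epochs_per_chunk):              # ...then adapt
            train_step_fn(tokens)
    return total_loss / total_tokens
```

This ordering is what makes the protocol "legal": adaptation only ever helps on *later* chunks.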
Evaluation
stride-based eval
parameters: {"stride":64}
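The stride (64) is from the PR; the windowing below assumes the usual stride-eval layout, where a context window slides by `stride` and only the final `stride` tokens of each window are scored, so every token is predicted with near-full left context.

```python
def stride_eval_windows(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, score_end) triples covering every
    token exactly once. window=1024 matches train_seq_len; stride=64 is from
    the PR."""
    windows = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        context_start = max(0, end - window)   # clip context to the window
        windows.append((context_start, begin, end))
    return windows
```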
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Freezing first 2 blocks during TTT
parameters: {"frozen_blocks":2}

Novel Contributions

  • Extended training to 7000 steps, with a cosine-anneal warmdown from step 3500 to 7000 for better convergence
  • Mixed int6/int8 quantization scheme: int6 per-row GPTQ-lite quantization for the large QAT-trained weights, and int8 per-tensor scalar quantization for the smaller, quantization-sensitive tensors
  • GEPA architecture combining multiple techniques: ReLU² activation, cross-sequence attention (XSA), bigram hash embeddings, partial RoPE with YARN scaling, U-Net skip connections, value embeddings on deep layers, LN depth scaling, and late QAT
  • Legal score-first test-time training (TTT) protocol: SGD with momentum for 10 epochs per 32K-token chunk, with the first 2 blocks frozen, yielding a −0.0142 BPB improvement
  • 15.70 MB artifact size under the 16 MB limit, with 27M parameters, via mixed quantization and zstd-22 compression