PR #598 (open)

Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT

by Christopher-Lee-McClendon
val_bpb
1.1334
Architecture
GEPA
Optimizer
Muon
Artifact Size
15.70 MB

Training Techniques

Quantization
mixed int6/int8
bits: null
scope: int6 per-row for attention projections and MLP weights; int8 per-tensor for layer norms, value embeddings, biases, embedding tables
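A minimal numpy sketch of the two quantizers described above: symmetric int6 with one scale per row for the large matrices, and symmetric int8 with a single scalar scale for the small sensitive tensors. The PR specifies the bit widths and granularity; the implementation details are assumptions (the GPTQ-lite clip search is covered separately under Late QAT).

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization (levels in [-31, 31])."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def quantize_int8_per_tensor(w):
    """Symmetric per-tensor int8 quantization with a single scalar scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Per-row scales let each attention/MLP row use its own dynamic range at 6 bits, while the cheap per-tensor scalar is enough for norms, biases, and embedding tables at 8 bits.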
Architecture
XSA
Cross-sequence attention on the last 4 layers, removing self-value bias via an orthogonal projection
parameters: {"layers":4}
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
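The PR gives no parameters for SmearGate; one plausible reading of "learned token-mixing gate on input embeddings" is a per-channel sigmoid gate that smears each token's embedding toward its predecessor's. The gate form and initialization below are assumptions, not the PR's code.

```python
import numpy as np

class SmearGate:
    """Hypothetical smear gate: mixes each input embedding with the previous
    token's embedding through a learned per-channel sigmoid gate."""
    def __init__(self, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        # negative init -> gate near 0, so the layer starts close to identity
        self.g = rng.normal(-2.0, 0.1, d_model).astype(np.float32)

    def __call__(self, x):                    # x: (seq, d_model)
        gate = 1.0 / (1.0 + np.exp(-self.g))  # sigmoid -> (0, 1)
        prev = np.roll(x, 1, axis=0)
        prev[0] = x[0]                        # first token has no predecessor
        return (1.0 - gate) * x + gate * prev
```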
BigramHash
Bigram hash embeddings with 2048 buckets and 128 dimensions for cheap bigram context
parameters: {"buckets":2048,"dimensions":128}
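The bucket count (2048) and dimension (128) are from the PR; the hash function itself is an assumed stand-in. A sketch of the idea, where each position looks up a learned vector keyed on its hashed preceding bigram:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # parameters from the PR

def bigram_hash(prev_tok, tok):
    # cheap multiplicative hash of the (previous, current) token pair;
    # the constant is an arbitrary choice for illustration
    return (prev_tok * 1000003 + tok) % BUCKETS

class BigramHashEmbedding:
    """Adds a learned embedding keyed on the hashed preceding bigram."""
    def __init__(self, rng=None):
        rng = rng or np.random.default_rng(0)
        self.table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM)).astype(np.float32)

    def __call__(self, tokens):
        ids = [bigram_hash(tokens[i - 1] if i > 0 else 0, tokens[i])
               for i in range(len(tokens))]
        return self.table[ids]        # (seq, DIM), added to the token embeddings
```

Hash collisions are the price of keeping the table at 2048 × 128 instead of vocab² entries, which is what makes the bigram context "cheap".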
Partial RoPE
Partial rotary positional embeddings on 16 of 64 dims with YARN scaling
parameters: {"dims":16,"total_dims":64,"train_seq_len":1024}
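A sketch of the partial rotation: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The split-half pairing convention is an assumption, and the YARN frequency rescaling for contexts beyond train_seq_len=1024 is omitted here.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotary embedding on the first rot_dims of the head dim; x: (seq, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

Leaving 48 of 64 dims unrotated gives the model position-free channels, while the rotated 16 carry relative-position information.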
MLP3x
3× expansion MLP with 1536 hidden units and ReLU² activation
parameters: {"hidden":1536,"activation":"ReLU²"}
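The hidden width (1536) and ReLU² activation are from the PR; a 3× expansion at 1536 implies d_model = 512, which is assumed below. A minimal sketch:

```python
import numpy as np

def relu2(x):
    """ReLU² activation: square of the positive part."""
    return np.maximum(x, 0.0) ** 2

class MLP3x:
    """3x-expansion MLP block: d_model -> 3*d_model -> d_model."""
    def __init__(self, d_model=512, hidden=1536, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_in = rng.normal(0, d_model ** -0.5, (d_model, hidden)).astype(np.float32)
        self.w_out = rng.normal(0, hidden ** -0.5, (hidden, d_model)).astype(np.float32)

    def __call__(self, x):                 # x: (..., d_model)
        return relu2(x @ self.w_in) @ self.w_out
```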
U-Net skip connections
Residual skip connections across layer pairs
parameters: null
LN depth scaling
LayerNorm scale adjusted by 1/sqrt(layer+1) for stable deep training
parameters: null
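The 1/sqrt(layer+1) factor is from the PR; whether it multiplies the LN gain or the LN output is not stated, so the sketch below assumes the latter. The effect is the same: deeper layers contribute progressively smaller residual updates.

```python
import numpy as np

def layernorm_depth_scaled(x, layer_index, eps=1e-5):
    """Plain LayerNorm whose output is damped by 1/sqrt(layer+1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mu) / np.sqrt(var + eps)
    return y / np.sqrt(layer_index + 1)   # layer 0 -> 1.0, layer 3 -> 0.5, ...
```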
Value embeddings
128-dimensional value embeddings on layers 9 and 10 with per-layer scale
parameters: {"layers":[9,10],"dimensions":128,"init_scale":0.1}
Late QAT
Quantization-aware training with GPTQ-lite clip search enabled at step 6476 when LR scale < 0.15
parameters: {"step_enabled":6476,"clip_candidates_per_row":5,"threshold":0.15}
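The clip-search half of Late QAT can be sketched as follows. The candidate count of 5 matches clip_candidates_per_row above; the candidate grid (0.7–1.0 of the row's absolute max) is an assumption about what "GPTQ-lite" shrinks over.

```python
import numpy as np

def best_clip_scale(row, bits=6, candidates=5):
    """Try a few clipped scales per row, keep the one with the lowest
    round-trip squared error. Clipping trades outlier precision for a
    finer grid on the bulk of the row."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    amax = max(float(np.abs(row).max()), 1e-12)
    best_scale, best_err = None, np.inf
    for c in np.linspace(0.7, 1.0, candidates):
        scale = c * amax / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        err = float(((q * scale - row) ** 2).sum())
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err
```

Since the full-range scale (c = 1.0) is always among the candidates, the search can never do worse than plain max-abs quantization.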
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Adam
weight_decay: 0.04
momentum: null
other_params: {"applied_to":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"frequency":"every step"}
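The decay (0.997) and every-step update are from the PR; the bookkeeping below is a generic EMA sketch.

```python
import numpy as np

class EMA:
    """Every-step exponential moving average of the weights."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v
```

The shadow weights, not the raw training weights, are what get quantized and shipped in the final artifact under this scheme.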
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"total_steps":7000,"type":"cosine anneal"}
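The step counts and anneal type are from the PR; the exact shape (flat at 1.0 until the warmdown starts, then cosine to zero) is an assumption. A sketch of the LR multiplier:

```python
import math

def lr_scale(step, warmdown_start=3500, total_steps=7000):
    """Flat LR for the first 3500 steps, then cosine-anneal to zero by step 7000."""
    if step < warmdown_start:
        return 1.0
    frac = min((step - warmdown_start) / (total_steps - warmdown_start), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * frac))
```

Note that Late QAT above keys off this same quantity, switching on once the scale falls below 0.15.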
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":10,"chunk_size_tokens":32768,"stride":64,"frozen_blocks":2,"trainable_params":22301260,"total_params":27030108}
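"Score-first" means each chunk is evaluated with the current weights before the model adapts on it, so no chunk's reported loss reflects updates computed from that chunk. A framework-agnostic sketch of the loop; the SGD(momentum=0.9, lr=0.002) step and the two frozen blocks are assumed to live inside train_step_fn:

```python
def score_first_ttt(chunks, score_fn, train_step_fn, epochs_per_chunk=10):
    """Score each chunk first, then run epochs_per_chunk adaptation passes on it.
    Returns the token-weighted mean of the pre-adaptation losses."""
    total_loss, total_tokens = 0.0, 0
    for tokens in chunks:                              # e.g. 32768-token chunks
        total_loss += score_fn(tokens) * len(tokens)   # evaluate first...
        total_tokens += len(tokens)
        for _ in range(epochs_per_chunk):              # ...then adapt
            train_step_fn(tokens)
    return total_loss / total_tokens
```

This ordering is what makes the protocol "legal": adaptation only ever helps on *later* chunks.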
Evaluation
stride-based eval
parameters: {"stride":64}
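The stride (64) is from the PR; the windowing below assumes the usual stride-eval layout, where a context window slides by `stride` and only the final `stride` tokens of each window are scored, so every token is predicted with near-full left context.

```python
def stride_eval_windows(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, score_end) triples covering every
    token exactly once. window=1024 matches train_seq_len; stride=64 is from
    the PR."""
    windows = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        context_start = max(0, end - window)   # clip context to the window
        windows.append((context_start, begin, end))
    return windows
```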
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Freezing first 2 blocks during TTT
parameters: {"frozen_blocks":2}

Novel Contributions

  • Extended training to 7000 steps, with a cosine-anneal warmdown from step 3500 to 7000 for better convergence
  • Mixed int6/int8 quantization scheme: int6 per-row GPTQ-lite quantization for the large QAT-trained weights, and int8 per-tensor scalar quantization for the smaller, quantization-sensitive tensors
  • GEPA architecture combining multiple techniques: ReLU² activation, cross-sequence attention (XSA), bigram hash embeddings, partial RoPE with YARN scaling, U-Net skip connections, value embeddings on deep layers, LN depth scaling, and late QAT
  • Legal score-first test-time training (TTT) protocol: SGD with momentum for 10 epochs per 32K-token chunk, with the first 2 blocks frozen, yielding a −0.0142 BPB improvement
  • 15.70 MB artifact size under the 16 MB limit, with 27M parameters, via mixed quantization and zstd-22 compression