PR #644
Non-record: 11L GEPA + 25k Steps + Pure Int6 + Legal TTT (val_bpb=1.0944) - unlimited compute category
by Christopher-Lee-McClendon
val_bpb
1.0944
Architecture
GEPA (11-layer Transformer variant)
Optimizer
Muon (matrix LR), Adam (scalar LR), SGD (TTT)
Artifact Size
13.83 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite clip search
bits: 6
scope: all model tensors including embeddings
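The card specifies int6 per-row quantization with a 15-candidate GPTQ-lite clip search. A minimal sketch of what that search could look like, assuming a symmetric signed range and a linear grid of clip thresholds (both assumptions; the PR does not state the candidate grid):

```python
def quantize_int6_row(row, n_candidates=15):
    """Hypothetical per-row int6 quantization with a GPTQ-lite style clip
    search: try 15 candidate clip thresholds (fractions of the row's max-abs
    value) and keep the scale giving the lowest reconstruction MSE."""
    qmax = 31  # symmetric signed int6: quantized values clamped to [-31, 31]
    amax = max(abs(x) for x in row) or 1e-12
    best_mse, best_q, best_scale = float("inf"), None, None
    for i in range(n_candidates):
        clip = amax * (1.0 - 0.04 * i)  # illustrative candidate grid
        scale = clip / qmax
        q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
        mse = sum((x - v * scale) ** 2 for x, v in zip(row, q)) / len(row)
        if mse < best_mse:
            best_mse, best_q, best_scale = mse, q, scale
    return best_q, best_scale
```

Per-row scales mean each weight row stores one float scale plus 6-bit codes, which is what zstd-22 then compresses into the 13.83 MB artifact.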
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: {"layers":4}
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
BigramHash
2048 buckets with 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
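The bucket lookup for BigramHash could be as simple as the following sketch; the mixing constants are illustrative, not taken from the PR — only the 2048 buckets and 128-dim embeddings are stated:

```python
def bigram_bucket(prev_id, cur_id, n_buckets=2048):
    """Hypothetical BigramHash lookup: mix the previous and current token
    ids into one of 2048 buckets, each indexing a learned 128-dim
    embedding that is added to the token's input representation."""
    h = (prev_id * 1000003 + cur_id) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % n_buckets
```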
Partial RoPE
Rotary positional embeddings on 16/64 dims with YARN scaling
parameters: {"dims":"16/64","train_seq":2048}
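Partial RoPE here means rotating only 16 of the 64 dims per head and passing the rest through unrotated. A sketch of the rotation on one head vector; YARN scaling (which rescales per-frequency angles for longer contexts) is omitted for brevity, and the base of 10000 is an assumption:

```python
import math

def partial_rope(head_vec, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` dims of a 64-dim head vector
    (16/64 per the card); remaining dims pass through unchanged."""
    out = list(head_vec)
    for i in range(rope_dims // 2):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = head_vec[2 * i], head_vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Leaving 48 of 64 dims position-free keeps most of the head's capacity for content matching while still encoding relative position.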
MLP3x
MLP with 3× expansion and ReLU² activation
parameters: {"expansion_factor":3,"hidden_dim":1536,"activation":"ReLU²"}
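The MLP's ReLU² activation is just the ReLU output squared. A one-liner sketch (note hidden_dim 1536 with 3× expansion implies d_model 512, an inference from the card rather than a stated value):

```python
def relu_sq(v):
    """ReLU^2 activation used in the 3x-expansion MLP:
    zero out negatives, then square."""
    return [max(0.0, x) ** 2 for x in v]
```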
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
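The depth-dependent LayerNorm scale stated in the card is 1/sqrt(layer+1), which damps the contribution of deeper layers. A sketch, assuming 0-indexed layers:

```python
import math

def ln_depth_scale(layer_idx):
    """Depth-dependent LayerNorm scale from the card: 1/sqrt(layer + 1),
    assuming layer_idx is 0-indexed."""
    return 1.0 / math.sqrt(layer_idx + 1)
```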
Tied Embeddings
Input and output embeddings are tied
parameters: null
Optimizer
Muon and Adam for training; SGD with momentum for TTT
weight_decay: 0.04
momentum: 0.9
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"embed_lr":0.035,"decoder_lr_mult":2,"grad_clip":0.3,"ema_decay":0.997,"SGD_lr":0.002,"SGD_epochs_per_chunk":10,"SGD_chunk_size":32768,"SGD_stride":64,"SGD_frozen_blocks":2,"SGD_grad_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.997}
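EMA weight averaging with the card's decay of 0.997 is the standard per-step shadow update; a minimal sketch over flat parameter lists:

```python
def ema_update(ema_params, params, decay=0.997):
    """Standard EMA weight averaging with the card's decay 0.997:
    shadow <- decay * shadow + (1 - decay) * param, applied each step.
    The EMA weights, not the raw weights, are what get quantized/saved."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```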
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size":32768,"stride":64,"frozen_blocks":2,"gradient_clip":1}
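"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before any SGD update on that chunk, so no token's loss benefits from training on itself. An outline of the control flow, where `score_fn` and `train_fn` are hypothetical stand-ins for the model's eval pass and its SGD-with-momentum adaptation (10 epochs per chunk, first 2 blocks frozen, per the card):

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_size=32768):
    """Legal score-first TTT loop: evaluate each chunk BEFORE adapting
    on it, then carry the adapted weights forward to the next chunk."""
    total, count = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total += score_fn(chunk) * len(chunk)  # score with current weights first
        count += len(chunk)
        train_fn(chunk)                        # then adapt weights on that chunk
    return total / count  # mean per-token score (e.g. BPB)
```

Later chunks thus benefit from adaptation on earlier ones, which is where the −0.014 BPB gain comes from without ever peeking at a chunk before scoring it.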
LR Schedule
cosine warmdown with linear warmup
parameters: {"warmup_steps":20,"peak_lr_steps":12000,"warmdown_steps":13000}
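The schedule parameters add up to the 25k total steps: 20 warmup steps, constant peak until step 12,000, then a 13,000-step cosine warmdown to zero. A sketch, using the matrix LR of 0.025 as the example peak (the same shape would apply per parameter group):

```python
import math

def lr_at(step, peak_lr=0.025, warmup_steps=20,
          peak_lr_steps=12000, warmdown_steps=13000):
    """Schedule implied by the card: linear warmup for 20 steps, hold at
    peak until step 12000, cosine decay to 0 by step 25000."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < peak_lr_steps:
        return peak_lr
    t = min(1.0, (step - peak_lr_steps) / warmdown_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```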
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay and layerwise LN scale
parameters: {"weight_decay":0.04,"LN_scale":"1/sqrt(layer+1)"}
Novel Contributions
- Extended training to 25,000 steps with a 13,000-step cosine warmdown phase, demonstrating that BPB improvement accelerates during warmdown.
- Confirmed a consistent scaling law: float-base BPB, TTT BPB, and artifact size all improve monotonically with training steps.
- Observed that the TTT gain shrinks as the float base improves, suggesting diminishing returns from test-time training on better-trained models.
- Applied pure int6 per-row quantization with 15-candidate GPTQ-lite clip search combined with zstd-22 compression to achieve the smallest artifact size in the series.
- Implemented legal score-first test-time training using SGD with momentum, with the first two blocks frozen, achieving a −0.014 BPB gain.
- Introduced architecture modifications including cross-sequence attention on the last 4 layers, a SmearGate token-mixing gate, BigramHash embeddings, partial RoPE with YARN scaling, and layerwise LN scaling.
- Demonstrated that fine-grained optimization at low learning rates during warmdown is disproportionately effective.