PR #628
Non-record: 11L GEPA + 20k Steps + Pure Int6 + Legal TTT (val_bpb=1.0983): unlimited compute: 4×A100-40GB, ~2.8 hours
by Christopher-Lee-McClendon
val_bpb
1.0983
Architecture
11-layer GEPA Transformer variant
Optimizer
SGD with momentum
Artifact Size
14.29 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite clip search
bits: 6
scope: all model tensors including embeddings
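The card specifies per-row int6 quantization with a GPTQ-lite clip search (the contributions section below mentions 15 percentile candidates). A minimal sketch of one plausible reading, with symmetric per-row scales and a reconstruction-MSE search over percentile-based clip values; function and parameter names are assumptions:

```python
import numpy as np

def quantize_row_int6(row, clip):
    """Symmetric int6 quantization of one row against an absolute clip value."""
    levels = 2 ** (6 - 1) - 1                    # int6 -> integers in [-31, 31]
    scale = clip / levels if clip > 0 else 1.0
    q = np.clip(np.round(row / scale), -levels, levels)
    return q.astype(np.int8), scale

def clip_search_int6(weight, n_candidates=15):
    """Per-row clip search: try percentile-based clip candidates and keep
    the one minimizing reconstruction MSE (a GPTQ-lite-style proxy)."""
    out_q = np.empty_like(weight, dtype=np.int8)
    scales = np.empty(weight.shape[0])
    percentiles = np.linspace(99.0, 100.0, n_candidates)
    for i, row in enumerate(weight):
        best_err, best = np.inf, None
        for p in percentiles:
            clip = np.percentile(np.abs(row), p)
            q, s = quantize_row_int6(row, clip)
            err = np.mean((q * s - row) ** 2)
            if err < best_err:
                best_err, best = err, (q, s)
        out_q[i], scales[i] = best
    return out_q, scales
```

Storing only the int6 codes plus one float scale per row is what keeps the artifact small before zstd compression.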
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: null
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
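The card lists no parameters for SmearGate, so the exact form is not given. One common reading of a "learned token-mixing gate on input embeddings" is a sigmoid gate that mixes each token's embedding with the previous token's; the sketch below is that assumption, not the PR's confirmed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Assumed form: mix each token embedding with the previous token's
    embedding via a learned per-channel sigmoid gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))   # per-channel gate logits

    def forward(self, x):                            # x: (batch, seq, dim)
        # shift right by one token; position 0 has no predecessor
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
        g = torch.sigmoid(self.gate)                 # in (0, 1)
        return x + g * prev
```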
BigramHash
2048 buckets with 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
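The card gives only the bucket count and embedding width for BigramHash. A minimal sketch: hash each (previous token, current token) pair into one of 2048 buckets and look up a 128-dim embedding. The multiplicative hash mix is my assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramHash(nn.Module):
    """Hash each bigram (prev token, current token) into a fixed number of
    buckets and look up a learned embedding for the bucket."""
    def __init__(self, buckets=2048, embedding_dim=128):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, embedding_dim)

    def forward(self, tokens):                       # tokens: (batch, seq) int64
        prev = F.pad(tokens, (1, 0))[:, :-1]         # shift right; pad pos 0 with 0
        h = (prev * 1000003 + tokens) % self.buckets  # simple mixing hash (assumed)
        return self.emb(h)                           # (batch, seq, embedding_dim)
```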
RoPE
Partial RoPE with 16/64 dims and YARN scaling
parameters: {"partial_dims":"16/64","train_seq_length":2048}
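Partial RoPE with 16/64 dims means only the first 16 of each 64-dim head are rotated and the rest pass through unchanged. A sketch of that split (the YARN frequency rescaling is omitted here for brevity):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` of the head
    dimension; the remaining dims are passed through unrotated."""
    b, t, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()                  # (t, half) each
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rot, x[..., rot_dims:]], dim=-1)
```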
MLP
3× expansion with ReLU² activation
parameters: {"expansion_factor":3,"hidden_dim":1536,"activation":"ReLU²"}
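The MLP parameters (expansion 3, hidden width 1536) imply a model width of 512; that inference and the lack of bias handling details make this a sketch rather than the exact module:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """3x-expansion MLP with ReLU^2 activation; d_model=512 is inferred
    from hidden_dim=1536 in the card."""
    def __init__(self, d_model=512, expansion=3):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)
        self.down = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        h = torch.relu(self.up(x)) ** 2      # ReLU squared
        return self.down(h)
```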
Value Embeddings
128d on layers 9–10 with per-layer scale initialized at 0.1
parameters: {"dimension":128,"layers":[9,10],"init_scale":0.1}
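The card gives only dimension, target layers, and scale init for the value embeddings. One plausible form, used in some recent speedrun-style models, adds a token-indexed embedding into a layer's attention values through a learned per-layer scale; the mixing point here is an assumption:

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Token-indexed value embedding mixed into a layer's attention values
    via a learned per-layer scale initialised at 0.1 (mixing point assumed)."""
    def __init__(self, vocab_size, dim=128, init_scale=0.1):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, values, tokens):   # values: (b, t, dim); tokens: (b, t)
        return values + self.scale * self.emb(tokens)
```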
LN Scale
LayerNorm scale with 1/sqrt(layer+1) depth scaling
parameters: null
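One plausible reading of "LayerNorm scale with 1/sqrt(layer+1) depth scaling" is initialising each layer's LayerNorm gain to 1/sqrt(layer+1), damping deeper layers at the start of training:

```python
import torch
import torch.nn as nn

def depth_scaled_layernorm(dim, layer_index):
    """LayerNorm whose learnable scale is initialised to 1/sqrt(layer+1)
    (one plausible reading of the card's depth scaling)."""
    ln = nn.LayerNorm(dim)
    with torch.no_grad():
        ln.weight.fill_(1.0 / (layer_index + 1) ** 0.5)
    return ln
```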
U-Net skips
Residual connections across layer pairs
parameters: null
Tied Embeddings
Weight tying of embeddings
parameters: null
Optimizer
SGD
weight_decay: 0.04
momentum: 0.9
other_params: {"learning_rate":0.002,"lr_schedule":"cosine decay with 5% warmup"}
Weight Averaging
EMA
parameters: {"decay":0.997}
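The EMA update with decay 0.997 is the standard exponential moving average of weights; a minimal sketch:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step over paired tensors: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

The EMA copy, not the raw weights, is typically what gets quantized and shipped.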
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"chunk_size":32768,"epochs_per_chunk":10}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size":32768,"stride":64,"frozen_blocks":2,"gradient_clip":1,"lr_warmup_percent":5}
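Score-first TTT means each chunk is scored with the current weights before the model adapts on it, so the evaluation never sees weights fitted to the chunk being scored. A sketch of that loop using the listed hyperparameters; it assumes the model exposes a `blocks` list and that `loss_fn(model, chunk)` returns the scoring loss (both are my naming assumptions):

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=0.002, momentum=0.9,
                    epochs_per_chunk=10, frozen_blocks=2, grad_clip=1.0):
    """Score-first test-time training: score each chunk, then adapt on it."""
    # freeze the first `frozen_blocks` transformer blocks
    for block in model.blocks[:frozen_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)
    total_loss = 0.0
    for chunk in chunks:
        with torch.no_grad():                 # 1) score first, legally
            total_loss += loss_fn(model, chunk).item()
        for _ in range(epochs_per_chunk):     # 2) then adapt on the chunk
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            torch.nn.utils.clip_grad_norm_(trainable, grad_clip)
            opt.step()
    return total_loss / max(len(chunks), 1)
```

The 5% LR warmup listed in the parameters is omitted here to keep the sketch short.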
LR Schedule
warmdown
parameters: {"warmdown_start_step":12000,"warmdown_steps":8000,"type":"cosine anneal"}
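Combining the schedule parameters here with the 5% warmup listed under Optimizer gives a piecewise schedule: linear warmup, flat at peak LR until step 12000, then cosine anneal to zero over 8000 steps. A sketch:

```python
import math

def lr_with_warmdown(step, peak_lr=0.002, total_steps=20000,
                     warmup_frac=0.05, warmdown_start=12000, warmdown_steps=8000):
    """Linear warmup -> flat peak -> cosine-anneal warmdown to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup            # linear warmup
    if step < warmdown_start:
        return peak_lr                            # flat plateau at peak LR
    t = min((step - warmdown_start) / warmdown_steps, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))  # cosine anneal
```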
Regularization
weight decay
parameters: {"weight_decay":0.04}
freeze early layers during TTT
parameters: {"frozen_blocks":2,"total_blocks":11}
Novel Contributions
- Demonstrated that warmdown is a first-class training variable delivering the majority of gains after the peak-LR plateau, with an 8000-step warmdown driving the float (pre-quantization) base BPB from ~1.216 to 1.1153.
- Achieved smallest artifact size (14.29 MB) with pure int6 per-row quantization combined with GPTQ-lite clip search over 15 percentile candidates and zstd-22 compression.
- Showed that SGD with momentum outperforms AdamW for legal score-first test-time training (TTT), delivering 2.4× the TTT gain on the same base model.
- Identified freezing early layers during TTT as active regularization that improves adaptation, not merely a defense against catastrophic forgetting.
- Found that as base-model quality improves, the relative contribution of TTT to the final gain shrinks, arguing for investing in base-model training once the right TTT regime is chosen.