PR #612
Non-record: 11L GEPA + 12k Steps + Pure Int6 + Legal TTT (val_bpb=1.1079)
by Christopher-Lee-McClendon
val_bpb
1.1079
Architecture
GEPA
Optimizer
SGD
Artifact Size
14.79 MB
Training Techniques
Quantization
int6 per-row with GPTQ-lite
bits: 6
scope: all
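A minimal sketch of symmetric per-row int6 quantization with a clip-ratio search, in the spirit of the "GPTQ-lite" search described here. The 0.5–1.0 candidate grid and the plain-MSE objective are assumptions; the PR only states 6 bits, per-row scaling, and 15 clip candidates.

```python
def quantize_row_int6(row, n_candidates=15):
    """Symmetric per-row int6 quantization with a clip-ratio search
    (GPTQ-lite-style sketch; the candidate grid and MSE objective
    are assumptions, not the PR's exact recipe)."""
    qmax = 2 ** (6 - 1) - 1  # 31 for signed int6
    amax = max(abs(v) for v in row) or 1.0
    best = None
    # Try 15 clip ratios from 0.5 to 1.0 and keep the lowest-MSE one.
    for i in range(n_candidates):
        clip = 0.5 + 0.5 * i / (n_candidates - 1)
        scale = clip * amax / qmax
        q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
        err = sum((v - qi * scale) ** 2 for v, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

Per-row scaling means each weight-matrix row gets its own scale, so one outlier only degrades its own row.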
Architecture
XSA
Cross-sequence attention on last 4 layers
parameters: null
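The PR does not spell out the XSA mechanism. One plausible reading — queries in each sequence attending over keys/values pooled across all sequences in the batch — can be sketched as single-head attention; treat everything here as an assumption:

```python
import math

def cross_sequence_attention(qs, ks, vs):
    """Single-head attention where every query attends over keys/values
    pooled across ALL sequences in the batch (one reading of
    'cross-sequence attention'; the real mechanism may differ).
    qs, ks, vs: [n_seq][seq_len][dim] nested lists."""
    d = len(qs[0][0])
    flat_k = [k for seq in ks for k in seq]  # pool across sequences
    flat_v = [v for seq in vs for v in seq]
    out = []
    for seq in qs:
        seq_out = []
        for q in seq:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in flat_k]
            m = max(scores)                      # stabilize softmax
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            seq_out.append([sum(wi * v[j] for wi, v in zip(w, flat_v)) / z
                            for j in range(d)])
        out.append(seq_out)
    return out
```

Restricting this to the last 4 layers keeps most of the network purely within-sequence.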
SmearGate
Learned token-mixing gate on input embeddings
parameters: null
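A hedged sketch of the smear gate: each input embedding is mixed with the previous token's embedding through a learned sigmoid gate. A single scalar gate is assumed here; the PR may use per-dimension gating.

```python
import math

def smear_gate(embeddings, gate_logit):
    """Mix each token's embedding with the previous token's embedding,
    weighted by a learned gate (scalar gate is an assumption).
    embeddings: [seq_len][dim] nested lists."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid of learned logit
    out = [embeddings[0][:]]                 # first token: no predecessor
    for t in range(1, len(embeddings)):
        out.append([x + g * p
                    for x, p in zip(embeddings[t], embeddings[t - 1])])
    return out
```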
BigramHash
2048 buckets, 128-dim embeddings
parameters: {"buckets":2048,"embedding_dim":128}
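The bigram-hash feature maps each (previous token, current token) pair to one of 2048 buckets, each backed by a learned 128-dim embedding. A sketch of the bucket lookup, with illustrative mixing constants (the PR's hash function is not specified):

```python
def bigram_bucket(prev_tok, tok, n_buckets=2048):
    """Hash a (prev, current) token-id pair into one of 2048 buckets.
    The multiplicative constants are illustrative, not the PR's."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % n_buckets
```

At 2048 buckets × 128 dims this adds only ~262k parameters while giving the model a cheap bigram prior.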
Partial RoPE
Partial rotary positional embeddings with YARN scaling
parameters: {"dims":"16/64","train_seq":2048}
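Partial RoPE rotates only the first 16 of each head's 64 dimensions, leaving the rest position-free. A minimal sketch of that split; the YARN frequency rescaling (used when extrapolating past the 2048-token training length) is omitted here:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of
    a per-head vector, leaving the remainder untouched (the 16/64 split
    from the PR; YARN rescaling omitted in this sketch)."""
    out = x[:]
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s       # 2-D rotation of the pair
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```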
MLP3x
3× expansion with ReLU² activation
parameters: {"hidden_dim":1536}
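The feed-forward block expands 3× and squares the ReLU output. A toy-dimension sketch (the real hidden width is 1536, and bias terms are assumed absent):

```python
def mlp3x(x, w_in, w_out):
    """Feed-forward block with 3x expansion and ReLU^2 activation.
    w_in: 3d columns of length d; w_out: d columns of length 3d."""
    hidden = [sum(xi * w for xi, w in zip(x, col)) for col in w_in]  # d -> 3d
    hidden = [max(0.0, h) ** 2 for h in hidden]                      # ReLU squared
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in w_out]  # 3d -> d
```

ReLU² keeps the cheap sparsity of ReLU while giving a smoother, faster-growing positive branch.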
tied embeddings
Tied input and output embeddings
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"epochs_per_chunk":10,"gradient_clip":1,"freeze_first_blocks":2}
Weight Averaging
EMA
parameters: {"decay":0.997}
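The EMA with decay 0.997 maintains a shadow copy of the weights updated after every step, a sketch of which is:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over flattened weights: avg <- decay*avg + (1-decay)*params.
    The averaged copy (not the raw weights) is what gets evaluated."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```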
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":10,"chunk_size_tokens":32768,"stride_tokens":64,"frozen_blocks":2,"gradient_clip":1,"total_chunks":1893}
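"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before any gradient step uses that chunk, so no token is ever evaluated by weights trained on it. A sketch of the loop with the PR's SGD settings; `grad_fn`/`score_fn` are stand-ins for the real model, and the frozen-first-2-blocks masking is not shown:

```python
def score_first_ttt(chunks, params, grad_fn, score_fn,
                    lr=0.002, momentum=0.9, epochs=10, clip=1.0):
    """Score-first test-time training sketch: score each chunk BEFORE
    adapting on it (PR settings: SGD, lr=0.002, momentum=0.9,
    10 epochs/chunk, global-norm grad clip 1.0)."""
    velocity = [0.0] * len(params)
    total_score = 0.0
    for chunk in chunks:
        total_score += score_fn(params, chunk)   # score with current weights
        for _ in range(epochs):                  # then adapt on the chunk
            grad = grad_fn(params, chunk)
            norm = sum(g * g for g in grad) ** 0.5
            if norm > clip:                      # global-norm gradient clip
                grad = [g * clip / norm for g in grad]
            velocity = [momentum * v + g for v, g in zip(velocity, grad)]
            params = [p - lr * v for p, v in zip(params, velocity)]
    return total_score, params
```

The momentum buffer persists across chunks, which matters at 1893 chunks: later chunks start from an already-warm velocity.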
LR Schedule
cosine decay with linear warmup
parameters: {"warmup_steps":20,"warmdown_start_step":7000,"total_steps":12000}
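Reading the schedule parameters literally — 20-step linear warmup, flat until step 7000, then cosine warmdown over the final 5000 steps — gives this sketch (the flat middle is an inference from `warmdown_start_step`, not stated outright):

```python
import math

def lr_at(step, peak_lr=0.002, warmup=20, warmdown_start=7000, total=12000):
    """Linear warmup -> flat -> cosine warmdown to zero,
    shaped from the PR's schedule parameters."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup        # linear warmup
    if step < warmdown_start:
        return peak_lr                              # flat middle
    frac = (step - warmdown_start) / (total - warmdown_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))  # cosine to 0
```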
Regularization
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- 12k-step training schedule with a 5k-step warmdown, exploiting the unlimited-compute track
- Pure int6 per-row quantization with a 15-candidate GPTQ-lite clip search
- Legal, score-first test-time training (TTT) with SGD momentum and learning-rate warmup