PR #668
Non-record: 11L GEPA + 30k Steps + Pure Int6 + Legal TTT (val_bpb=1.0920)
by Christopher-Lee-McClendon
val_bpb: 1.0920
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.40 MB
Training Techniques
Architecture
GEPA
11-layer transformer architecture with GEPA-related modifications
parameters: {"layers":11}
BigramHash
BigramHash embeddings used in the model
parameters: {"size":2048,"dim":128}
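The exact hashing scheme isn't spelled out in the submission; a minimal sketch, assuming each (previous token, current token) pair is hashed into one of the 2048 buckets, each of which would index a learned 128-dim embedding added to the token embedding (the sentinel previous token of -1 at position 0 and the use of blake2b as a stable hash are my assumptions):

```python
import hashlib

BIGRAM_BUCKETS = 2048   # "size" from the submission
BIGRAM_DIM = 128        # "dim" from the submission (embedding width, unused below)

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = BIGRAM_BUCKETS) -> int:
    """Hash a (previous, current) token pair into a fixed bucket.

    A stable hash (not Python's per-process randomized hash()) keeps the
    mapping identical across runs, which a learned embedding table requires.
    """
    key = f"{prev_tok},{tok}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % n_buckets

def bigram_features(token_ids):
    """One bucket id per position; position 0 pairs with a sentinel token -1."""
    prev = -1
    buckets = []
    for t in token_ids:
        buckets.append(bigram_bucket(prev, t))
        prev = t
    return buckets
```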
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"dimensions":16}
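With `dimensions: 16`, only the first 16 channels of each head get the rotary treatment and the rest pass through untouched. A sketch of that partial application on a single head vector (the standard RoPE base of 10000 is an assumption):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` channels of a
    head vector `x`, rotating consecutive pairs by a position-dependent angle;
    channels beyond `rot_dims` are returned unchanged."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each 2-d rotation is norm-preserving, the rotated pairs keep their length while encoding relative position.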
SmearGate
SmearGate activation/gating mechanism
parameters: null
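No parameters are given for SmearGate, so this is only a guess at the mechanism: a common "smearing" formulation in speedrun-style models mixes each position's embedding with the previous position's through a learned sigmoid gate. The scalar gate and the exact mixing form below are assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear(embeddings, gate_logit=0.0):
    """Hypothetical smear gate: out[t] = x[t] + g * x[t-1], g = sigmoid(gate_logit).
    The real gate may be per-channel or per-layer; this shows only the shape
    of the idea, not the submission's actual mechanism."""
    g = sigmoid(gate_logit)
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        out.append([a + g * b for a, b in zip(embeddings[t], embeddings[t - 1])])
    return out
```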
weight tying
Tied input-embedding / decoder weight sharing, implied by the tied-embed LR setting in the optimizer config
parameters: null
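For concreteness, weight tying means a single (vocab, dim) matrix serves as both the input embedding and the output decoder, which is why a `lr_tied_embed` and a `decoder_lr_multiplier` can both act on the same tensor. A minimal sketch (the class and names are illustrative, not from the submission):

```python
class TiedLM:
    """Minimal weight tying: one shared matrix W of shape (vocab, dim) is used
    for both the input embedding (row lookup) and the output decoder
    (dot product of each vocab row against the hidden state)."""

    def __init__(self, W):
        self.W = W  # shared by embed and decode

    def embed(self, token_id):
        return self.W[token_id]

    def logits(self, h):
        # Decoder side: score every vocabulary row against hidden state h.
        return [sum(wi * hi for wi, hi in zip(row, h)) for row in self.W]
```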
Quantization
GPTQ-lite
bits: 6
scope: all weights including embeddings
int6
bits: 6
scope: per-row, including embeddings
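A sketch of per-row int6 quantization with a small clip search, which is my reading of "GPTQ-lite clip search": try a few clipping fractions of each row's max-abs value and keep whichever scale minimizes reconstruction MSE. The symmetric code range and the candidate ratios are assumptions:

```python
def quantize_row_int6(row, clip_ratios=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Per-row symmetric int6 quantization with a clip search.

    For each candidate clip ratio r, the scale maps r * max|row| to the top
    code; values are rounded, clamped to [-31, 31], dequantized, and the
    ratio with the lowest mean-squared reconstruction error wins.
    """
    qmax = 31  # symmetric int6 codes
    max_abs = max(abs(v) for v in row) or 1.0
    best = None
    for r in clip_ratios:
        scale = (r * max_abs) / qmax
        q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
        deq = [qi * scale for qi in q]
        mse = sum((a - b) ** 2 for a, b in zip(row, deq)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    _, q, scale = best
    return q, scale
```

Storing one scale per row plus 6-bit codes for every weight, embeddings included, is what gets the artifact down to the 13.40 MB figure above.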
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_warmup_end":0.99,"momentum_warmup_steps":1500,"lr_matrix":0.025,"lr_tied_embed":0.035,"decoder_lr_multiplier":2}
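The config ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. The endpoints and step count come from the parameters above; the linear shape of the ramp is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp linearly from `start` to `end` over
    `warmup_steps`, then hold at `end` for the rest of training."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```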
Weight Averaging
EMA
parameters: {"decay":0.997}
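EMA weight averaging keeps a slow-moving shadow copy of the weights; with decay 0.997 each step moves the shadow 0.3% toward the current weights. A one-step sketch (evaluating or quantizing the EMA copy rather than the raw weights is the usual practice, not something the card states):

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step, elementwise: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```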
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs":10,"tokens_per_chunk":32768,"freeze_first_blocks":2}
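A sketch of the test-time training loop implied by those parameters: plain SGD with momentum over the evaluation chunk, with the first blocks excluded from updates. The toy parameter indexing stands in for real tensors, and reading "score-first" as "the chunk is scored before any update touches the weights" is my interpretation:

```python
def ttt_sgd(params, grad_fn, lr=0.002, momentum=0.9, epochs=10, frozen=frozenset()):
    """Test-time SGD with momentum. `grad_fn(params)` returns one gradient per
    parameter; indices in `frozen` (here standing in for the first 2 frozen
    transformer blocks) receive no test-time updates."""
    velocity = [0.0] * len(params)
    p = list(params)
    for _ in range(epochs):
        g = grad_fn(p)
        for i in range(len(p)):
            if i in frozen:
                continue  # frozen blocks are left at their trained values
            velocity[i] = momentum * velocity[i] + g[i]
            p[i] -= lr * velocity[i]
    return p
```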
LR Schedule
warmdown
parameters: {"warmdown_steps":18000,"warmdown_ratio":0.6,"peak_lr_steps":12000,"warmup_steps":20}
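Those parameters describe a 30k-step run: a 20-step warmup, peak LR through step 12000, then an 18000-step warmdown, i.e. the 60% warmdown ratio cited below. A multiplier function under those numbers (linear warmup/warmdown shapes and a final LR of 0 are assumptions):

```python
def lr_multiplier(step, warmup_steps=20, peak_lr_steps=12000, warmdown_steps=18000):
    """LR schedule multiplier: linear warmup, hold at peak, linear warmdown.
    With these defaults the warmdown spans 18000 of 30000 total steps (60%)."""
    total = peak_lr_steps + warmdown_steps
    if step < warmup_steps:
        return step / warmup_steps
    if step < peak_lr_steps:
        return 1.0
    return max(0.0, (total - step) / warmdown_steps)
```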
Regularization
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"clip_norm":0.3}
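Reading `clip_norm` as global-norm clipping (the usual convention, though the card doesn't say): if the L2 norm over all gradients exceeds 0.3, every gradient is rescaled by the same factor so the total norm equals 0.3. A flat-vector sketch:

```python
import math

def clip_grad_norm(grads, clip_norm=0.3):
    """Global-norm clipping: rescale all gradients uniformly when their
    combined L2 norm exceeds `clip_norm`; otherwise leave them untouched."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= clip_norm:
        return grads
    scale = clip_norm / total
    return [g * scale for g in grads]
```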
Initialization
OrthoInit
Referenced as one of the prior techniques this submission builds on
Novel Contributions
- 11-layer GEPA architecture trained for 30k steps
- Pure int6 per-row quantization with GPTQ-lite clip search
- Legal score-first TTT using SGD with momentum
- 60% warmdown ratio to reduce quantization gap
- Smallest artifact in the author's series at 13.40 MB
- Includes model artifact for reproducibility