PR #1876

open

Non-record: Coprime-Stride Loader + Full GPTQ + Score-First TTT (3-seed mean 1.08008 BPB)

by Meirzhan05
val_bpb: 1.0801
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Layers 3-5 are looped 3x to create virtual depth.
parameters: {"layers":[3,5],"loops":3}
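A minimal sketch of how such a loop could expand into an execution order. It assumes 0-based, inclusive layer indices, an 11-layer stack (inferred from the XSA entry), and back-to-back repeats of the looped block. The name `virtual_layer_order` is hypothetical, and the PR's actual scheduling may differ.

```python
def virtual_layer_order(num_layers, loop_start, loop_end, loops):
    """Return the physical layer indices visited in one forward pass."""
    before = list(range(0, loop_start))
    block = list(range(loop_start, loop_end + 1))  # inclusive range (assumption)
    after = list(range(loop_end + 1, num_layers))
    return before + block * loops + after


order = virtual_layer_order(num_layers=11, loop_start=3, loop_end=5, loops=3)
print(order)       # physical indices, with 3, 4, 5 repeated three times
print(len(order))  # virtual depth under these assumptions
```

Under these assumptions, 11 physical layers yield 17 layer applications with no additional parameters.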
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
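As an illustration of the head sharing, the sketch below maps each query head to the KV head it would read from. It assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on), which is a common convention but is not confirmed by the PR; `kv_head_for` is a hypothetical helper.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Map a query head index to its shared KV head (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size


mapping = [kv_head_for(q) for q in range(8)]
print(mapping)
```

With 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving the key/value projections relative to full multi-head attention.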
XSA
XSA applied on all layers.
parameters: {"layers":11}
U-Net skip connections
U-Net style skip connections with learnable gates.
parameters: null
weight tying
Tied embeddings.
parameters: null
LeakyReLU
MLP uses LeakyReLU(0.5)^2 activation.
parameters: {"slope":0.5}
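A scalar sketch of the activation, reading "LeakyReLU(0.5)^2" as the square of a LeakyReLU with negative slope 0.5. That reading and the function name are assumptions; the PR's tensor implementation is not shown here.

```python
def leaky_relu_squared(x, slope=0.5):
    """Square of LeakyReLU: x**2 for x >= 0, (slope * x)**2 for x < 0."""
    leaky = x if x >= 0 else slope * x
    return leaky * leaky


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, leaky_relu_squared(value))
```

Note that the output is non-negative everywhere: negative inputs are scaled by the slope before squaring, so they contribute a quarter of the magnitude that a positive input of the same size would.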
Quantization
GPTQ
bits: null
scope: all
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5}
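To illustrate what a fixed number of Newton-Schulz steps does, the sketch below approximately orthogonalizes a small matrix in pure Python. It deliberately swaps in the classical cubic iteration X <- 1.5 X - 0.5 X X^T X rather than the tuned polynomial that Muon implementations typically use, so it is not the PR's code; all function names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=5):
    """Push the singular values of g toward 1 with a cubic Newton-Schulz iteration."""
    # Frobenius normalization keeps every singular value at or below 1,
    # which is inside the convergence region of the cubic iteration.
    norm = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


q = newton_schulz_orthogonalize([[3.0, 1.0], [1.0, 2.0]], steps=5)
print(q)  # close to the identity, the orthogonal polar factor of this matrix
```

The input here is symmetric positive definite, so its nearest orthogonal matrix is the identity, which makes the result easy to check by eye.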
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
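A minimal sketch of an exponential moving average over weights with the listed decay, assuming the standard update rule and a flat list of floats in place of real tensors. `ema_update` is a hypothetical name.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [1.0, 2.0]
for _ in range(3):  # three training steps with constant weights, for illustration
    ema = ema_update(ema, [0.0, 2.0])
print(ema)
```

With a decay of 0.9965, each step moves the average only 0.35% of the way toward the current weights, so the average reflects roughly the last few hundred steps.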
Compression
lzma
level: null
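For reference, a round trip through Python's standard `lzma` module looks like the sketch below. The level is listed as null above, so `preset=9` is purely an illustrative assumption, and the payload is a stand-in for the serialized artifact.

```python
import lzma

# Stand-in payload; the real artifact contents are not shown in this summary.
payload = b"quantized-weights " * 1000

packed = lzma.compress(payload, preset=9)  # preset chosen for illustration only
restored = lzma.decompress(packed)

print(len(payload), len(packed))
print(restored == payload)
```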
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_size":32000}
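The ordering that "score-first" implies can be sketched as a loop in which each chunk is scored before any update on that chunk is applied. The callbacks `score_fn` and `update_fn` are hypothetical placeholders for the model's loss evaluation and optimizer step, and treating `chunk_size` as a token count is an assumption.

```python
def score_first_ttt(tokens, chunk_size, score_fn, update_fn, epochs=3):
    """Score each chunk with not-yet-adapted weights, then train on it."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)  # recorded before the model sees this chunk
        for _ in range(epochs):
            update_fn(chunk)           # adaptation only affects later chunks
    return total_loss


events = []

def toy_score(chunk):
    events.append(("score", chunk[0]))
    return float(len(chunk))

def toy_update(chunk):
    events.append(("update", chunk[0]))

total = score_first_ttt(list(range(4)), 2, toy_score, toy_update, epochs=1)
print(events)
```

Because the reported loss for a chunk is fixed before any gradient step on it, the adaptation cannot leak a chunk's targets into its own score.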
Evaluation
sliding window eval
parameters: null
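Since no parameters are listed, the following is only a generic sketch of sliding-window evaluation: after the first window, each step scores `stride` new tokens while reusing earlier tokens as context. The window and stride values, and the name `sliding_windows`, are hypothetical.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (ctx_start, end, score_start); tokens in [score_start, end) are scored."""
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            end = min(window, n_tokens)  # first window scores everything it covers
        else:
            end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)  # context reaches back at most `window` tokens
        yield ctx_start, end, score_start
        score_start = end


spans = list(sliding_windows(n_tokens=10, window=4, stride=2))
print(spans)
```

Every token is scored exactly once, and all tokens after the first window are scored with at least `window - stride` tokens of preceding context.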
LR Schedule
warmdown
parameters: {"warmdown":0.72}
linear warmup
parameters: {"steps":20}
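One plausible reading of these two entries is a learning-rate multiplier that ramps up linearly over the first 20 steps, holds at 1, and then decays linearly to 0 over the final 72% of training. The sketch below encodes that reading; the linear warmdown shape, the interpretation of 0.72 as a fraction of total steps, and the name `lr_scale` are assumptions.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.72):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if warmdown_frac > 0 and step >= warmdown_start:
        # linear decay from 1 at warmdown_start to 0 at total_steps
        return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
    return 1.0


for step in (0, 19, 100, 640, 1000):
    print(step, lr_scale(step, total_steps=1000))
```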
Regularization
logit softcap
parameters: {"value":30}
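A scalar sketch of a logit softcap, assuming the common tanh form `cap * tanh(x / cap)`; the PR summary lists only the cap value, so the exact functional form is an assumption.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)


for value in (0.0, 1.0, 30.0, 1000.0):
    print(value, softcap(value))
```

Small logits pass through almost unchanged, while large ones saturate at the cap instead of growing without bound.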

Novel Contributions

  • Coprime-stride multi-shard loader with progress-based adaptive shard selection
  • Full Hessian GPTQ with Cholesky fallback for ill-conditioned matrices
  • LZMA-compressed self-extracting Python submission artifact
  • Score-first test-time training that scores each chunk before updating
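The property a coprime stride relies on can be sketched as follows: stepping through `n` positions with a stride that shares no factor with `n` visits every position exactly once before repeating. This illustrates only the number-theoretic idea, not the PR's loader or its adaptive shard selection; both function names are hypothetical.

```python
from math import gcd


def coprime_stride(n, preferred):
    """Smallest stride >= preferred that is coprime with n."""
    if n < 1:
        raise ValueError("n must be positive")
    stride = max(1, preferred)
    while gcd(stride, n) != 1:
        stride += 1
    return stride


def visit_order(n, stride, offset=0):
    """Positions visited when stepping by `stride` modulo n."""
    return [(offset + i * stride) % n for i in range(n)]


stride = coprime_stride(10, preferred=4)
order = visit_order(10, stride)
print(stride, order)
```

A stride like this gives a scattered, deterministic traversal without storing a shuffled index, which is one reason such schemes are attractive for sharded data.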