PR #533 (closed)

GPTQ + Short TTT — val_bpb 1.1207 (seed 1337)

by newjordan

val_bpb: 1.1207
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.60 MB

Training Techniques

Quantization
• GPTQ (bits: 6, scope: all)
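As a rough illustration of what 6-bit quantization over all weights means, the sketch below rounds values onto the signed 6-bit grid. This is only the grid itself: GPTQ additionally compensates rounding error column by column using second-order information, which is omitted here.

```python
def quantize_6bit(weights, scale=None):
    """Symmetric round-to-nearest onto a 6-bit grid (levels -32..31).

    Illustrative stand-in only: GPTQ proper also applies per-column
    error compensation, which this sketch does not implement.
    """
    qmax = 2 ** (6 - 1) - 1          # 31
    qmin = -(2 ** (6 - 1))           # -32
    if scale is None:
        scale = max(abs(w) for w in weights) / qmax
    q = [min(max(round(w / scale), qmin), qmax) for w in weights]
    deq = [v * scale for v in q]     # dequantized values for error checking
    return q, deq, scale

q, deq, s = quantize_6bit([0.5, -0.24, 0.31, -0.02])
```

With round-to-nearest, each dequantized weight lands within half a quantization step of the original, which is the baseline GPTQ improves on.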
Architecture
• XSA: used in the last 4 layers of the custom transformer architecture (parameters: {"layers":4})
• SmearGate: custom gating mechanism used in the MLP blocks
• BigramHash: bigram-hash feature with 2048 buckets (parameters: {"buckets":2048})
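The PR does not specify the hash used by the BigramHash feature, only the bucket count. A minimal sketch of the general idea, with a simple multiplicative mix as an assumed stand-in for the real hash:

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Map a (previous, current) token-id pair to one of `buckets` feature ids.

    The multiplicative mix below is a hypothetical stand-in; the PR only
    fixes the bucket count (2048), not the hash function.
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % buckets

ids = [bigram_bucket(a, b) for a, b in [(5, 7), (7, 9), (9, 5)]]
```

Each bucket id can then index a small learned embedding table, giving the model a cheap order-sensitive bigram signal.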
• Partial RoPE: rotary positional embeddings applied to 16 of 64 head dimensions (parameters: {"numerator":16,"denominator":64})
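A 16/64 partial RoPE rotates only a quarter of each head's dimensions and passes the rest through unchanged. The sketch below assumes the standard RoPE pairing of dimensions (2i, 2i+1) and that the rotated dims come first; the PR does not state either detail.

```python
import math

def partial_rope(x, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of a head vector, pass the rest through.

    Assumes standard RoPE: pair dims (2i, 2i+1) and rotate each pair by
    angle position / base**(2i / rot_dims). Which dims the PR rotates is
    an assumption here.
    """
    out = list(x)
    for i in range(rot_dims // 2):
        theta = position / base ** (2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

head = [1.0] * 64                     # one 64-dim attention head
rotated = partial_rope(head, position=3)
```

Because rotations are norm-preserving and the remaining 48 dims are untouched, the vector's norm is unchanged while only a position-independent subspace is left for content-only matching.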
• MLP3x: three-times MLP expansion with relu² activation (parameters: {"expansion":3})
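The MLP3x block expands the hidden dimension by 3x and applies squared ReLU in between. A minimal forward pass with plain nested-list weights (bias and normalization placement are not specified in the PR and are omitted):

```python
def relu_sq(x):
    """relu² activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """MLP with 3x expansion: d -> 3d -> d, relu² in between.

    Weights are nested lists for illustration; the PR's actual block
    (bias terms, norm placement) is not specified.
    """
    d = len(x)
    hidden = [relu_sq(sum(x[j] * w_in[i][j] for j in range(d)))
              for i in range(3 * d)]
    return [sum(hidden[i] * w_out[k][i] for i in range(3 * d))
            for k in range(d)]

# Tiny worked example with d = 2 (hidden width 6):
out = mlp3x_forward([1.0, -2.0],
                    [[1, 0], [0, 1], [1, 1], [1, -1], [0, 0], [2, 0]],
                    [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]])
```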
Optimizer
• SGD (lr: 0.002, weight_decay: null, momentum: null)
Test-Time Training
• SGD TTT (parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"max_train_chunks":50,"ema_decay":0})
Weight Averaging
• EMA (decay: 0.995)
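The EMA weight average with decay 0.995 (applied here as the weight-averaging technique, separate from the TTT phase where ema_decay is 0) is the standard exponential update:

```python
def ema_update(ema, current, decay=0.995):
    """One EMA step over parameters: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]

# Track an EMA of two parameters over three update steps.
ema = [0.0, 0.0]
for step in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

After k steps toward fixed weights w, the EMA sits at (1 - 0.995^k) * w, so it approaches the current weights with a time constant of roughly 1 / (1 - 0.995) = 200 steps.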
Compression
• zstd (level: 22)
Evaluation
• sliding window eval (stride: 64)
• stride-based eval (stride: 32)
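In a stride-based sliding-window eval, each window advances by the stride and only the newly exposed tail positions are scored, so every token is evaluated exactly once with near-full context. The sketch below computes which positions get scored; the window size of 256 is an assumption for illustration, as the PR only specifies the strides (64 and 32).

```python
def sliding_eval_positions(n_tokens, window=256, stride=64):
    """Return the token positions scored by a stride-based sliding-window eval.

    Window size is an assumed value; the PR fixes only the stride. Position 0
    is never scored since it has no preceding context.
    """
    scored = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        # score only positions not covered by an earlier window
        scored.extend(range(max(prev_end, begin + 1), end))
        prev_end = end
        if end == n_tokens:
            break
    return scored

pos = sliding_eval_positions(1000)
```

Apart from the first window, every scored position sees at least window - stride tokens of context, which is why smaller strides (32 vs 64) give slightly better but slower bpb estimates.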
Initialization
• orthogonal init (used for weights in the base architecture)
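Orthogonal initialization fills a weight matrix with orthonormal rows so that the map preserves norms at initialization. Deep-learning libraries typically build this from a QR decomposition of a Gaussian matrix; the pure-Python sketch below uses Gram-Schmidt as an illustrative stand-in.

```python
import random

def orthogonal_rows(n, dim, seed=1337):
    """Build `n` orthonormal rows of length `dim` (n <= dim).

    Modified Gram-Schmidt on Gaussian vectors; libraries usually use a QR
    decomposition instead, so this is an illustrative stand-in.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:                       # remove components along basis
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:                       # skip (rare) degenerate draws
            basis.append([x / norm for x in v])
    return basis

w = orthogonal_rows(3, 8)
```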
Regularization
• weight decay (value: 0.04)
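Weight decay at 0.04 is listed here as a separate regularizer (the optimizer card shows weight_decay: null). A minimal sketch of the coupled L2 form folded into an SGD step; whether the run couples decay into the gradient or applies it decoupled is not stated in the PR.

```python
def sgd_step_with_weight_decay(params, grads, lr=0.002, weight_decay=0.04):
    """One SGD step with coupled L2 weight decay: effective grad = g + wd * p.

    The coupled form is an assumption; a decoupled (AdamW-style) variant
    would instead subtract lr * wd * p separately from the gradient step.
    """
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, grads)]

# With zero gradients, decay alone shrinks each weight by lr * wd per step.
new = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0])
```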

Novel Contributions

  • Short TTT with SGD, no EMA, and only 50 training chunks to avoid late-chunk degradation
  • Proper zstd-22 compression to reduce artifact size
  • Disabled int8_sensitive to stay within the 16MB artifact limit
  • Maintained the same GPTQ pipeline and base architecture while slightly improving val_bpb