PR #1004 (open)

Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182)

by ibarrajo
  • val_bpb: 1.1182
  • Architecture: Transformer
  • Optimizer: AdamW
  • Artifact Size: 15,535,414 bytes

Training Techniques

  • Quantization: GPTQ (bits: 5, scope: all)
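GPTQ proper quantizes a layer column by column, compensating each rounding error using second-order (Hessian) information; as a minimal sketch of just the int5 grid this submission targets, a symmetric per-row round-to-nearest quantizer could look like the following (the function names are illustrative, not the submission's code):

```python
import numpy as np

def quantize_int5(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Round-to-nearest onto the signed 5-bit grid [-16, 15], one scale per row.

    GPTQ additionally reorders columns and folds each rounding error back
    into the not-yet-quantized weights; this sketch shows only the grid.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float weights from int5 codes and per-row scales."""
    return q.astype(np.float32) * scale
```

At 5 bits per weight plus per-row scales, 33.6M parameters land in the ballpark of the ~15.5MB artifact reported above (before pruning and zstd).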
  • Architecture:
      - BigramHash: bigram hash feature embedding (dimensions: 8192)
      - XSA: applied across all layers (layers: 11)
      - EMA: exponential moving average of weights (decay: 0.997)
      - U-Net skip connections: U-Net style skip connections (layers: 11)
      - SmearGate: SmearGate activation/component
      - Partial RoPE: partial rotary positional embeddings (dimensions: 16)
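The EMA entry (decay 0.997) amounts to keeping a shadow copy of the weights that is updated after every optimizer step and used for evaluation. A minimal sketch, with plain dicts standing in for whatever parameter container the submission actually uses:

```python
def ema_update(ema_params: dict, params: dict, decay: float = 0.997) -> None:
    """Update shadow weights in place: ema <- decay * ema + (1 - decay) * param.

    With decay = 0.997 the shadow is an average over roughly
    1 / (1 - 0.997) ~ 333 recent steps, smoothing out optimizer noise.
    """
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```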
  • Regularization:
      - magnitude pruning (pruning_rate: 0.05)
      - LN scale
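Magnitude pruning at rate 0.05, which the contributions list ties to keeping the artifact under 16MB, is presumably global smallest-|w| zeroing (runs of zeros also compress very well under zstd). A sketch under that assumption:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, rate: float = 0.05) -> np.ndarray:
    """Zero the `rate` fraction of entries with smallest magnitude.

    Uses a single global threshold; ties at the threshold may zero
    slightly more than `rate` of the weights.
    """
    k = int(round(rate * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```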
  • Evaluation: sliding window eval (stride: 64)
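Sliding-window eval with stride 64 presumably follows the standard recipe: advance a fixed context window by 64 tokens at a time and score only the tokens not yet scored, so each token is evaluated exactly once with up to (window - 64) tokens of left context. A model-free sketch of the span bookkeeping:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Return (begin, end, score_from) spans for sliding-window evaluation.

    The model sees tokens [begin, end) but only tokens [score_from, end)
    contribute to the loss, so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes.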
  • Test-Time Training: score-first TTT (chunk_tokens: 131072)
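Score-first TTT, as described, scores each 131,072-token chunk with the current weights before updating on it, so the cumulative s_0 never includes a chunk the model has already trained on. The shape of the loop, with illustrative hook names rather than the submission's actual API:

```python
def score_first_ttt(chunks, score_fn, train_fn) -> float:
    """Accumulate the s_0 score over chunks, scoring each chunk strictly
    before the test-time update on it (hence "legal": no chunk is ever
    scored by weights that trained on it)."""
    total = 0.0
    for chunk in chunks:
        total += score_fn(chunk)  # pre-update score: s_0 contribution
        train_fn(chunk)           # test-time update happens only afterward
    return total
```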
  • Compression: zstd (level: 22)

Novel Contributions

  • 33.6M parameter Transformer with int5 GPTQ compression
  • Legal score-first TTT that reports only cumulative s_0 score
  • Removal of illegal post-TTT re-evaluation and temperature calibration
  • 5% magnitude pruning to keep artifact under 16MB
  • Sliding window evaluation with stride 64
  • BigramHash, XSA across all layers, SmearGate, Partial RoPE, and EMA of weights as architectural features