PR #734

open

Non-record: Full GPTQ + XSA-4 + Score-First TTT (3-seed mean 1.1198)

by Robby955
val_bpb: 1.1198
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Quantization
GPTQ
parameters: {"bits":6,"scope":"all weights"}
Architecture
XSA
Cross-sequence attention applied to the last 4 transformer layers to extend context at evaluation time.
parameters: {"layers":4}
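The PR doesn't spell out the XSA mechanism; one plausible minimal sketch (the 12-layer depth, cached-KV design, and all names here are hypothetical, not from the PR) is to let the last 4 layers prepend keys/values cached from the previous evaluation sequence, so attention crosses sequence boundaries:

```python
import numpy as np

N_LAYERS, XSA_LAYERS, D = 12, 4, 16   # N_LAYERS and D are placeholders

def attend(q, k, v):
    """Plain single-head scaled dot-product attention."""
    s = q @ k.T / np.sqrt(D)
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

def layer_forward(layer, x, prev_kv=None):
    """In the last XSA_LAYERS layers, keys/values from the previous
    evaluation sequence are prepended, extending the effective context."""
    k = v = x
    if layer >= N_LAYERS - XSA_LAYERS and prev_kv is not None:
        k = np.concatenate([prev_kv, k])
        v = np.concatenate([prev_kv, v])
    return attend(x, k, v)

prev_seq = np.random.default_rng(0).standard_normal((8, D))
x = np.random.default_rng(1).standard_normal((8, D))
out_plain = layer_forward(0, x)                   # ordinary layer
out_xsa = layer_forward(11, x, prev_kv=prev_seq)  # XSA layer: extended context
print(out_plain.shape, out_xsa.shape)             # (8, 16) (8, 16)
```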
BigramHash
Hash-based token feature component with 3072 buckets and 128-dimensional embeddings.
parameters: {"buckets":3072,"dimensions":128}
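A minimal sketch of a hash-based bigram feature with the stated 3072 buckets and 128-dimensional embeddings; the hash function and BOS handling are assumptions, and in the model the embedding table would be learned rather than random:

```python
import numpy as np

NUM_BUCKETS = 3072   # hash buckets (from the PR parameters)
EMBED_DIM = 128      # embedding dimension per bucket

# Stand-in for a learned embedding table.
rng = np.random.default_rng(0)
bucket_embeddings = rng.standard_normal((NUM_BUCKETS, EMBED_DIM)).astype(np.float32)

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a (previous, current) token pair into one of NUM_BUCKETS buckets.
    The odd multiplicative constant is arbitrary; the real hash is unspecified."""
    return ((prev_token * 0x9E3779B1) ^ token) % NUM_BUCKETS

def bigram_features(tokens: list) -> np.ndarray:
    """Per-position bigram embedding; position 0 pairs with an assumed BOS id 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bucket_embeddings[idx]        # shape (len(tokens), EMBED_DIM)

feats = bigram_features([5, 17, 17, 9])
print(feats.shape)   # (4, 128)
```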
Partial RoPE
Rotary positional embeddings applied partially across dimensions.
parameters: {"dimensions":"16/64"}
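With dimensions 16/64, rotary embeddings touch only 16 of each head's 64 dims and the rest pass through unrotated. A sketch, assuming the standard theta=10000 base and rotate-half pairing (neither is stated in the PR):

```python
import numpy as np

HEAD_DIM = 64
ROT_DIM = 16   # only 16 of the 64 dims per head receive rotary embeddings

def partial_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Apply RoPE to the first ROT_DIM dims of each vector; pass the rest through.
    x: (seq, HEAD_DIM); positions: (seq,)."""
    half = ROT_DIM // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]              # rotate-half pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIM:]], axis=-1)

q = np.ones((8, HEAD_DIM))
out = partial_rope(q, np.arange(8))
print(out.shape)                            # (8, 64)
print(np.allclose(out[:, ROT_DIM:], 1.0))   # True: unrotated dims unchanged
```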
MLP3x
Three-times wider MLP with LeakyReLU(0.5)^2 activation.
parameters: {"multiplier":3}
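A literal reading of the activation — LeakyReLU with negative slope 0.5, then squared — inside a 3x-wide MLP. Only the 3x ratio and the activation come from the PR; the model width of 128 and the initialization are placeholders:

```python
import numpy as np

D_MODEL = 128          # hypothetical model width
HIDDEN = 3 * D_MODEL   # three-times wider hidden layer (multiplier: 3)

def act(x: np.ndarray) -> np.ndarray:
    """LeakyReLU(0.5) followed by squaring, read literally from the PR."""
    return np.where(x > 0, x, 0.5 * x) ** 2

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) / np.sqrt(D_MODEL)
w_out = rng.standard_normal((HIDDEN, D_MODEL)) / np.sqrt(HIDDEN)

def mlp3x(x: np.ndarray) -> np.ndarray:
    return act(x @ w_in) @ w_out

y = mlp3x(rng.standard_normal((4, D_MODEL)))
print(y.shape)   # (4, 128)
```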
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
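With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A minimal numpy sketch of the grouping (head dim and sequence length are placeholders):

```python
import numpy as np

N_HEADS, N_KV_HEADS, HEAD_DIM, SEQ = 8, 4, 16, 6
GROUP = N_HEADS // N_KV_HEADS   # 2 query heads share each KV head

def gqa(q, k, v):
    """Grouped-query attention.
    q: (N_HEADS, SEQ, HEAD_DIM); k, v: (N_KV_HEADS, SEQ, HEAD_DIM).
    Each KV head is repeated GROUP times to serve its query-head group."""
    k = np.repeat(k, GROUP, axis=0)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                      # (N_HEADS, SEQ, HEAD_DIM)

rng = np.random.default_rng(0)
out = gqa(rng.standard_normal((N_HEADS, SEQ, HEAD_DIM)),
          rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM)),
          rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM)))
print(out.shape)   # (8, 6, 16)
```

The KV cache shrinks by the group factor (here 2x) relative to full multi-head attention.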
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50,"blend_ratio":"50/50"}
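A toy sketch of the averaging recipe with the stated ema_decay=0.997, 50-step SWA snapshots, and 50/50 blend; the scalar "model" and its update rule are placeholders standing in for the real optimizer trajectory:

```python
import numpy as np

EMA_DECAY = 0.997        # from the PR parameters
SWA_INTERVAL = 50        # snapshot every 50 steps
BLEND = 0.5              # 50/50 EMA/SWA blend

w = np.zeros(4)          # toy 4-parameter "model"
ema = w.copy()
swa_sum, swa_count = np.zeros(4), 0

for step in range(1, 201):
    w = w + 0.01         # stand-in for an optimizer update
    ema = EMA_DECAY * ema + (1 - EMA_DECAY) * w        # running EMA
    if step % SWA_INTERVAL == 0:                       # periodic SWA snapshot
        swa_sum += w
        swa_count += 1

swa = swa_sum / swa_count
final = BLEND * ema + (1 - BLEND) * swa                # blended weights
print(final)
```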
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"epochs":3,"freeze_blocks":9,"chunk_tokens":131072}
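The "legal" score-first ordering: each chunk is scored with the current weights before the model adapts on it, so no token is ever scored after being trained on. A toy sketch with the PR's learning rate and epoch count; the model, loss, and tiny chunk size are placeholders (the real run uses 131072-token chunks and freezes the first 9 blocks):

```python
import numpy as np

CHUNK_TOKENS = 4          # PR uses 131072; tiny here for illustration
LR, EPOCHS = 1e-4, 3      # from the PR parameters

theta = 0.0               # toy scalar "model" standing in for the LM

def loss_and_grad(theta, chunk):
    err = theta - chunk
    return float((err ** 2).mean()), float(2 * err.mean())

stream = np.linspace(0.0, 1.0, 16)    # toy evaluation stream
total_loss, n_chunks = 0.0, 0
for start in range(0, len(stream), CHUNK_TOKENS):
    chunk = stream[start:start + CHUNK_TOKENS]
    # Score FIRST with current weights: the chunk never sees its own update.
    loss, _ = loss_and_grad(theta, chunk)
    total_loss += loss
    n_chunks += 1
    # THEN adapt on the chunk before scoring the next one.
    for _ in range(EPOCHS):
        _, g = loss_and_grad(theta, chunk)
        theta -= LR * g

print(total_loss / n_chunks)
```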
Compression
lzma
parameters: {"level":null}
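A sketch of the compression round trip with Python's stdlib lzma; the one-code-per-byte layout below is an assumption (the real artifact presumably bit-packs the 6-bit codes before compressing):

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for int6 quantization codes: values in [0, 64), one per byte.
codes = rng.integers(0, 64, size=100_000, dtype=np.uint8)
raw = codes.tobytes()

compressed = lzma.compress(raw, preset=9)
print(len(raw), len(compressed))

restored = np.frombuffer(lzma.decompress(compressed), dtype=np.uint8)
print(np.array_equal(restored, codes))   # True: lossless round-trip
```

Even unpacked, lzma exploits the two always-zero high bits of each byte, so the compressed artifact comes in below the raw size.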
Evaluation
sliding window eval
parameters: {"stride":64}
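Sliding-window evaluation with a short stride scores each token exactly once while giving most tokens near-full left context. A sketch of the span planning, with window 16 / stride 4 standing in for the run's 2048 / 64:

```python
def sliding_eval_spans(n_tokens, window, stride):
    """Plan a sliding-window pass: each span scores only its new tokens,
    so every token is scored exactly once with up to `window` context."""
    spans, scored_upto = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_upto))  # score tokens [scored_upto, end)
        scored_upto = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(40, window=16, stride=4)
print(spans[:3])                          # [(0, 16, 0), (4, 20, 16), (8, 24, 20)]
print(sum(e - s for _, e, s in spans))    # 40: full, non-overlapping coverage
```

The cost is one forward pass per stride, so a stride of 64 at window 2048 recomputes each token's context about 32 times in exchange for much better per-token conditioning.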
Sequence Length
sequence_length
parameters: {"train_length":2048,"eval_length":null}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
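A sketch of the schedule: constant LR, then a linear warmdown to zero over the final warmdown_iters steps. Only warmdown_iters=4000 comes from the PR; the total iteration count and base LR are placeholders:

```python
TOTAL_ITERS = 10000        # hypothetical total step count
WARMDOWN_ITERS = 4000      # from the PR parameters
BASE_LR = 1.0              # placeholder base learning rate

def lr_at(step: int) -> float:
    """Constant LR, then a linear 'warmdown' to zero over the last
    WARMDOWN_ITERS steps (the tail of a trapezoidal schedule)."""
    remaining = TOTAL_ITERS - step
    if remaining >= WARMDOWN_ITERS:
        return BASE_LR
    return BASE_LR * remaining / WARMDOWN_ITERS

print(lr_at(0), lr_at(6000), lr_at(8000), lr_at(10000))   # 1.0 1.0 0.5 0.0
```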
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Other
other
Full-Hessian GPTQ calibration on 256 batches of training data, with Cholesky error compensation, act-order, and a block size of 128.
parameters: {"calibration_batches":256,"block_size":128}
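A single-row numpy sketch of the quantization step described above — act-order, a damped full Hessian, and Cholesky-factor error compensation in the style of the GPTQ paper. Per-group scales, the 128 block size, and batched calibration are omitted for brevity; the damping factor and symmetric per-row scale are assumptions:

```python
import numpy as np

BITS = 6                       # int6 quantization (from the PR)
QMAX = 2 ** (BITS - 1) - 1     # symmetric code range [-32, 31]

def gptq_quantize_row(w, H, damp=0.01):
    """Quantize one weight row with GPTQ-style error compensation.
    w: (d,) weights; H: (d, d) Hessian proxy from calibration inputs (X X^T).
    Columns are processed in act-order (descending diag(H)); each column's
    rounding error is propagated to not-yet-quantized columns via the
    upper-triangular Cholesky factor of H^-1."""
    d = len(w)
    order = np.argsort(-np.diag(H))                     # act-order permutation
    w, H = w[order].copy(), H[np.ix_(order, order)]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d)      # dampening
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky of H^-1
    scale = np.max(np.abs(w)) / QMAX                    # per-row symmetric scale
    q = np.zeros(d)
    for i in range(d):
        q[i] = np.clip(np.round(w[i] / scale), -QMAX - 1, QMAX)
        err = (w[i] - q[i] * scale) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]              # compensate the rest
    deq = np.zeros(d)
    deq[order] = q * scale                              # undo the permutation
    return deq

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))      # toy calibration activations
H = X @ X.T
w = rng.standard_normal(8)
print(gptq_quantize_row(w, H))
```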

Novel Contributions

  • Full Hessian GPTQ with 256-batch calibration, Cholesky error compensation, act-order, and block_size=128
  • XSA on the last 4 layers for extended-context evaluation
  • SWA/EMA 50/50 blended weight averaging
  • Legal score-first test-time training protocol
  • LZMA compression for int6 weights