PR #734

open

Non-record: Full GPTQ + XSA-4 + Score-First TTT (3-seed mean 1.1198)

by Robby955
val_bpb: 1.1198
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Quantization
GPTQ
parameters: {"bits":6,"scope":"all weights"}
Architecture
XSA
Cross-sequence attention applied to the last 4 transformer layers to extend context at evaluation time.
parameters: {"layers":4}
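The PR doesn't spell out the XSA mechanism; one plausible minimal sketch (the 12-layer depth, cached-KV design, and all names here are hypothetical, not from the PR) is to let the last 4 layers prepend keys/values cached from the previous evaluation sequence, so attention crosses sequence boundaries:

```python
import numpy as np

N_LAYERS, XSA_LAYERS, D = 12, 4, 16   # N_LAYERS and D are placeholders

def attend(q, k, v):
    """Plain single-head scaled dot-product attention."""
    s = q @ k.T / np.sqrt(D)
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

def layer_forward(layer, x, prev_kv=None):
    """In the last XSA_LAYERS layers, keys/values from the previous
    evaluation sequence are prepended, extending the effective context."""
    k = v = x
    if layer >= N_LAYERS - XSA_LAYERS and prev_kv is not None:
        k = np.concatenate([prev_kv, k])
        v = np.concatenate([prev_kv, v])
    return attend(x, k, v)

prev_seq = np.random.default_rng(0).standard_normal((8, D))
x = np.random.default_rng(1).standard_normal((8, D))
out_plain = layer_forward(0, x)                   # ordinary layer
out_xsa = layer_forward(11, x, prev_kv=prev_seq)  # XSA layer: extended context
print(out_plain.shape, out_xsa.shape)             # (8, 16) (8, 16)
```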
BigramHash
Hash-based token feature component with 3072 buckets and 128-dimensional embeddings.
parameters: {"buckets":3072,"dimensions":128}
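A minimal sketch of a hash-based bigram feature with the stated 3072 buckets and 128-dimensional embeddings; the hash function and BOS handling are assumptions, and in the model the embedding table would be learned rather than random:

```python
import numpy as np

NUM_BUCKETS = 3072   # hash buckets (from the PR parameters)
EMBED_DIM = 128      # embedding dimension per bucket

# Stand-in for a learned embedding table.
rng = np.random.default_rng(0)
bucket_embeddings = rng.standard_normal((NUM_BUCKETS, EMBED_DIM)).astype(np.float32)

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a (previous, current) token pair into one of NUM_BUCKETS buckets.
    The odd multiplicative constant is arbitrary; the real hash is unspecified."""
    return ((prev_token * 0x9E3779B1) ^ token) % NUM_BUCKETS

def bigram_features(tokens: list) -> np.ndarray:
    """Per-position bigram embedding; position 0 pairs with an assumed BOS id 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bucket_embeddings[idx]        # shape (len(tokens), EMBED_DIM)

feats = bigram_features([5, 17, 17, 9])
print(feats.shape)   # (4, 128)
```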
Partial RoPE
Rotary positional embeddings applied partially across dimensions.
parameters: {"dimensions":"16/64"}
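With dimensions 16/64, rotary embeddings touch only 16 of each head's 64 dims and the rest pass through unrotated. A sketch, assuming the standard theta=10000 base and rotate-half pairing (neither is stated in the PR):

```python
import numpy as np

HEAD_DIM = 64
ROT_DIM = 16   # only 16 of the 64 dims per head receive rotary embeddings

def partial_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Apply RoPE to the first ROT_DIM dims of each vector; pass the rest through.
    x: (seq, HEAD_DIM); positions: (seq,)."""
    half = ROT_DIM // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]              # rotate-half pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIM:]], axis=-1)

q = np.ones((8, HEAD_DIM))
out = partial_rope(q, np.arange(8))
print(out.shape)                            # (8, 64)
print(np.allclose(out[:, ROT_DIM:], 1.0))   # True: unrotated dims unchanged
```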
MLP3x
Three-times wider MLP with LeakyReLU(0.5)^2 activation.
parameters: {"multiplier":3}
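A literal reading of the activation — LeakyReLU with negative slope 0.5, then squared — inside a 3x-wide MLP. Only the 3x ratio and the activation come from the PR; the model width of 128 and the initialization are placeholders:

```python
import numpy as np

D_MODEL = 128          # hypothetical model width
HIDDEN = 3 * D_MODEL   # three-times wider hidden layer (multiplier: 3)

def act(x: np.ndarray) -> np.ndarray:
    """LeakyReLU(0.5) followed by squaring, read literally from the PR."""
    return np.where(x > 0, x, 0.5 * x) ** 2

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) / np.sqrt(D_MODEL)
w_out = rng.standard_normal((HIDDEN, D_MODEL)) / np.sqrt(HIDDEN)

def mlp3x(x: np.ndarray) -> np.ndarray:
    return act(x @ w_in) @ w_out

y = mlp3x(rng.standard_normal((4, D_MODEL)))
print(y.shape)   # (4, 128)
```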
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
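With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A minimal numpy sketch of the grouping (head dim and sequence length are placeholders):

```python
import numpy as np

N_HEADS, N_KV_HEADS, HEAD_DIM, SEQ = 8, 4, 16, 6
GROUP = N_HEADS // N_KV_HEADS   # 2 query heads share each KV head

def gqa(q, k, v):
    """Grouped-query attention.
    q: (N_HEADS, SEQ, HEAD_DIM); k, v: (N_KV_HEADS, SEQ, HEAD_DIM).
    Each KV head is repeated GROUP times to serve its query-head group."""
    k = np.repeat(k, GROUP, axis=0)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                      # (N_HEADS, SEQ, HEAD_DIM)

rng = np.random.default_rng(0)
out = gqa(rng.standard_normal((N_HEADS, SEQ, HEAD_DIM)),
          rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM)),
          rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM)))
print(out.shape)   # (8, 6, 16)
```

The KV cache shrinks by the group factor (here 2x) relative to full multi-head attention.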
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50,"blend_ratio":"50/50"}
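A toy sketch of the averaging recipe with the stated ema_decay=0.997, 50-step SWA snapshots, and 50/50 blend; the scalar "model" and its update rule are placeholders standing in for the real optimizer trajectory:

```python
import numpy as np

EMA_DECAY = 0.997        # from the PR parameters
SWA_INTERVAL = 50        # snapshot every 50 steps
BLEND = 0.5              # 50/50 EMA/SWA blend

w = np.zeros(4)          # toy 4-parameter "model"
ema = w.copy()
swa_sum, swa_count = np.zeros(4), 0

for step in range(1, 201):
    w = w + 0.01         # stand-in for an optimizer update
    ema = EMA_DECAY * ema + (1 - EMA_DECAY) * w        # running EMA
    if step % SWA_INTERVAL == 0:                       # periodic SWA snapshot
        swa_sum += w
        swa_count += 1

swa = swa_sum / swa_count
final = BLEND * ema + (1 - BLEND) * swa                # blended weights
print(final)
```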
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"epochs":3,"freeze_blocks":9,"chunk_tokens":131072}
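The "legal" score-first ordering: each chunk is scored with the current weights before the model adapts on it, so no token is ever scored after being trained on. A toy sketch with the PR's learning rate and epoch count; the model, loss, and tiny chunk size are placeholders (the real run uses 131072-token chunks and freezes the first 9 blocks):

```python
import numpy as np

CHUNK_TOKENS = 4          # PR uses 131072; tiny here for illustration
LR, EPOCHS = 1e-4, 3      # from the PR parameters

theta = 0.0               # toy scalar "model" standing in for the LM

def loss_and_grad(theta, chunk):
    err = theta - chunk
    return float((err ** 2).mean()), float(2 * err.mean())

stream = np.linspace(0.0, 1.0, 16)    # toy evaluation stream
total_loss, n_chunks = 0.0, 0
for start in range(0, len(stream), CHUNK_TOKENS):
    chunk = stream[start:start + CHUNK_TOKENS]
    # Score FIRST with current weights: the chunk never sees its own update.
    loss, _ = loss_and_grad(theta, chunk)
    total_loss += loss
    n_chunks += 1
    # THEN adapt on the chunk before scoring the next one.
    for _ in range(EPOCHS):
        _, g = loss_and_grad(theta, chunk)
        theta -= LR * g

print(total_loss / n_chunks)
```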
Compression
lzma
parameters: {"level":null}
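A sketch of the compression round trip with Python's stdlib lzma; the one-code-per-byte layout below is an assumption (the real artifact presumably bit-packs the 6-bit codes before compressing):

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for int6 quantization codes: values in [0, 64), one per byte.
codes = rng.integers(0, 64, size=100_000, dtype=np.uint8)
raw = codes.tobytes()

compressed = lzma.compress(raw, preset=9)
print(len(raw), len(compressed))

restored = np.frombuffer(lzma.decompress(compressed), dtype=np.uint8)
print(np.array_equal(restored, codes))   # True: lossless round-trip
```

Even unpacked, lzma exploits the two always-zero high bits of each byte, so the compressed artifact comes in below the raw size.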
Evaluation
sliding window eval
parameters: {"stride":64}
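Sliding-window evaluation with a short stride scores each token exactly once while giving most tokens near-full left context. A sketch of the span planning, with window 16 / stride 4 standing in for the run's 2048 / 64:

```python
def sliding_eval_spans(n_tokens, window, stride):
    """Plan a sliding-window pass: each span scores only its new tokens,
    so every token is scored exactly once with up to `window` context."""
    spans, scored_upto = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_upto))  # score tokens [scored_upto, end)
        scored_upto = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(40, window=16, stride=4)
print(spans[:3])                          # [(0, 16, 0), (4, 20, 16), (8, 24, 20)]
print(sum(e - s for _, e, s in spans))    # 40: full, non-overlapping coverage
```

The cost is one forward pass per stride, so a stride of 64 at window 2048 recomputes each token's context about 32 times in exchange for much better per-token conditioning.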
Sequence Length
sequence_length
parameters: {"train_length":2048,"eval_length":null}
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
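A sketch of the schedule: constant LR, then a linear warmdown to zero over the final warmdown_iters steps. Only warmdown_iters=4000 comes from the PR; the total iteration count and base LR are placeholders:

```python
TOTAL_ITERS = 10000        # hypothetical total step count
WARMDOWN_ITERS = 4000      # from the PR parameters
BASE_LR = 1.0              # placeholder base learning rate

def lr_at(step: int) -> float:
    """Constant LR, then a linear 'warmdown' to zero over the last
    WARMDOWN_ITERS steps (the tail of a trapezoidal schedule)."""
    remaining = TOTAL_ITERS - step
    if remaining >= WARMDOWN_ITERS:
        return BASE_LR
    return BASE_LR * remaining / WARMDOWN_ITERS

print(lr_at(0), lr_at(6000), lr_at(8000), lr_at(10000))   # 1.0 1.0 0.5 0.0
```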
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Other
other
Full-Hessian GPTQ calibration on 256 batches of training data, with Cholesky error compensation, act-order, and a block size of 128.
parameters: {"calibration_batches":256,"block_size":128}
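A single-row numpy sketch of the quantization step described above — act-order, a damped full Hessian, and Cholesky-factor error compensation in the style of the GPTQ paper. Per-group scales, the 128 block size, and batched calibration are omitted for brevity; the damping factor and symmetric per-row scale are assumptions:

```python
import numpy as np

BITS = 6                       # int6 quantization (from the PR)
QMAX = 2 ** (BITS - 1) - 1     # symmetric code range [-32, 31]

def gptq_quantize_row(w, H, damp=0.01):
    """Quantize one weight row with GPTQ-style error compensation.
    w: (d,) weights; H: (d, d) Hessian proxy from calibration inputs (X X^T).
    Columns are processed in act-order (descending diag(H)); each column's
    rounding error is propagated to not-yet-quantized columns via the
    upper-triangular Cholesky factor of H^-1."""
    d = len(w)
    order = np.argsort(-np.diag(H))                     # act-order permutation
    w, H = w[order].copy(), H[np.ix_(order, order)]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d)      # dampening
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky of H^-1
    scale = np.max(np.abs(w)) / QMAX                    # per-row symmetric scale
    q = np.zeros(d)
    for i in range(d):
        q[i] = np.clip(np.round(w[i] / scale), -QMAX - 1, QMAX)
        err = (w[i] - q[i] * scale) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]              # compensate the rest
    deq = np.zeros(d)
    deq[order] = q * scale                              # undo the permutation
    return deq

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))      # toy calibration activations
H = X @ X.T
w = rng.standard_normal(8)
print(gptq_quantize_row(w, H))
```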

Novel Contributions

  • Full Hessian GPTQ with 256-batch calibration, Cholesky error compensation, act-order, and block_size=128
  • XSA on the last 4 layers for extended-context evaluation
  • SWA/EMA 50/50 blended weight averaging
  • Legal score-first test-time training protocol
  • LZMA compression for int6 weights