PR #533 (closed)

GPTQ + Short TTT — val_bpb 1.1207 (seed 1337)

by newjordan

val_bpb: 1.1207
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.60 MB

Training Techniques

Quantization
• GPTQ (bits: 6, scope: all)
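As a rough illustration of what 6-bit quantization over all weights means, the sketch below rounds values onto the signed 6-bit grid. This is only the grid itself: GPTQ additionally compensates rounding error column by column using second-order information, which is omitted here.

```python
def quantize_6bit(weights, scale=None):
    """Symmetric round-to-nearest onto a 6-bit grid (levels -32..31).

    Illustrative stand-in only: GPTQ proper also applies per-column
    error compensation, which this sketch does not implement.
    """
    qmax = 2 ** (6 - 1) - 1          # 31
    qmin = -(2 ** (6 - 1))           # -32
    if scale is None:
        scale = max(abs(w) for w in weights) / qmax
    q = [min(max(round(w / scale), qmin), qmax) for w in weights]
    deq = [v * scale for v in q]     # dequantized values for error checking
    return q, deq, scale

q, deq, s = quantize_6bit([0.5, -0.24, 0.31, -0.02])
```

With round-to-nearest, each dequantized weight lands within half a quantization step of the original, which is the baseline GPTQ improves on.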
Architecture
• XSA: used in the last 4 layers of the custom transformer architecture (parameters: {"layers":4})
• SmearGate: custom gating mechanism used in the MLP blocks
• BigramHash: bigram-hash feature with 2048 buckets (parameters: {"buckets":2048})
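The PR does not specify the hash used by the BigramHash feature, only the bucket count. A minimal sketch of the general idea, with a simple multiplicative mix as an assumed stand-in for the real hash:

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Map a (previous, current) token-id pair to one of `buckets` feature ids.

    The multiplicative mix below is a hypothetical stand-in; the PR only
    fixes the bucket count (2048), not the hash function.
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % buckets

ids = [bigram_bucket(a, b) for a, b in [(5, 7), (7, 9), (9, 5)]]
```

Each bucket id can then index a small learned embedding table, giving the model a cheap order-sensitive bigram signal.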
• Partial RoPE: rotary positional embeddings applied to 16 of 64 head dimensions (parameters: {"numerator":16,"denominator":64})
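A 16/64 partial RoPE rotates only a quarter of each head's dimensions and passes the rest through unchanged. The sketch below assumes the standard RoPE pairing of dimensions (2i, 2i+1) and that the rotated dims come first; the PR does not state either detail.

```python
import math

def partial_rope(x, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of a head vector, pass the rest through.

    Assumes standard RoPE: pair dims (2i, 2i+1) and rotate each pair by
    angle position / base**(2i / rot_dims). Which dims the PR rotates is
    an assumption here.
    """
    out = list(x)
    for i in range(rot_dims // 2):
        theta = position / base ** (2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

head = [1.0] * 64                     # one 64-dim attention head
rotated = partial_rope(head, position=3)
```

Because rotations are norm-preserving and the remaining 48 dims are untouched, the vector's norm is unchanged while only a position-independent subspace is left for content-only matching.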
• MLP3x: three-times MLP expansion with relu² activation (parameters: {"expansion":3})
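The MLP3x block expands the hidden dimension by 3x and applies squared ReLU in between. A minimal forward pass with plain nested-list weights (bias and normalization placement are not specified in the PR and are omitted):

```python
def relu_sq(x):
    """relu² activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """MLP with 3x expansion: d -> 3d -> d, relu² in between.

    Weights are nested lists for illustration; the PR's actual block
    (bias terms, norm placement) is not specified.
    """
    d = len(x)
    hidden = [relu_sq(sum(x[j] * w_in[i][j] for j in range(d)))
              for i in range(3 * d)]
    return [sum(hidden[i] * w_out[k][i] for i in range(3 * d))
            for k in range(d)]

# Tiny worked example with d = 2 (hidden width 6):
out = mlp3x_forward([1.0, -2.0],
                    [[1, 0], [0, 1], [1, 1], [1, -1], [0, 0], [2, 0]],
                    [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]])
```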
Optimizer
• SGD (lr: 0.002, weight_decay: null, momentum: null)
Test-Time Training
• SGD TTT (parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"max_train_chunks":50,"ema_decay":0})
Weight Averaging
• EMA (decay: 0.995)
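The EMA weight average with decay 0.995 (applied here as the weight-averaging technique, separate from the TTT phase where ema_decay is 0) is the standard exponential update:

```python
def ema_update(ema, current, decay=0.995):
    """One EMA step over parameters: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]

# Track an EMA of two parameters over three update steps.
ema = [0.0, 0.0]
for step in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

After k steps toward fixed weights w, the EMA sits at (1 - 0.995^k) * w, so it approaches the current weights with a time constant of roughly 1 / (1 - 0.995) = 200 steps.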
Compression
• zstd (level: 22)
Evaluation
• sliding window eval (stride: 64)
• stride-based eval (stride: 32)
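In a stride-based sliding-window eval, each window advances by the stride and only the newly exposed tail positions are scored, so every token is evaluated exactly once with near-full context. The sketch below computes which positions get scored; the window size of 256 is an assumption for illustration, as the PR only specifies the strides (64 and 32).

```python
def sliding_eval_positions(n_tokens, window=256, stride=64):
    """Return the token positions scored by a stride-based sliding-window eval.

    Window size is an assumed value; the PR fixes only the stride. Position 0
    is never scored since it has no preceding context.
    """
    scored = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        # score only positions not covered by an earlier window
        scored.extend(range(max(prev_end, begin + 1), end))
        prev_end = end
        if end == n_tokens:
            break
    return scored

pos = sliding_eval_positions(1000)
```

Apart from the first window, every scored position sees at least window - stride tokens of context, which is why smaller strides (32 vs 64) give slightly better but slower bpb estimates.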
Initialization
• orthogonal init (used for weights in the base architecture)
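Orthogonal initialization fills a weight matrix with orthonormal rows so that the map preserves norms at initialization. Deep-learning libraries typically build this from a QR decomposition of a Gaussian matrix; the pure-Python sketch below uses Gram-Schmidt as an illustrative stand-in.

```python
import random

def orthogonal_rows(n, dim, seed=1337):
    """Build `n` orthonormal rows of length `dim` (n <= dim).

    Modified Gram-Schmidt on Gaussian vectors; libraries usually use a QR
    decomposition instead, so this is an illustrative stand-in.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:                       # remove components along basis
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:                       # skip (rare) degenerate draws
            basis.append([x / norm for x in v])
    return basis

w = orthogonal_rows(3, 8)
```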
Regularization
• weight decay (value: 0.04)
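Weight decay at 0.04 is listed here as a separate regularizer (the optimizer card shows weight_decay: null). A minimal sketch of the coupled L2 form folded into an SGD step; whether the run couples decay into the gradient or applies it decoupled is not stated in the PR.

```python
def sgd_step_with_weight_decay(params, grads, lr=0.002, weight_decay=0.04):
    """One SGD step with coupled L2 weight decay: effective grad = g + wd * p.

    The coupled form is an assumption; a decoupled (AdamW-style) variant
    would instead subtract lr * wd * p separately from the gradient step.
    """
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, grads)]

# With zero gradients, decay alone shrinks each weight by lr * wd per step.
new = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0])
```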

Novel Contributions

  • Short TTT with SGD, no EMA, and only 50 training chunks to avoid late-chunk degradation
  • Proper zstd-22 compression to reduce artifact size
  • Disabled int8_sensitive to stay within the 16MB artifact limit
  • Maintained the same GPTQ pipeline and base architecture while slightly improving val_bpb