PR #589 (closed)
Record: Late Soft-Round QAT + Score-First Backward-Looking TTT — val_bpb 1.1178
by RoyiRaView on GitHub
val_bpb: 1.1178
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.75 MB
Training Techniques
Quantization
QAT
bits: 6
scope: all
Architecture
MLP3x
Three-layer MLP stack with a squared LeakyReLU activation, (LeakyReLU(0.5)(x))^2, i.e. negative slope 0.5 followed by squaring.
parameters: {"layers":3}
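Read literally, the activation is LeakyReLU with negative slope 0.5 followed by an elementwise square. A minimal stdlib sketch of that reading (function names are mine, not from the PR):

```python
def leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5
    return x if x >= 0.0 else slope * x

def mlp3x_act(x, slope=0.5):
    # squared LeakyReLU: (LeakyReLU_0.5(x))^2
    y = leaky_relu(x, slope)
    return y * y
```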
BigramHash
BigramHash component used in the model stack.
parameters: {"size":3072}
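The PR does not describe BigramHash's internals; a plausible sketch, assuming it hashes the (previous, current) token pair into a 3072-entry embedding table (the hash constant is arbitrary and purely illustrative):

```python
TABLE_SIZE = 3072

def bigram_bucket(prev_tok, cur_tok, size=TABLE_SIZE):
    # Multiplicative hash of the token bigram into [0, size).
    # The actual hash used in the PR is not specified.
    return (prev_tok * 1000003 + cur_tok) % size
```

The bucket index would select a learned embedding that is added to the model's input stream.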
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
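With dimensions=16 of total_dimensions=64, the rotation touches only the first 16 dims of each head vector and passes the remaining 48 through unchanged. A stdlib-only sketch (the pairing convention of adjacent dims is an assumption):

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dims in adjacent pairs by
    # position-dependent angles; pass the remaining dims through.
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```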
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
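The 1/sqrt(layer+1) rule shrinks each block's LayerNorm output as depth grows, e.g. layer 0 gets scale 1.0 and layer 3 gets 0.5 (0-indexing assumed):

```python
import math

def ln_scale(layer_idx):
    # layerwise LayerNorm output scale: 1/sqrt(layer+1)
    return 1.0 / math.sqrt(layer_idx + 1)
```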
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA every 50 steps"}
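Both averages can be kept as shadow copies of the weights: the EMA updated every step with decay 0.997, and SWA taking an equal-weight snapshot every 50 steps. A minimal sketch over a flat parameter dict (the real version would operate on tensors):

```python
def ema_update(shadow, params, decay=0.997):
    # exponential moving average, updated every step
    for k, p in params.items():
        shadow[k] = decay * shadow[k] + (1.0 - decay) * p

class SWA:
    # equal-weight running average, snapshotted every `frequency` steps
    def __init__(self, frequency=50):
        self.frequency = frequency
        self.count = 0
        self.avg = {}

    def maybe_snapshot(self, step, params):
        if step == 0 or step % self.frequency != 0:
            return
        self.count += 1
        for k, p in params.items():
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (p - prev) / self.count
```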
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
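With seq_len 2048 and stride 64, the usual strided scheme scores all positions of the first window and only the 64 new positions of each later window, the rest serving as context. A sketch (assuming that standard scheme) that computes the (window_start, score_start, score_end) spans:

```python
def sliding_eval_spans(n_tokens, seq_len=2048, stride=64):
    # Tile [0, n_tokens) so every token's loss is counted exactly once:
    # the first window scores its full length, later windows only the
    # tokens not yet scored.
    spans = []  # (window_start, score_start, score_end)
    prev_end = 0
    for ws in range(0, max(1, n_tokens - seq_len + stride), stride):
        we = min(ws + seq_len, n_tokens)
        spans.append((ws, prev_end, we))
        prev_end = we
        if we == n_tokens:
            break
    return spans
```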
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs":3,"grad_clip":1,"frozen_blocks":null}
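"Score-first" means each chunk contributes to the eval score with the weights as they stand, and only afterwards is trained on (3 epochs of SGD, lr 0.002, momentum 0.9, grad clip 1), so no chunk is ever scored by a model that has already seen it. A toy scalar illustration of the loop (the real version adapts transformer weights on 32768-token chunks):

```python
def score_first_ttt(chunks, w=0.0, lr=0.002, momentum=0.9, epochs=3, clip=1.0):
    # toy model: scalar w predicting each value; loss = (w - x)^2
    scores, v = [], 0.0
    for chunk in chunks:
        # 1) score the chunk BEFORE training on it
        scores.append(sum((w - x) ** 2 for x in chunk) / len(chunk))
        # 2) then adapt on that same chunk
        for _ in range(epochs):
            g = sum(2.0 * (w - x) for x in chunk) / len(chunk)
            g = max(-clip, min(clip, g))   # grad clipping
            v = momentum * v + g           # SGD momentum buffer
            w -= lr * v
    return scores, w

scores, w = score_first_ttt([[1.0] * 4, [1.0] * 4])
```

Because scoring precedes adaptation, the second chunk's score already benefits from training on the first.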
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"cosine_decay":true,"grad_clip":1}
LR Schedule
cosine decay
parameters: {"learning_rate":0.002,"applied_to":"TTT across chunks"}
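Applied across TTT chunks, the learning rate starts at 0.002 for the first chunk and follows a cosine down over the chunk sequence (decay to zero is an assumption; the PR does not state the floor):

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr=0.002):
    # cosine decay from base_lr (first chunk) to 0 (last chunk)
    t = chunk_idx / max(1, n_chunks - 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```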
Other
other
Late soft-round QAT: the forward pass keeps hard quantization, while the backward pass uses a temperature-controlled, sigmoid-interpolated soft-round surrogate to supply gradients.
parameters: {"tau":0.1,"warmdown_scale_threshold":0.02}
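A sketch of one plausible reading: the forward pass rounds hard to 6-bit levels, while the backward pass differentiates a sigmoid-interpolated soft round at temperature tau=0.1. The exact parameterization is my assumption, as is the reading that warmdown_scale_threshold=0.02 gates when the "late" phase turns on:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_round(x, tau=0.1):
    # sigmoid-interpolated surrogate for round(); as tau -> 0 it
    # approaches hard rounding (one assumed parameterization)
    f = math.floor(x)
    return f + sigmoid((x - f - 0.5) / tau)

def soft_round_grad(x, tau=0.1):
    # derivative of soft_round: peaks at bin boundaries (frac = 0.5),
    # giving bin-aware gradients near quantization edges
    f = math.floor(x)
    s = sigmoid((x - f - 0.5) / tau)
    return s * (1.0 - s) / tau

def qat_forward_backward(w, scale, tau=0.1, bits=6):
    # forward: hard 6-bit quantization of w/scale
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    w_hat = q * scale
    # backward: surrogate gradient d(w_hat)/d(w) from the soft round
    grad = soft_round_grad(w / scale, tau)
    return w_hat, grad
```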
Novel Contributions
- Late Soft-Round QAT
- Score-First Backward-Looking TTT
- Temperature-controlled soft-round surrogate for bin-aware gradients near quantization boundaries
- Backward-looking chunk-wise test-time training where each chunk is scored before being trained on