PR #1364
openRecord: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)
by stukenov
val_bpb
1.1025
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,985,137 bytes
Training Techniques
Test-Time Training
full TTT
parameters: {"epochs":6,"freeze_first_blocks":2,"learning_rate_start":0.0005,"learning_rate_end":0.00005}
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"TTT"}
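The TTT recipe above is fully specified by its parameters: 6 epochs of AdamW (weight_decay 0.04) with the first 2 blocks frozen and the learning rate annealed from 5e-4 to 5e-5. A minimal sketch of that schedule and freeze mask, assuming a linear decay and taking the block count of 11 from the layers parameter listed under Architecture (the helper names are mine, not the PR's):

```python
# Sketch of the TTT schedule from the parameters above (hypothetical helpers).
# Assumptions: 11 blocks (from the Architecture section), linear LR decay.

N_BLOCKS = 11
FREEZE_FIRST = 2               # freeze_first_blocks
EPOCHS = 6
LR_START, LR_END = 5e-4, 5e-5  # learning_rate_start / learning_rate_end

def trainable(block_idx: int) -> bool:
    """Blocks 0 .. FREEZE_FIRST-1 stay frozen during TTT."""
    return block_idx >= FREEZE_FIRST

def lr_at(epoch: int) -> float:
    """Linearly interpolate from LR_START to LR_END across the epochs."""
    t = epoch / max(EPOCHS - 1, 1)
    return LR_START + t * (LR_END - LR_START)

print([round(lr_at(e), 6) for e in range(EPOCHS)])
print(sum(trainable(b) for b in range(N_BLOCKS)), "blocks trainable")
```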
Quantization
GPTQ
bits: 6
scope: all
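For context on what "bits: 6" buys, here is a minimal symmetric round-to-nearest 6-bit quantizer. This is deliberately not GPTQ itself: GPTQ additionally compensates each column's rounding error using second-order (Hessian) information, which is what the "full Hessian" claim below refers to.

```python
# Minimal stand-in for 6-bit weight quantization (NOT GPTQ: no Hessian-based
# error compensation). Shows the value range and reconstruction error of 6 bits.

def quant6(weights):
    """Quantize a list of floats to signed 6-bit ints plus a scale."""
    qmax = 2 ** (6 - 1) - 1                    # 31 for signed 6-bit
    m = max(abs(w) for w in weights)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.31, 0.02]
q, s = quant6(w)
w_hat = dequant(q, s)
print(q, s)
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # bounded by scale / 2
```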
Compression
lzma
level: null
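The artifact is the serialized quantized weights compressed with lzma ("level: null" means the library default preset). A self-contained sketch with a dummy payload:

```python
import lzma

# Sketch of the final compression stage: serialized quantized weights are
# packed with lzma at the library's default preset. The payload here is a
# dummy repetitive byte string, not the actual artifact.
payload = bytes(range(256)) * 64
packed = lzma.compress(payload)        # level: null -> default preset
restored = lzma.decompress(packed)
print(len(payload), "->", len(packed), "bytes")
```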
Evaluation
sliding window eval
parameters: {"stride":64}
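With stride-64 sliding-window evaluation, each window scores only its final 64 tokens so that every scored token gets near-maximal left context. A sketch of the window bookkeeping, assuming a context window of 256 (the PR only states the stride; the window size here is an assumption):

```python
# Sliding-window eval bookkeeping: score only the last STRIDE tokens of each
# window. WINDOW = 256 is an assumed context length; the PR specifies stride 64.

WINDOW, STRIDE = 256, 64

def windows(n_tokens):
    """Return (start, end, score_from) triples covering tokens [0, n_tokens)."""
    out = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + STRIDE - WINDOW)
        end = min(pos + STRIDE, n_tokens)
        out.append((start, end, pos))      # tokens [pos, end) are scored
        pos = end
    return out

spans = windows(300)
scored = sum(end - score_from for _, end, score_from in spans)
print(spans)
print(scored, "tokens scored")
```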
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: {"every_steps":50}
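The two averaging passes compose as follows: an EMA with decay 0.997 updated every step, and a "tight" SWA that averages a snapshot every 50 steps. A scalar-weight sketch of both running averages (real weights are tensors; the update rules are the same):

```python
# Sketch of EMA (decay 0.997, every step) plus tight SWA (snapshot every 50
# steps). Weights are scalars here purely for illustration.

EMA_DECAY, SWA_EVERY = 0.997, 50

def run(weights_per_step):
    ema, swa, n_snap = weights_per_step[0], 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        ema = EMA_DECAY * ema + (1.0 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:
            n_snap += 1
            swa += (w - swa) / n_snap      # running mean of snapshots
    return ema, swa, n_snap

ema, swa, n_snap = run([1.0] * 200)
print(ema, swa, n_snap)
```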
Architecture
QK-Gain
Per-head QK gain scaling
parameters: {"gain":4}
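One plausible reading of "per-head QK gain scaling" with gain 4.0 is a per-head multiplier on the attention logits on top of the usual 1/sqrt(d_head) factor; the sketch below assumes that reading (single head, pure Python, the gain fixed rather than learned):

```python
import math

# Assumed interpretation of QK-Gain: attention logits are multiplied by a
# per-head gain (4.0 here) in addition to the standard 1/sqrt(d_head) scaling.

GAIN = 4.0
D_HEAD = 64

def attn_logit(q, k, gain=GAIN):
    dot = sum(a * b for a, b in zip(q, k))
    return gain * dot / math.sqrt(len(q))

q = [0.1] * D_HEAD
k = [0.2] * D_HEAD
print(attn_logit(q, k))
```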
XSA
XSA applied to all layers
parameters: {"layers":11}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"mlp_multiplier":3}
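A sketch of the activation and hidden width implied by these parameters, assuming "LeakyReLU squared" means squaring the LeakyReLU output (the negative slope is not stated in the PR; 0.01 below is an assumption):

```python
# LeakyReLU-squared activation with a 3x MLP hidden multiplier.
# SLOPE = 0.01 is an assumed negative slope, not stated in the PR.

SLOPE = 0.01
MLP_MULT = 3            # mlp_multiplier: hidden = 3 * d_model

def lrelu_sq(x: float) -> float:
    y = x if x >= 0 else SLOPE * x
    return y * y

def hidden_dim(d_model: int) -> int:
    return MLP_MULT * d_model

print(lrelu_sq(2.0), lrelu_sq(-2.0), hidden_dim(64))
```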
GQA
Grouped query attention
parameters: {"layers":11,"heads":8,"kv_heads":4}
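With 8 query heads over 4 KV heads, each KV head serves 8 / 4 = 2 query heads. A sketch of that head-to-group mapping:

```python
# GQA head layout from the parameters above: 8 query heads share 4 KV heads,
# so each KV head serves heads // kv_heads = 2 query heads.

HEADS, KV_HEADS = 8, 4

def kv_head_for(q_head: int) -> int:
    group = HEADS // KV_HEADS          # query heads per KV head
    return q_head // group

print([kv_head_for(h) for h in range(HEADS)])
```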
RoPE
Partial rotary positional embedding
parameters: {"dimensions":16,"total_dimensions":64}
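Partial RoPE here rotates only the first 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A sketch, assuming the standard RoPE pairing and a base of 10000 (the frequency schedule is an assumption, not taken from the PR):

```python
import math

# Partial rotary embedding: rotate the first ROT_DIMS of HEAD_DIM dimensions
# in pairs, identity on the rest. base=10000.0 is an assumed frequency base.

ROT_DIMS, HEAD_DIM = 16, 64

def partial_rope(x, pos, base=10000.0):
    out = list(x)
    for i in range(0, ROT_DIMS, 2):
        theta = pos / base ** (i / ROT_DIMS)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out

x = [1.0] * HEAD_DIM
y = partial_rope(x, pos=3)
print(y[ROT_DIMS:] == x[ROT_DIMS:])   # un-rotated tail is untouched
```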
VE128
VE128 used in selected layers
parameters: {"layers":[9,10]}
SmearGate
SmearGate with BigramHash embedding
parameters: {"embedding_size":"2048x128"}
BigramHash
Bigram hash embedding
parameters: {"embedding_size":"2048x128"}
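The 2048x128 table means each (previous token, current token) pair is hashed into one of 2048 rows of a 128-dimensional embedding, which SmearGate then consumes. A sketch of the bucketing step only; the mixing constants below are assumptions, not the PR's actual hash:

```python
# Bigram hash bucketing into a 2048x128 table: hash (prev_tok, tok) to one of
# ROWS embedding rows. The multiplicative constants are illustrative only.

ROWS, DIM = 2048, 128

def bigram_bucket(prev_tok: int, tok: int) -> int:
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h >> 7) % ROWS

tokens = [5, 17, 17, 3]
buckets = [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
print(buckets)
```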
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer+1)"}
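The rule sets the LayerNorm scale of layer L to 1/sqrt(L+1), damping contributions from deeper layers. A one-function sketch:

```python
import math

# LN scale rule from the parameters above: layer L's norm scale is
# 1 / sqrt(L + 1), so deeper layers start with smaller scales.

def ln_scale(layer: int) -> float:
    return 1.0 / math.sqrt(layer + 1)

print([round(ln_scale(l), 4) for l in range(4)])
```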
Novel Contributions
- Pre-quantization AdamW test-time training applied to the full-precision EMA model, ahead of GPTQ
- TTT-adapted weights quantize cleanly under 6-bit full-Hessian GPTQ
- Record val_bpb of 1.1025, averaged over 3 seeds
- Combination of pre-quant TTT, QK-Gain 4.0, and full-Hessian GPTQ