PR #1376 (open)

Record: SLOT-24 + Pre-quant TTT — val_bpb 0.7094 (3-seed mean)

by stukenov
val_bpb: 0.7094
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,930,472 bytes

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":6,"freeze_first_blocks":2,"learning_rate":0.0005}
score-first TTT
parameters: {"steps":24,"learning_rate_start":0.024,"learning_rate_min":0.001,"stride":96}
Quantization
GPTQ
bits: 6
scope: all layers (full Hessian)
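
For reference, a simplified per-layer GPTQ outline (single scale, no grouping or column reordering), with the full Hessian accumulated from calibration inputs; this is the generic algorithm, not the submission's implementation:

```python
import torch

def gptq_layer(W, X, bits=6):
    """Quantize W (out x in) column by column, spreading each column's
    quantization error over the not-yet-quantized columns via H^-1.
    X (n x in) holds calibration inputs to this layer."""
    H = X.T @ X                                              # full Hessian (up to a constant)
    H += 0.01 * H.diagonal().mean() * torch.eye(H.size(0))   # dampening
    U = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    W = W.clone()
    qmax = 2 ** (bits - 1) - 1                               # int6 -> [-32, 31]
    scale = W.abs().max() / qmax
    Q = torch.empty_like(W)
    for j in range(W.size(1)):
        Q[:, j] = (W[:, j] / scale).round().clamp(-qmax - 1, qmax)
        err = (W[:, j] - Q[:, j] * scale) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]
    return Q, scale
```

At 6 bits, weights round into [-32, 31]; the packed integer payload then feeds the lzma stage below.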
Compression
lzma
level: null
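
The artifact is presumably the lzma-compressed stream of the packed int6 weights; with `level: null`, the Python standard library would fall back to its default preset. A minimal sketch:

```python
import lzma

def compress_artifact(packed_int6: bytes) -> bytes:
    # level null in the config -> default preset of the xz container
    return lzma.compress(packed_int6)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```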
Evaluation
sliding window eval
parameters: {"stride":96}
Architecture
BigramHash
Bigram hash embedding used in the base architecture
parameters: {"vocab_size":1536,"dimension":128}
XSA
XSA enabled in all layers
parameters: {"layers":"all"}
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP
parameters: {"slope":0.5,"mlp_multiplier":3}
Partial RoPE
Partial rotary positional embedding
parameters: {"partial":"16/64"}
VE128
VE128 component in the architecture
parameters: null
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
Regularization
LN scale
parameters: null

Novel Contributions

  • Per-sample SLOT-24 optimization with frozen model weights
  • Pre-quant AdamW test-time training before GPTQ quantization
  • Stride-96 SLOT evaluation to reduce the number of evaluation windows and fit more per-window optimization within the compute budget
  • Full Hessian GPTQ int6 with lzma compression
  • Record 3-seed mean val_bpb of 0.7094