PR #1410 (open)

Record: 11L LatentMask TTT + GPTQ + Product-Key Bigram + Brotli — val_bpb 1.1158 (3-seed mean)

val_bpb: 1.1158
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,989,386 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0008,"chunk_size":65536,"epochs":4,"momentum":0.9}
Quantization
GPTQ
bits: 6
scope: MLP/attention weights
int8
bits: 8
scope: embeddings
Architecture
BigramHash
Product-key bigram embedding using factored previous/current embeddings with no hash collisions and no projection layer.
parameters: {"prev_dim":1024,"cur_dim":1024,"embed_dim":512}
Gated Attention
GatedAttention on the even-indexed layers (0, 2, 4, 6, 8, 10); standard attention on the remaining odd layers.
parameters: {"layers":[0,2,4,6,8,10]}
XSA
Exclusive Self-Attention used in all 11 layers of the model.
parameters: {"layers":11}
U-Net skip connections
U-Net style encoder-decoder skip connections in the transformer.
parameters: null
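A sketch of the skip wiring, assuming mirrored first-half/second-half pairing with plain addition (some variants use learned skip weights instead).

```python
def unet_forward(x, blocks):
    """U-Net style transformer pass: outputs of the first-half blocks are
    stashed and added back, last-in first-out (i.e. mirrored), before the
    second-half blocks. With an odd depth like 11, the middle block gets
    no skip."""
    half = len(blocks) // 2
    stash = []
    for i, block in enumerate(blocks):
        if i >= len(blocks) - half and stash:
            skip = stash.pop()                      # mirrored encoder activation
            x = [a + b for a, b in zip(x, skip)]
        x = block(x)
        if i < half:
            stash.append(list(x))
    return x
```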
SmearGate
Adjacent token mixing via SmearGate.
parameters: null
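One plausible reading of SmearGate, sketched with a scalar gate; in practice the gate is presumably learned, and likely per channel.

```python
import math

def smear_gate(tokens, gate_logit=0.0):
    """Adjacent token mixing: y_t = x_t + sigmoid(g) * x_{t-1}, with the
    first token passed through unchanged. Each token "smears" part of its
    left neighbour into itself, gated by a sigmoid."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(tokens[0])]
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```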
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"negative_slope":0.5,"squared":true}
weight tying
Tied input and output embeddings.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
logit softcap
parameters: {"value":30}
Compression
Brotli
level: 11
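A sketch of the serialization path described in the contributions list. The exact log-scale uint8 grid layout is a guess; the resulting compact codes are what Brotli then compresses at quality 11.

```python
import math

def quantize_log_u8(ws, eps=1e-8):
    """Encode each weight as a sign plus an 8-bit code on a per-tensor
    log-magnitude grid (layout guessed from the PR's 'uint8 log-scale
    quantization'). The byte stream is then Brotli-compressed at
    quality 11."""
    mags = [max(abs(w), eps) for w in ws]
    lo = math.log(min(mags))
    span = (math.log(max(mags)) - lo) or 1.0
    codes = [round(255 * (math.log(m) - lo) / span) for m in mags]
    signs = [1.0 if w >= 0.0 else -1.0 for w in ws]
    return codes, signs, lo, span

def dequantize_log_u8(codes, signs, lo, span):
    """Invert the log-grid encoding back to float weights."""
    return [s * math.exp(lo + span * c / 255) for c, s in zip(codes, signs)]
```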
Other
LatentMask TTT
Per-channel sigmoid masks and biases trained per chunk at evaluation time using a sign-based Muon-lite optimizer.
parameters: {"score_first":true}

Novel Contributions

  • LatentMask TTT with per-channel sigmoid masks and biases trained at evaluation time
  • Product-Key Bigram embedding replacing hash-based bigram embeddings
  • Alternating GatedAttention layers to reduce parameters while improving bpb
  • Brotli-11 custom serialization with uint8 log-scale quantization for artifact compression
  • Full Hessian GPTQ with Cholesky error compensation and column reordering
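The GPTQ error-compensation loop, sketched for a single weight row. Column reordering by Hessian diagonal and the Cholesky kernel are noted in comments rather than implemented; the explicit rank-1 inverse updates below are the mathematically equivalent slow path.

```python
def gptq_row(w, Hinv, bits=6):
    """GPTQ/OBQ quantization of one weight row, given the inverse Hessian
    of the layer inputs. Each column is rounded in turn and its rounding
    error is spread over the not-yet-quantized columns through Hinv.
    The real implementation also reorders columns by decreasing Hessian
    diagonal and uses a Cholesky factorization of Hinv instead of these
    explicit rank-1 inverse downdates."""
    d = len(w)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax   # symmetric per-row grid (scheme assumed)
    w = list(w)
    Hinv = [row[:] for row in Hinv]
    q = [0.0] * d
    for j in range(d):
        q[j] = scale * max(-qmax - 1, min(qmax, round(w[j] / scale)))
        err = (w[j] - q[j]) / Hinv[j][j]
        for k in range(j, d):
            w[k] -= err * Hinv[k][j]        # later columns absorb the rounding error
        for a in range(j + 1, d):           # downdate the inverse for the remaining columns
            for b in range(j + 1, d):
                Hinv[a][b] -= Hinv[a][j] * Hinv[j][b] / Hinv[j][j]
    return q
```

With an identity inverse Hessian the compensation terms vanish and the loop reduces to plain rounding, which is a handy sanity check.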