PR #1410 (open)

Record: 11L LatentMask TTT + GPTQ + Product-Key Bigram + Brotli — val_bpb 1.1158 (3-seed mean)

val_bpb: 1.1158
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,989,386 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0008,"chunk_size":65536,"epochs":4,"momentum":0.9}
Quantization
GPTQ
bits: 6
scope: MLP/attention weights
int8
bits: 8
scope: embeddings
Architecture
BigramHash
Product-key bigram embedding using factored previous/current embeddings with no hash collisions and no projection layer.
parameters: {"prev_dim":1024,"cur_dim":1024,"embed_dim":512}
Gated Attention
GatedAttention on the even-indexed layers (0, 2, 4, 6, 8, 10); standard attention on the remaining odd layers.
parameters: {"layers":[0,2,4,6,8,10]}
XSA
Exclusive Self-Attention used in all 11 layers of the model.
parameters: {"layers":11}
U-Net skip connections
U-Net style encoder-decoder skip connections in the transformer.
parameters: null
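A sketch of the skip wiring, assuming mirrored first-half/second-half pairing with plain addition (some variants use learned skip weights instead).

```python
def unet_forward(x, blocks):
    """U-Net style transformer pass: outputs of the first-half blocks are
    stashed and added back, last-in first-out (i.e. mirrored), before the
    second-half blocks. With an odd depth like 11, the middle block gets
    no skip."""
    half = len(blocks) // 2
    stash = []
    for i, block in enumerate(blocks):
        if i >= len(blocks) - half and stash:
            skip = stash.pop()                      # mirrored encoder activation
            x = [a + b for a, b in zip(x, skip)]
        x = block(x)
        if i < half:
            stash.append(list(x))
    return x
```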
SmearGate
Adjacent token mixing via SmearGate.
parameters: null
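One plausible reading of SmearGate, sketched with a scalar gate; in practice the gate is presumably learned, and likely per channel.

```python
import math

def smear_gate(tokens, gate_logit=0.0):
    """Adjacent token mixing: y_t = x_t + sigmoid(g) * x_{t-1}, with the
    first token passed through unchanged. Each token "smears" part of its
    left neighbour into itself, gated by a sigmoid."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(tokens[0])]
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```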
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"negative_slope":0.5,"squared":true}
weight tying
Tied input and output embeddings.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
logit softcap
parameters: {"value":30}
Compression
Brotli
level: 11
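A sketch of the serialization path described in the contributions list. The exact log-scale uint8 grid layout is a guess; the resulting compact codes are what Brotli then compresses at quality 11.

```python
import math

def quantize_log_u8(ws, eps=1e-8):
    """Encode each weight as a sign plus an 8-bit code on a per-tensor
    log-magnitude grid (layout guessed from the PR's 'uint8 log-scale
    quantization'). The byte stream is then Brotli-compressed at
    quality 11."""
    mags = [max(abs(w), eps) for w in ws]
    lo = math.log(min(mags))
    span = (math.log(max(mags)) - lo) or 1.0
    codes = [round(255 * (math.log(m) - lo) / span) for m in mags]
    signs = [1.0 if w >= 0.0 else -1.0 for w in ws]
    return codes, signs, lo, span

def dequantize_log_u8(codes, signs, lo, span):
    """Invert the log-grid encoding back to float weights."""
    return [s * math.exp(lo + span * c / 255) for c, s in zip(codes, signs)]
```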
Other
LatentMask TTT
Per-channel sigmoid masks and biases trained per chunk at evaluation time using a sign-based Muon-lite optimizer.
parameters: {"score_first":true}

Novel Contributions

  • LatentMask TTT with per-channel sigmoid masks and biases trained at evaluation time
  • Product-Key Bigram embedding replacing hash-based bigram embeddings
  • Alternating GatedAttention layers to reduce parameters while improving bpb
  • Brotli-11 custom serialization with uint8 log-scale quantization for artifact compression
  • Full Hessian GPTQ with Cholesky error compensation and column reordering
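The GPTQ error-compensation loop, sketched for a single weight row. Column reordering by Hessian diagonal and the Cholesky kernel are noted in comments rather than implemented; the explicit rank-1 inverse updates below are the mathematically equivalent slow path.

```python
def gptq_row(w, Hinv, bits=6):
    """GPTQ/OBQ quantization of one weight row, given the inverse Hessian
    of the layer inputs. Each column is rounded in turn and its rounding
    error is spread over the not-yet-quantized columns through Hinv.
    The real implementation also reorders columns by decreasing Hessian
    diagonal and uses a Cholesky factorization of Hinv instead of these
    explicit rank-1 inverse downdates."""
    d = len(w)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax   # symmetric per-row grid (scheme assumed)
    w = list(w)
    Hinv = [row[:] for row in Hinv]
    q = [0.0] * d
    for j in range(d):
        q[j] = scale * max(-qmax - 1, min(qmax, round(w[j] / scale)))
        err = (w[j] - q[j]) / Hinv[j][j]
        for k in range(j, d):
            w[k] -= err * Hinv[k][j]        # later columns absorb the rounding error
        for a in range(j + 1, d):           # downdate the inverse for the remaining columns
            for b in range(j + 1, d):
                Hinv[a][b] -= Hinv[a][j] * Hinv[j][b] / Hinv[j][j]
    return q
```

With an identity inverse Hessian the compensation terms vanish and the loop reduces to plain rounding, which is a handy sanity check.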