PR #1303

open

Record: SLOT + QK-Gain 4.0 + XSA-11 — val_bpb 0.9462 (3-seed mean)

by anthony-maio
val_bpb
0.9462
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7-15.8 MB

Training Techniques

Architecture
QK-Gain
Learned per-head scaling applied to attention queries.
parameters: {"version":4}
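The PR names QK-Gain but does not show where the gain enters; a minimal sketch, assuming the gain multiplies the queries before the dot product (the `version: 4` parameter's meaning is not specified):

```python
import numpy as np

def qk_gain_attention(q, k, v, gain):
    """Scaled dot-product attention with a per-head query gain.

    q, k, v: (heads, seq, dim) arrays; gain: (heads,) learned scale.
    Applying the gain to the queries is an assumption -- the PR only
    names the technique.
    """
    q = q * gain[:, None, None]                       # per-head query scaling
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                  # softmax over keys
    return w @ v
```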
XSA
Expanded XSA applied across all layers.
parameters: {"layers":11}
BigramHash
Bigram hash embeddings, with the hash table size reduced to fit the artifact size budget.
parameters: {"size":1024}
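A sketch of hashed-bigram embedding lookup with the PR's 1024-entry table; the hash mixing constant and zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

TABLE_SIZE = 1024  # reduced hash size from the PR's parameters

def bigram_hash_embed(tokens, table):
    """Look up a hashed-bigram embedding for each position.

    tokens: (seq,) int array; table: (TABLE_SIZE, dim) embedding table.
    Hashes the (previous token, current token) pair into the table.
    """
    prev = np.concatenate(([0], tokens[:-1]))         # previous token; 0 pads position 0
    h = (prev * 1000003 + tokens) % TABLE_SIZE        # cheap bigram hash (illustrative)
    return table[h]                                   # (seq, dim)
```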
LeakyReLU
LeakyReLU-based MLP activation.
parameters: {"power":2}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
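With 8 query heads over 4 KV heads, each KV head serves 8 // 4 = 2 query heads. A minimal numpy sketch of that sharing (the head-grouping order is the common convention, not confirmed by the PR):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: 8 query heads share 4 KV heads.

    q: (8, seq, dim); k, v: (4, seq, dim). Each KV head is repeated to
    serve a contiguous group of query heads.
    """
    group = q.shape[0] // k.shape[0]                  # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)                   # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                  # softmax over keys
    return w @ v                                      # (8, seq, dim)
```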
Partial RoPE
Rotary positional embeddings applied to only a fraction of each head's dimensions.
parameters: {"train_fraction":16,"total_fraction":64}
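Reading the parameters as "rotate 16 of 64 head dimensions" is an assumption; under that reading, a sketch where the remaining dimensions pass through unrotated:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to only the first `rot_dims` of 64 head dims.

    x: (seq, 64) per-head activations. The split-halves rotation layout
    and frequency base are the standard RoPE convention, assumed here.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)         # (half,) rotation frequencies
    angles = np.arange(seq)[:, None] * freqs[None]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```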
SmearGate
Learned gate that smears (blends) each token's representation with the previous token's.
parameters: null
U-Net skip connections
U-Net style skip connections in the transformer.
parameters: null
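A sketch of U-Net style skips over a transformer layer stack: first-half outputs are stashed and added back, last-in first-out, to second-half inputs. This pairing scheme is the common one and is not confirmed by the PR:

```python
import numpy as np

def unet_transformer_pass(x, layers):
    """Forward pass with U-Net style skip connections.

    `layers` is any list of callables x -> x. Outputs of the first half
    of the stack are added back (LIFO) before each second-half layer.
    """
    half = len(layers) // 2
    skips = []
    for layer in layers[:half]:
        x = layer(x)
        skips.append(x)                               # stash encoder-side activations
    for layer in layers[half:]:
        if skips:
            x = x + skips.pop()                       # fuse the matching skip
        x = layer(x)
    return x
```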
Test-Time Training
score-first TTT
parameters: {"steps":16,"learning_rate":0.008,"min_learning_rate":0.0008}
Evaluation
sliding window eval
parameters: {"stride":64}
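Sliding-window evaluation with stride 64 scores each token once while giving later tokens near-maximal left context. A sketch of the window bookkeeping; the 1024-token context size is an assumption, as the PR only states the stride:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    """Enumerate (context_start, end, n_scored) evaluation windows.

    The first window scores all its tokens; each later window advances by
    `stride` and scores only the newly covered tokens, so every token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))    # tokens scored this window
        prev_end = end
        if end == n_tokens:
            break
    return spans
```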
Compression
lzma
level: null
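Since the compression level is listed as null, a minimal sketch of the artifact packing step using `lzma` defaults:

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress serialized model weights with LZMA.

    Preset/filters are unspecified in the PR, so stdlib defaults are used.
    """
    return lzma.compress(raw)
```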
Quantization
late QAT
bits: 6
scope: all
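"Late QAT" presumably means training finishes with fake quantization in the forward pass; the symmetric per-tensor scaling below is an assumption, with only the 6-bit width taken from the PR:

```python
import numpy as np

def fake_quant6(w):
    """Symmetric per-tensor fake quantization to 6 bits (levels -31..31).

    Rounds weights to the quantization grid and dequantizes, as in a
    QAT forward pass (the straight-through gradient is omitted here).
    """
    qmax = 2 ** (6 - 1) - 1                           # 31 for 6-bit signed
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                                  # dequantized weights
```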
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997}
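A sketch of the weight-averaging pieces: an EMA with the PR's decay of 0.997, plus one natural reading of "Tight SWA" as a plain mean over a short final window of checkpoints. How the two averages are combined is not specified:

```python
import numpy as np

class WeightEMA:
    """Exponential moving average of model weights (decay from the PR)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = [p.astype(float).copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p               # shadow <- d*shadow + (1-d)*p

def tight_swa(checkpoints):
    """Average a short ("tight") window of late checkpoints, weight-wise."""
    return [np.mean(ws, axis=0) for ws in zip(*checkpoints)]
```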
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
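Muon applies momentum and then orthogonalizes the update via a Newton-Schulz iteration. A sketch with the quintic coefficients from the public Muon implementation; the lr and momentum values are assumptions, since the PR lists them as null (only weight_decay 0.04 is stated):

```python
import numpy as np

def orthogonalize_ns(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315                 # published Muon coefficients
    x = g / (np.linalg.norm(g) + eps)                 # normalize spectral scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                                       # iterate on the wide orientation
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * m @ m) @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum-accumulate, orthogonalize, decayed step."""
    buf = momentum * buf + g
    update = orthogonalize_ns(momentum * buf + g)     # Nesterov-style lookahead
    w = w * (1 - lr * weight_decay) - lr * update     # decoupled weight decay
    return w, buf
```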
Regularization
LN scale
parameters: null

Novel Contributions

  • SLOT-16 scored-position learned output tuning with per-sample hidden delta and logit bias
  • QK-Gain 4.0 per-head query scaling
  • XSA expanded to all 11 layers
  • Improved sliding-window baseline combined with test-time SLOT optimization
  • Artifact fitting via reduced BigramHash size and lzma compression