PR #1263 (open)

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)

val_bpb: 0.9354
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Architecture
  • LeakyReLU: LeakyReLU(0.5) squared MLP activation with 3x expansion; parameters: {"negative_slope":0.5,"squared":true,"mlp_expansion":3}
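A minimal sketch of the per-element activation, assuming the conventional definition of a squared LeakyReLU (function names here are illustrative, not from the PR):

```python
def leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    return x if x >= 0.0 else negative_slope * x

def act(x: float) -> float:
    """LeakyReLU(0.5)^2: squaring keeps outputs non-negative while the
    0.5 slope preserves gradient signal for negative inputs."""
    y = leaky_relu(x)
    return y * y

print(act(2.0))   # 4.0
print(act(-2.0))  # (0.5 * -2.0)^2 = 1.0
```

Inside the MLP block this would sit between the 3x-expansion up-projection and the down-projection.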
  • SmearGate: embedding augmentation using SmearGate; parameters: null
  • BigramHash: embedding augmentation using BigramHash; parameters: null
  • XSA: cross-sequence attention applied on all 11 layers; parameters: {"layers":11}
  • GQA: grouped query attention with 8 query heads and 4 KV heads; parameters: {"heads":8,"kv_heads":4}
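With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A sketch of the standard GQA head-to-group mapping (the index convention is an assumption, not taken from the PR):

```python
def kv_head_for(q_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to the KV head it shares under GQA."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return q_head // group_size

print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the KV cache without reducing the number of query heads.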
  • weight tying: tied input embeddings and output projection; parameters: null
  • RoPE: rotary positional embeddings; parameters: null
  • U-Net skip connections: U-Net style skip connections with learned skip weights; parameters: null
  • logit softcap: softcapped logits with scale 30.0; parameters: {"softcap":30}
  • QK-Gain: attention QK gain initialized to 4.0; parameters: {"init":4}
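The PR records only the initialization value; one plausible reading is a learnable per-head gain multiplying the QK attention logits, initialized to 4.0. A hypothetical sketch under that assumption (placement and learnability are not confirmed by the PR):

```python
def init_qk_gains(n_heads: int = 8, init: float = 4.0) -> list[float]:
    """One (hypothetically learnable) gain per attention head."""
    return [init] * n_heads

def scaled_score(raw_qk: float, gain: float, head_dim: int = 64) -> float:
    """Attention logit: gain * (q . k) / sqrt(head_dim)."""
    return gain * raw_qk / head_dim ** 0.5

gains = init_qk_gains()
print(scaled_score(8.0, gains[0]))  # 4.0 * 8.0 / 8.0 = 4.0
```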
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"adam_for_scalars_embeddings":true})
  • AdamW (weight_decay: null, momentum: null, other_params: {"used_for_slot":true})
Weight Averaging
  • EMA + Tight SWA; parameters: {"decay":0.997,"swa_every":50,"swa_start_step":4600}
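A sketch of the two running averages using the PR's hyperparameters (decay 0.997, an SWA snapshot every 50 steps starting at step 4600). How the EMA and SWA weights are combined is not specified, so only the update rule and schedule are shown:

```python
DECAY, SWA_EVERY, SWA_START = 0.997, 50, 4600

def ema_update(ema: float, w: float, decay: float = DECAY) -> float:
    """Standard exponential moving average of a parameter."""
    return decay * ema + (1.0 - decay) * w

def swa_steps(total_steps: int) -> list[int]:
    """Steps at which a 'tight' (late-window) SWA snapshot is taken."""
    return list(range(SWA_START, total_steps + 1, SWA_EVERY))

print(swa_steps(4750))  # [4600, 4650, 4700, 4750]
```

Starting SWA late keeps the average "tight" around the end-of-training basin instead of mixing in early, higher-loss checkpoints.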
Quantization
  • GPTQ (bits: 6, scope: full model)
  • late QAT (bits: 6, scope: full model)
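Int6 weights need bit-packing to realize the size win: 4 six-bit values fit in 3 bytes. An illustrative packer (the actual GPTQ storage layout in the PR may differ):

```python
def pack_int6(vals: list[int]) -> bytes:
    """Pack unsigned 6-bit values (0..63) MSB-first into bytes."""
    assert all(0 <= v < 64 for v in vals)
    bits, n, out = 0, 0, bytearray()
    for v in vals:
        bits = (bits << 6) | v
        n += 6
        while n >= 8:
            n -= 8
            out.append((bits >> n) & 0xFF)
    if n:  # left-align any trailing partial byte
        out.append((bits << (8 - n)) & 0xFF)
    return bytes(out)

print(len(pack_int6([0] * 4)))  # 4 six-bit values -> 3 bytes
```

The packed stream is then what zstd (below) compresses into the 15.8 MB artifact.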
Compression
  • zstd (level: 22)
Evaluation
  • sliding window eval; parameters: {"stride":64,"seq_len":2048}
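With stride 64 over a 2048-token window, each window re-reads up to the full context but only the newest 64 tokens are scored, so every token is evaluated with near-maximal left context. A sketch of the window schedule (the exact scoring boundaries are an assumption):

```python
def eval_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Return (start, end, score_from) triples: the model sees tokens
    [start, end) and only tokens [score_from, end) contribute to bpb."""
    windows, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, pos))
        pos = end
    return windows

ws = eval_windows(4096)
print(ws[0], ws[-1])  # every token is scored exactly once
```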
Test-Time Training
  • score-first TTT; parameters: {"steps":16,"learning_rate":0.008,"min_learning_rate":0.0008}
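The PR gives only the step count and LR endpoints (16 steps, 0.008 down to 0.0008); a cosine decay between them is an assumption. A sketch of that schedule:

```python
import math

STEPS, LR, MIN_LR = 16, 0.008, 0.0008

def ttt_lr(step: int) -> float:
    """Cosine decay from LR at step 0 to MIN_LR at the last step."""
    t = step / (STEPS - 1)  # 0.0 at the first step, 1.0 at the last
    return MIN_LR + 0.5 * (LR - MIN_LR) * (1.0 + math.cos(math.pi * t))
```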
Regularization
  • logit softcap; parameters: {"softcap":30}
Sequence Length
  • train_length: null; eval_length: 2048

Novel Contributions

  • LeakyReLU(0.5)^2 MLP with 3x expansion
  • XSA applied on all 11 layers
  • QK-Gain initialization at 4.0
  • Full GPTQ int6 quantization with zstd compression
  • SLOT evaluation with per-sample delta and logit bias optimization
  • Scored-position masking for SLOT loss
  • EMA plus Tight SWA training recipe