PR #1319

open

Record: 11L LeakyReLU² XSA-all GPTQ-AR SLOT64 — 0.6951 BPB

by canivel (View on GitHub)
val_bpb: 0.6951
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.69 MB

Training Techniques

Architecture
LeakyReLU
Squared LeakyReLU activation applied in the MLP (feedforward) blocks.
parameters: {"layers":11,"model_dim":512,"mlp_expansion":3}
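A minimal sketch of the activation, assuming the common reading of "LeakyReLU squared" as squaring the LeakyReLU output; the 0.01 negative slope is an assumed default, not stated in the record:

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.01) -> float:
    # LeakyReLU followed by elementwise squaring; the 0.01 slope is an
    # assumed default, not given in the record's parameters.
    y = x if x > 0 else negative_slope * x
    return y * y
```

With model_dim 512 and mlp_expansion 3, this would act elementwise on the 1536-dim hidden activations of each MLP block.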
XSA
Exclusive Self Attention applied to all layers.
parameters: {"layers":11}
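The record does not define XSA beyond the name; one plausible reading of "exclusive" self attention is a causal mask that also excludes each position's own token, sketched here purely as an assumption:

```python
def xsa_mask(seq_len: int) -> list[list[bool]]:
    # Causal attention mask that additionally masks the diagonal, so a token
    # attends only to strictly earlier positions ("exclusive" is read here as
    # excluding self-attention to the own position -- an assumption).
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]
```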
BigramHash
Bigram hash embedding used as an auxiliary architectural component.
parameters: {"buckets":3072,"dimensions":112}
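A hedged sketch of the bucket lookup: the (previous, current) token pair is hashed into one of 3072 buckets, each mapped to a 112-dim embedding that augments the regular token embedding. The mixing constants below are illustrative, not from the submission:

```python
def bigram_bucket(prev_token: int, token: int, buckets: int = 3072) -> int:
    # Multiplicative mix of the token pair reduced to a bucket index; the
    # constant 0x9E3779B1 is an illustrative choice, not from the record.
    h = (prev_token * 0x9E3779B1 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```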
U-Net skip connections
Encoder-decoder style skip connections in the model.
parameters: {"encoder_layers":5,"decoder_layers":6}
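A sketch of how the 5 encoder and 6 decoder layers could be wired, assuming mirrored additive skips with the extra decoder layer receiving none (the exact pairing is not specified in the record):

```python
def unet_forward(x, encoder, decoder):
    # encoder/decoder are lists of layer callables (5 and 6 per the record).
    # Encoder outputs are pushed on a stack and added back into decoder
    # layers in mirrored order; the first decoder layer gets no skip since
    # there is one more decoder layer than encoder layer (assumed wiring).
    skips = []
    for layer in encoder:
        x = layer(x)
        skips.append(x)
    for i, layer in enumerate(decoder):
        if skips and i >= len(decoder) - len(skips):
            x = x + skips.pop()
        x = layer(x)
    return x
```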
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
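With 8 query heads over 4 KV heads, each KV head is shared by a group of two query heads, halving the KV cache. A minimal sketch of the mapping:

```python
def kv_head_for_query(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    # Consecutive query heads share one KV head (standard GQA grouping).
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size
```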
Quantization
GPTQ
bits: 6
scope: MLP+attention
int8
bits: 8
scope: embeddings
late QAT
bits: null
scope: all
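GPTQ itself adds Hessian-based error compensation (see the novel-contributions list below); as a hedged illustration, the 6-bit grid alone can be sketched with plain symmetric round-to-nearest:

```python
def quantize(values, bits=6):
    # Symmetric round-to-nearest quantization to signed `bits`-bit integers.
    # This only illustrates the int6 grid; GPTQ's column-by-column error
    # compensation is not shown.
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```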
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"lr":0.025,"momentum_schedule_end":0.99,"momentum_schedule_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embed_lr":0.035,"scalar_lr":0.025}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
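A sketch of the averaging loop with the listed hyperparameters; how EMA and SWA are combined is not stated, so this assumes SWA snapshots are taken of the EMA weights every 50 steps:

```python
class WeightAverager:
    # EMA with decay 0.997 plus SWA snapshots every 50 steps, per the record;
    # averaging EMA snapshots into SWA is an assumed combining rule.
    def __init__(self, weights, ema_decay=0.997, swa_interval=50):
        self.ema = list(weights)
        self.swa_sum = [0.0] * len(weights)
        self.swa_count = 0
        self.decay = ema_decay
        self.interval = swa_interval
        self.step = 0

    def update(self, weights):
        self.step += 1
        self.ema = [self.decay * e + (1 - self.decay) * w
                    for e, w in zip(self.ema, weights)]
        if self.step % self.interval == 0:
            self.swa_sum = [s + e for s, e in zip(self.swa_sum, self.ema)]
            self.swa_count += 1

    def averaged(self):
        if self.swa_count == 0:
            return list(self.ema)
        return [s / self.swa_count for s in self.swa_sum]
```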
Compression
lzma
level: 9
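The artifact is lzma-compressed at the maximum level, which is how the quantized weights fit under the ~16 MB budget; in Python this maps directly onto the stdlib:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # Preset 9 corresponds to the record's "level: 9" -- the slowest,
    # densest stdlib setting.
    return lzma.compress(blob, preset=9)
```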
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
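A hedged sketch of window placement: each 2048-token window slides by 64 tokens, and (an assumption about the scoring rule) only the last 64 positions of each window after the first are newly scored, so nearly every token is predicted with close-to-full left context:

```python
def eval_windows(n_tokens, seq_len=2048, stride=64):
    # Return (window_start, score_from) pairs. The first window scores all
    # its positions; later windows score only their final `stride` tokens.
    windows = []
    start = 0
    while start + seq_len <= n_tokens:
        score_from = start + seq_len - stride if start > 0 else start
        windows.append((start, score_from))
        start += stride
    return windows
```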
Test-Time Training
score-first TTT
parameters: {"steps":64,"warmstart_alpha":0.85,"learning_rate_start":0.01,"learning_rate_end":0.001}
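Two of the listed knobs sketched, with assumptions labeled: the warmstart blends the previous window's adapted weights with the base checkpoint (the blend rule is assumed), and the per-window learning rate decays from 0.01 to 0.001 over the 64 TTT steps (linear decay assumed):

```python
def warmstart(prev_params, base_params, alpha=0.85):
    # Initialize the next window's TTT weights from the previous window's
    # adapted weights; alpha = warmstart_alpha from the record. The linear
    # blend is an assumed rule, not confirmed by the submission.
    return [alpha * p + (1 - alpha) * b for p, b in zip(prev_params, base_params)]

def ttt_lr(step, steps=64, lr_start=0.01, lr_end=0.001):
    # Linear decay over the TTT steps; the decay shape is an assumption.
    t = step / (steps - 1)
    return lr_start + t * (lr_end - lr_start)
```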
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
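The listed scale rule in code, assuming a 0-based layer index; it damps deeper layers' residual contributions:

```python
import math

def ln_scale(layer: int) -> float:
    # Per-layer LayerNorm output scale, 1/sqrt(layer+1) per the record
    # (layer index assumed 0-based).
    return 1.0 / math.sqrt(layer + 1)
```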
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
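A sketch of the schedule, assuming (as is common for "warmdown") a constant learning rate followed by a linear decay to zero over the final 3500 steps; the 0.025 base rate is taken from the Muon entry above, and the total step count is a free parameter:

```python
def warmdown_lr(step, total_steps, base_lr=0.025, warmdown_steps=3500):
    # Hold base_lr, then decay linearly to 0 over the final warmdown_steps.
    # The linear shape and the pairing with the Muon lr are assumptions.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```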
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • SLOT with warmstart across adjacent evaluation windows
  • Autoregressive self-generated GPTQ calibration tokens
  • Full Hessian GPTQ with Cholesky error compensation and Hessian-diagonal column reordering
  • Int6 MLP+attention quantization with Int8 embeddings under a 16 MB artifact budget