PR #1446

open

Non-record: 11L gated Krylov + AR GPTQ int6 + lzma, 1.09596 BPB

by LauraGomezjurado
val_bpb
1.0960
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,925,099 bytes

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"gated_krylov_correction":true,"alpha":0.05,"eta_threshold":0.03,"warmup_steps":1000,"decision_every":100,"every":2,"hutchinson_samples":2,"rank_max":4,"rank_scale":1}
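The gating decision above (eta_threshold 0.03, hutchinson_samples 2) can be sketched as follows. This is a minimal illustration, not the submission's code: it assumes the non-normality score is ||W^T W - W W^T||_F relative to ||W^T W||_F, estimated with a few Hutchinson probes so the commutator is never formed.

```python
import numpy as np

def nonnormality_eta(W, num_samples=2, seed=0):
    """Hutchinson estimate of ||C||_F / ||W^T W||_F for the commutator
    C = W^T W - W W^T, using tr(C^T C) = E[z^T C^T C z] with Rademacher z.
    Only matvecs with C are needed; W is assumed square for simplicity."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    acc = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)
        Cz = W.T @ (W @ z) - W @ (W.T @ z)   # C @ z via two matvec pairs
        acc += Cz @ Cz                        # z^T C^T C z ~ ||C||_F^2
    c_fro = np.sqrt(acc / num_samples)
    g_fro = np.linalg.norm(W.T @ W)          # dense here; cheap in a sketch
    return c_fro / max(g_fro, 1e-12)

def gate_krylov(W, eta_threshold=0.03, hutchinson_samples=2):
    """True when the layer looks non-normal enough to warrant correction,
    mirroring eta_threshold=0.03 and hutchinson_samples=2 above."""
    return nonnormality_eta(W, hutchinson_samples) > eta_threshold
```

A symmetric (normal) matrix has a zero commutator and is never gated; a shift-like, highly non-normal matrix trips the gate.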
Quantization
GPTQ
bits: 6
scope: all
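The rounding grid behind 6-bit GPTQ can be sketched as below. This shows only symmetric per-output-channel int6 quantization; the Hessian-based error compensation that makes it GPTQ (and the submission's autoregressive calibration) is omitted, and the [-31, 31] range is an assumption.

```python
import numpy as np

def quant_int6_per_channel(W):
    """Symmetric per-row int6 quantization: levels in [-31, 31]
    (2**(6-1) - 1) with one float scale per output channel."""
    qmax = 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequant(q, scale):
    """Reconstruct float weights from int6 codes and per-row scales."""
    return q.astype(np.float32) * scale
```

The worst-case per-weight reconstruction error is half a quantization step, i.e. scale/2 for that row.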
Compression
lzma
level: 9
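The final artifact compression (lzma, level 9) maps directly onto the Python standard library; a minimal sketch of the round trip, with the serialized checkpoint bytes assumed as input:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized checkpoint with LZMA at preset 9,
    matching the submission's lzma/level-9 setting."""
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    """Invert compress_artifact to recover the checkpoint bytes."""
    return lzma.decompress(blob)
```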
Architecture
XSA
XSA attention used across all layers
parameters: {"layers":11}
BigramHash
BigramHash embedding component
parameters: {"dimension":112,"size":3072}
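A BigramHash component with the listed size 3072 and dimension 112 can be sketched as hashing each (previous, current) token pair into a small learned table. The hash multiplier and the zero-padding of position 0 are assumptions; the submission does not specify them.

```python
import numpy as np

def bigram_ids(tokens, table_size=3072):
    """Hash each (previous, current) token pair into a fixed-size table.
    The odd multiplier is arbitrary; the real hash is not specified."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))  # pad the first position
    return (prev * 1000003 + tokens) % table_size

class BigramHashEmbedding:
    """Lookup table of hashed-bigram vectors, added alongside the
    ordinary token embedding (dimension=112, size=3072 above)."""
    def __init__(self, table_size=3072, dim=112, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(table_size, dim))

    def __call__(self, tokens):
        return self.table[bigram_ids(tokens, self.table.shape[0])]
```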
SmearGate
Position-mixing gate
parameters: null
VE128
VE128 used in layers 9-10
parameters: {"layers":[9,10]}
Partial RoPE
RoPE applied to a subset of dimensions
parameters: {"dims":"16/64"}
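Partial RoPE with "16/64" (16 rotary dims out of a 64-dim head) can be sketched as rotating only the first 16 channels and passing the rest through. The frequency base 10000 and the half-split pairing are common defaults, assumed here:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16):
    """Apply RoPE to the first rot_dims channels of each head and leave
    the remaining channels unrotated. x: (T, head_dim), positions: (T,)."""
    positions = np.asarray(positions)
    d = rot_dims // 2
    inv_freq = 10000.0 ** (-np.arange(d) / d)
    ang = positions[:, None] * inv_freq[None, :]      # (T, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[..., :d], x[..., d:rot_dims], x[..., rot_dims:]
    rot = np.concatenate([x1 * cos - x2 * sin,        # 2-D rotation per pair
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, rest], axis=-1)
```

Rotation preserves the norm of the rotary slice, and position 0 is left unchanged, which the test checks.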
U-Net skip connections
Encoder-decoder skip connections
parameters: null
LeakyReLU
Squared LeakyReLU MLP activation
parameters: {"squared":true,"negative_slope":0.5}
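The squared-LeakyReLU activation with negative_slope 0.5 is a one-liner; note that plain squaring (assumed here) makes the output non-negative, in the spirit of the ReLU² activations used in fast GPT training stacks:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU followed by squaring (slope 0.5 as listed above).
    Whether the sign is preserved after squaring is not specified in
    the submission; this sketch squares outright."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```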
MLP3x
Three-layer MLP
parameters: {"width":1536}
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997,"swa_every":50}
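The "EMA + Tight SWA" entry (decay 0.997, snapshot every 50 steps) can be sketched as an EMA tracker whose state is periodically folded into a uniform SWA mean. How the submission actually combines the two averages is not stated; averaging the EMA snapshots is an assumption.

```python
import numpy as np

class EmaSwaAverager:
    """EMA with decay 0.997 plus an SWA snapshot every 50 steps,
    mirroring {"decay": 0.997, "swa_every": 50} above."""
    def __init__(self, params, decay=0.997, swa_every=50):
        self.decay, self.swa_every = decay, swa_every
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa = {k: np.zeros_like(v) for k, v in params.items()}
        self.step = 0
        self.swa_count = 0

    def update(self, params):
        self.step += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.swa_every == 0:
            self.swa_count += 1
            for k in self.swa:  # running uniform mean of EMA snapshots
                self.swa[k] += (self.ema[k] - self.swa[k]) / self.swa_count

    def averaged(self):
        """SWA mean once snapshots exist, otherwise the raw EMA."""
        return self.swa if self.swa_count else self.ema
```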
Evaluation
sliding window eval
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
pruning
parameters: {"type":"selective ±1","criterion":"reconstruction error"}
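The two regularization entries above can be sketched together. The LN scale is exactly the listed 1/sqrt(layer+1); for the ±1 pruning, the per-tensor scale, the pruned fraction, and the elementwise form of the reconstruction-error criterion are all assumptions (the submission only names "selective ±1" with a reconstruction-error criterion).

```python
import numpy as np

def ln_scale(layer_index):
    """Depth-dependent LayerNorm gain, scale = 1/sqrt(layer+1),
    damping the residual contribution of deeper layers."""
    return 1.0 / np.sqrt(layer_index + 1)

def selective_pm1_prune(W, frac=0.1):
    """Snap the fraction of weights whose replacement by sign(w)*scale
    changes the tensor least, shrinking the artifact's symbol alphabet
    ahead of lzma. Scale and fraction are illustrative assumptions."""
    scale = np.abs(W).mean()
    snapped = np.sign(W) * scale
    err = np.abs(snapped - W)             # per-weight reconstruction error
    k = int(frac * W.size)
    idx = np.argsort(err, axis=None)[:k]  # lowest-error weights first
    out = W.copy()
    out.flat[idx] = snapped.flat[idx]
    return out
```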

Novel Contributions

  • Gated Krylov residual correction added selectively on top of Muon
  • Hutchinson-based non-normality detection using the commutator W^T W - W W^T
  • Adaptive Krylov rank correction blended into the Muon direction
  • Full-Hessian GPTQ int6 calibration on autoregressively self-generated data
  • Selective ±1 pruning to fit the 16MB artifact cap
  • Strong 11-layer SentencePiece GPT stack with XSA, BigramHash, SmearGate, VE128, partial RoPE, and U-Net skips
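One way the adaptive-rank Krylov correction could blend into an optimizer direction is sketched below. This is a guess at the mechanism, not the submission's code: the seed vector, the use of the commutator C = W^T W - W W^T to grow the subspace, and the alpha-damped projection (alpha = 0.05, rank_max = 4 from the optimizer parameters above) are all assumptions.

```python
import numpy as np

def krylov_correction(update, W, rank=4, alpha=0.05):
    """Build an orthonormal Krylov basis {v, Cv, C^2 v, ...} with
    C = W^T W - W W^T, then damp the update's component in that
    subspace by alpha. update: (n, m), W: (n, n)."""
    basis = []
    v = update.mean(axis=1)                  # seed direction (an assumption)
    for _ in range(rank):
        for b in basis:                      # Gram-Schmidt against the basis
            v = v - (b @ v) * b
        nv = np.linalg.norm(v)
        if nv < 1e-10:                       # subspace exhausted early
            break
        v = v / nv
        basis.append(v)
        v = W.T @ (W @ v) - W @ (W.T @ v)    # next Krylov vector: C @ v
    if not basis:
        return update
    Q = np.stack(basis, axis=1)              # orthonormal (n, r)
    proj = Q @ (Q.T @ update)                # component in span(Q)
    return update - alpha * proj             # blend: damp it by alpha
```

Since proj is an orthogonal projection and 0 < alpha < 2, the corrected update never has a larger norm than the original.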