PR #1876

open

Non-record: Coprime-Stride Loader + Full GPTQ + Score-First TTT (3-seed mean 1.08008 BPB)

by Meirzhan05

val_bpb

1.0801

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.99 MB

Training Techniques

Architecture

depth recurrence

Layers 3-5 are run 3 times in sequence, reusing the same weights to create virtual depth.

parameters: {"layers":[3,5],"loops":3}
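A minimal sketch of the execution order such a loop could produce. It assumes 0-indexed layers, an inclusive [3, 5] range, and an 11-layer stack (the count listed under XSA); the function name is hypothetical and the PR's actual scheduling may differ.

```python
def virtual_layer_order(num_layers, loop_start, loop_end, loops):
    """Return the order in which physical layers run when the
    inclusive block [loop_start, loop_end] is repeated `loops` times."""
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1)) * loops
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


order = virtual_layer_order(num_layers=11, loop_start=3, loop_end=5, loops=3)
print(order)       # physical layer index at each virtual position
print(len(order))  # virtual depth
```

Under these assumptions the 11 physical layers yield 17 virtual layers, since the 3-layer block contributes 9 positions instead of 3.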

GQA

Uses grouped query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
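With 8 query heads and 4 KV heads, each KV head serves 2 query heads. The sketch below shows one common mapping, assuming consecutive query heads share a KV head; the helper name is hypothetical and the PR may group heads differently.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)
```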

XSA

XSA is applied on all 11 layers.

parameters: {"layers":11}

U-Net skip connections

U-Net style skip connections with learnable gates.

parameters: null

weight tying

Tied embeddings.

parameters: null

LeakyReLU

The MLP uses a squared LeakyReLU activation, LeakyReLU(0.5)^2, with negative slope 0.5.

parameters: {"slope":0.5}
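A scalar sketch of the activation, assuming the square is applied directly to the LeakyReLU output (so the result is always non-negative). Whether the PR preserves the sign in some other way is not stated; the function name is hypothetical.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0))   # positive branch: 2.0 squared
print(leaky_relu_squared(-2.0))  # negative branch: (0.5 * -2.0) squared
```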

Quantization

GPTQ

bits: null

scope: all
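The Novel Contributions list mentions a Cholesky fallback for ill-conditioned matrices. One common way to build such a fallback is to retry the factorization with increasing diagonal damping. The sketch below illustrates that idea in pure Python; the damping schedule, function names, and defaults are assumptions, not the PR's implementation.

```python
import math


def cholesky(a):
    """Plain lower-triangular Cholesky; raises ValueError on a non-positive pivot."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def cholesky_with_damping(hessian, base_damp=0.01, max_tries=5):
    """Try Cholesky; on failure, add growing damping to the diagonal and retry.

    Returns (factor, damping_used). The schedule here is hypothetical.
    """
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = 0.0
    for attempt in range(max_tries + 1):
        damped = [
            [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        try:
            return cholesky(damped), damp
        except ValueError:
            # Scale damping relative to the mean diagonal, 10x per retry.
            damp = base_damp * mean_diag * (10 ** attempt)
    raise ValueError("Cholesky failed even with damping")


# A singular matrix fails the plain factorization and triggers the fallback.
factor, used = cholesky_with_damping([[1.0, 1.0], [1.0, 1.0]])
print(used)
```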

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"newton_schulz_steps":5}

AdamW

weight_decay: null

momentum: null

other_params: {"scope":"embeddings/scalars"}

Weight Averaging

EMA

parameters: {"decay":0.9965}
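The standard EMA update with this decay can be sketched as follows, using flat lists of floats in place of model tensors (the function name is hypothetical). A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [1.0, 2.0]
ema = ema_update(ema, [0.0, 2.0])
print(ema)
```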

Compression

lzma

level: null
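A minimal sketch of how an LZMA-compressed, self-extracting Python file could be built with the standard library. The base85 encoding, the `exec`-based stub, and the function name are assumptions for illustration; the PR's actual packaging is not described here.

```python
import base64
import lzma


def build_self_extracting(source: str) -> str:
    """Return Python source that decompresses and runs `source` when executed."""
    packed = base64.b85encode(lzma.compress(source.encode("utf-8"))).decode("ascii")
    return (
        "import base64, lzma\n"
        f"exec(lzma.decompress(base64.b85decode({packed!r})).decode('utf-8'))\n"
    )


payload = "RESULT = sum(range(10))\n"
stub = build_self_extracting(payload)

# Running the stub recreates and executes the original payload.
namespace = {}
exec(stub, namespace)
print(namespace["RESULT"])
```

For a payload this small the stub is larger than the original; the approach only pays off when the compressed source is much smaller than the raw source.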

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_size":32000}
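The ordering that "score-first" implies can be sketched with a toy model: each chunk is scored with the current state before any update that uses that chunk. The scalar model, the plain gradient-style update (no momentum), and all names below are hypothetical stand-ins for the real model and optimizer.

```python
def score_first_ttt(chunks, score_fn, update_fn, state, epochs=3):
    """Score each chunk with the current state, then adapt on that chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(state, chunk))  # scored before adapting on this chunk
        for _ in range(epochs):
            state = update_fn(state, chunk)
    return losses, state


def mse(state, chunk):
    """Toy loss: mean squared error of a scalar prediction."""
    return sum((x - state) ** 2 for x in chunk) / len(chunk)


def step_toward_mean(state, chunk, lr=0.5):
    """Toy update: move the scalar toward the chunk mean."""
    mean = sum(chunk) / len(chunk)
    return state + lr * (mean - state)


losses, final_state = score_first_ttt(
    [[1.0, 1.0], [1.0, 1.0]], mse, step_toward_mean, state=0.0
)
print(losses, final_state)
```

The first chunk is scored by the unadapted state, so its loss reflects no information from that chunk; only later chunks benefit from the updates.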

Evaluation

sliding window eval

parameters: null
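No parameters are listed for the sliding window evaluation, so the sketch below uses a hypothetical window and stride. It shows the general pattern: overlapping windows provide context, but each token is scored exactly once.

```python
def sliding_windows(num_tokens, window, stride):
    """Yield (start, end, score_from) triples.

    Tokens in [start, end) are fed to the model; only tokens in
    [score_from, end) are scored, so no token is counted twice.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end, scored_up_to
        scored_up_to = end
        start += stride


windows = list(sliding_windows(num_tokens=10, window=4, stride=2))
print(windows)
```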

LR Schedule

warmdown

parameters: {"warmdown":0.72}

linear warmup

parameters: {"steps":20}
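The two schedule entries can be combined into a single learning-rate multiplier. This sketch assumes that `warmdown` 0.72 means the final 72% of training decays linearly to zero, and that warmup ramps linearly over the first 20 steps; both readings and the function name are assumptions.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.72):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0  # constant plateau
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))


for s in (0, 19, 100, 640, 1000):
    print(s, lr_scale(s, total_steps=1000))
```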

Regularization

logit softcap

parameters: {"value":30}
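A scalar sketch of a logit softcap at 30, assuming the common tanh form `cap * tanh(logit / cap)`; the PR does not state which formula it uses, and the function name is hypothetical.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)


print(softcap(1.0))     # small logits pass through almost unchanged
print(softcap(1000.0))  # large logits saturate at the cap
```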

Novel Contributions

Coprime-stride multi-shard loader with progress-based adaptive shard selection
Full Hessian GPTQ with Cholesky fallback for ill-conditioned matrices
LZMA-compressed self-extracting Python submission artifact
Score-first test-time training that scores each chunk before updating
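The coprime-stride idea from the first item can be illustrated in a few lines: stepping through N items with a stride that shares no factor with N visits every item exactly once before repeating. The sketch below covers only that property; the function name and the way the stride is adjusted are assumptions, and the progress-based shard selection is not modeled.

```python
import math


def coprime_stride_order(num_items, stride, offset=0):
    """Visit every index in range(num_items) once, stepping by a coprime stride."""
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    # Bump the stride until it is coprime with num_items (hypothetical policy).
    while math.gcd(stride, num_items) != 1:
        stride += 1
    return [(offset + i * stride) % num_items for i in range(num_items)]


order = coprime_stride_order(num_items=10, stride=4)
print(order)
```

Here the requested stride of 4 shares a factor with 10, so the sketch moves on to 7, the next coprime value, and the resulting walk is a full permutation of the indices.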