val_bpb: 1.1266
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB
Training Techniques
Architecture
- weight tying: tied input and output embeddings using a single shared embedding matrix.
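The tying above can be sketched in plain NumPy (sizes and names are illustrative, not the submission's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 4096, 256                   # illustrative sizes
E = rng.normal(0.0, 0.02, (vocab, d_model))  # the single shared matrix

def embed(token_ids):
    # input side: row lookup into E
    return E[token_ids]

def unembed(hidden):
    # output side: logits through the transpose of the same E
    return hidden @ E.T

h = embed(np.array([1, 2, 3]))   # (3, d_model)
logits = unembed(h)              # (3, vocab)
```

Tying removes the separate vocab × d_model output head, a meaningful saving under a 16 MB artifact cap.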
- GQA: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
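A minimal NumPy sketch of grouped-query attention with the reported 8 query heads sharing 4 KV heads (causal mask omitted for brevity; shapes are illustrative):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.
    q: (H, T, dh); k, v: (H_kv, T, dh) with H a multiple of H_kv."""
    groups = q.shape[0] // k.shape[0]
    k = np.repeat(k, groups, axis=0)   # broadcast each KV head to its group
    v = np.repeat(v, groups, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)   # softmax; causal mask omitted
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16, 32))       # 8 query heads
k = rng.normal(size=(4, 16, 32))       # 4 KV heads, as reported
v = rng.normal(size=(4, 16, 32))
out = gqa(q, k, v)                     # (8, 16, 32)
```

Halving the KV heads halves both the KV projection parameters and the KV cache.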
- MLP3x: expanded MLP width beyond the baseline (multiplier: 3.5, hidden_dim: 1792).
- U-Net skip connections: added encoder-decoder style skip connections between matching layers.
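The wiring can be sketched abstractly; `blocks` stands in for transformer layers, and pairing layer i with layer n-1-i is an assumption about which layers "match":

```python
def unet_forward(x, blocks):
    """Run blocks with U-Net pairing: the output of encoder layer i is
    added to the input of decoder layer n - 1 - i."""
    n = len(blocks)
    saved = []
    for blk in blocks[: n // 2]:       # encoder half: save each output
        x = blk(x)
        saved.append(x)
    for blk in blocks[n // 2 :]:       # decoder half: add the matching skip
        x = blk(x + saved.pop())
    return x
```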
- LeakyReLU: used a squared LeakyReLU activation in the MLP (slope: 0.5).
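The exact composition of "LeakyReLU squared" is not spelled out above; one plausible reading, sketched here as an assumption, squares the output of the leaky ramp:

```python
def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU (assumed form: square of the leaky ramp).
    Note that squaring discards the sign of the negative branch."""
    y = x if x >= 0.0 else slope * x
    return y * y
```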
- XSA: applied cross-sequence attention in the last 4 layers.
Quantization
- GPTQ: 5-bit, applied to attention and MLP weights.
- QAT: 5-bit quantization-aware training of the MLP layers.
- int8: 8-bit, applied to the tied embeddings.
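The int8 path for the tied embeddings can be illustrated with generic symmetric per-tensor quantization (a sketch only; the GPTQ and QAT 5-bit paths involve calibration and training and are not shown):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, (4096, 64)).astype(np.float32)  # illustrative shape
q, scale = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, scale)).max()  # bounded by scale / 2
```

A 4x size reduction versus float32, with reconstruction error bounded by half a quantization step.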
Compression
- brotli: level 11.
Other
- byte-shuffle: applied a byte-shuffle pre-filter before brotli to improve compression of the quantized weights.
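A sketch of the byte-shuffle pre-filter; zlib stands in for brotli here because brotli is not in the Python standard library, but the effect is the same: grouping same-position bytes puts the low-entropy high-order bytes into long compressible runs.

```python
import zlib
import numpy as np

def byte_shuffle(arr):
    """Regroup bytes so byte 0 of every element comes first, then byte 1, ..."""
    b = arr.view(np.uint8).reshape(-1, arr.dtype.itemsize)
    return np.ascontiguousarray(b.T).tobytes()

def byte_unshuffle(data, dtype, count):
    """Inverse transform, needed when loading the artifact."""
    b = np.frombuffer(data, dtype=np.uint8).reshape(np.dtype(dtype).itemsize, count)
    return np.ascontiguousarray(b.T).view(dtype).reshape(count)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 20.0, 100_000).astype(np.int16)  # stand-in quantized weights
plain = zlib.compress(w.tobytes(), 9)                # zlib stands in for brotli -q 11
shuffled = zlib.compress(byte_shuffle(w), 9)         # shuffle first, then compress
```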
Evaluation
- sliding window eval (stride: 64).
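The index bookkeeping for sliding-window evaluation can be sketched as follows, under the assumption that each step scores only the tokens not yet scored, with the rest of the window serving as context (window size taken from the 2048 eval length):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """(ctx_start, end, n_scored) triples: feed tokens [ctx_start:end] to
    the model, score only the last n_scored of them."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token past the first window is scored with near-full left context, at the cost of one forward pass per stride.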
Test-Time Training
- score-first TTT (learning_rate: 0.003, epochs_per_chunk: 20, chunks: 348).
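The control flow of score-first TTT can be sketched with a toy stand-in model (a single mean parameter fit by gradient steps), since the real model and loss are not shown here; the key point is that each chunk is scored before being trained on:

```python
def ttt_stream(chunks, score, adapt, epochs_per_chunk=20):
    """Score-first TTT: score each chunk with the current weights, *then*
    adapt on that same chunk, so no score uses weights trained on it."""
    scores = []
    for chunk in chunks:
        scores.append(score(chunk))
        for _ in range(epochs_per_chunk):
            adapt(chunk)
    return scores

# toy stand-in "model": one mean parameter, squared-error loss
state = {"mu": 0.0}

def score(chunk):
    return sum((x - state["mu"]) ** 2 for x in chunk)

def adapt(chunk, lr=0.1):
    grad = sum(2.0 * (state["mu"] - x) for x in chunk) / len(chunk)
    state["mu"] -= lr * grad

scores = ttt_stream([[1.0, 1.0], [1.0, 1.0]], score, adapt)
```

Later chunks benefit from adaptation on earlier ones, so scores improve over the stream.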
Optimizer
- Muon (weight_decay, momentum, and other parameters not recorded)
- Adam (weight_decay, momentum, and other parameters not recorded)
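The core Muon idea, orthogonalizing the momentum update for 2D weights, can be sketched as follows; exact SVD stands in for the Newton-Schulz iteration the real optimizer uses, and the hyperparameters are illustrative, not the submission's:

```python
import numpy as np

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style update (sketch): momentum buffer, then replace the
    update with its nearest orthogonal matrix. Real Muon approximates this
    orthogonalization with a Newton-Schulz iteration; exact SVD is used here."""
    buf = beta * buf + grad
    u, _, vt = np.linalg.svd(buf, full_matrices=False)
    return w - lr * (u @ vt), buf

rng = np.random.default_rng(0)
w = np.zeros((4, 3))
g = rng.normal(size=(4, 3))
w_new, buf = muon_step(w, g, np.zeros((4, 3)))
upd = (w - w_new) / 0.02   # recover the orthogonalized update direction
```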
Weight Averaging
- EMA + SWA (ema_decay: 0.997).
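Both averaging rules are one-liners; a flat parameter list stands in for model weights here, and the EMA decay is the reported 0.997:

```python
def ema_update(avg, weights, decay=0.997):
    """EMA of weights: avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_seen):
    """SWA: running uniform average over the n_seen + 1 checkpoints so far."""
    return [(a * n_seen + w) / (n_seen + 1) for a, w in zip(avg, weights)]
```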
LR Schedule
- warmdown (fraction: 0.35)
- cosine decay: used for TTT (lr: 0.003)
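Both schedules can be sketched directly; the warmdown is assumed to be constant-then-linear-to-zero over the final 0.35 of training, and step counts are illustrative:

```python
import math

def warmdown_lr(step, total_steps, base_lr, frac=0.35):
    """Constant LR, then linear decay to zero over the final `frac` of training."""
    start = int(total_steps * (1.0 - frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)

def cosine_lr(step, total_steps, base_lr=0.003):
    """Cosine decay from base_lr to zero, as used for the TTT phase."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```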
Regularization
- logit softcap (cap: 30).
Sequence Length
- train_length: 2048
- eval_length: 2048
Novel Contributions
- Custom sp4096 SentencePiece tokenizer hosted on HuggingFace
- Mixed int5/int8 quantization scheme with int8 tied embeddings
- Byte-shuffle plus brotli compression to fit under the 16MB cap
- GPTQ calibration using self-generated autoregressive sequences
- Score-first test-time training with full-block SGD adaptation