PR #720

open

Record Submission: 1.1078 BPB — XSA6 + BigramHash4K on Hedge Mixer Stack

by agalimova
val_bpb: 1.1078
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
XSA
Applies XSA to the last 6 layers of the model.
parameters: {"layers":6}
BigramHash
Uses hashed bigram embeddings in the Hedge Mixer stack.
parameters: {"vocab_size":4096,"embedding_dim":128}
Partial RoPE
Uses rotary positional embeddings on a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
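
The BigramHash idea can be sketched independently of the Hedge Mixer stack: hash each (previous, current) token pair into a fixed 4096-slot embedding table and accept collisions. The table initialization, the BOS handling at position 0, and the multiplicative hash constant below are illustrative assumptions, not the submission's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
BIGRAM_VOCAB, EMB_DIM = 4096, 128  # the submission's reported parameters
bigram_table = rng.normal(0.0, 0.02, size=(BIGRAM_VOCAB, EMB_DIM))

def bigram_hash_embed(tokens):
    """Hashed bigram embeddings: map each (prev, cur) token pair to one of
    4096 table rows via a cheap multiplicative hash; collisions are
    accepted by design to keep the table small."""
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])        # token 0 stands in for BOS
    h = (prev * 1000003 + toks) % BIGRAM_VOCAB     # arbitrary odd multiplier
    return bigram_table[h]                         # shape: (seq_len, 128)
```

Collisions are the price of keeping the table at 4096 entries regardless of the model's true vocabulary size; the bet is that frequent bigrams still get usefully distinct rows.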
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
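
For context, the core of a single-device Muon step is momentum accumulation followed by Newton–Schulz orthogonalization of the update. The quintic coefficients below are the ones published with Muon; the learning rate and momentum values are placeholder assumptions (the submission reports both as null), and Parallel Muon's parameter-banking/distribution logic is omitted entirely.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update on a 2D weight: momentum buffer, then an
    orthogonalized step. lr/momentum here are assumed defaults."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf
```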
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency_steps":50}
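
Both averaging schemes are standard; a minimal sketch with the submission's settings (EMA decay 0.997, SWA snapshot every 50 steps), assuming weights stored as a dict of arrays:

```python
import numpy as np

class EMA:
    """Exponential moving average of the weights (decay = 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

class SWA:
    """Equal-weight running average of snapshots taken every
    `frequency_steps` optimizer steps."""
    def __init__(self, frequency_steps=50):
        self.freq, self.n, self.avg = frequency_steps, 0, None

    def maybe_update(self, step, params):
        if step % self.freq != 0:
            return                       # not a snapshot step
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n
```

How the two averages are combined at evaluation time (EMA then SWA, or one of the two alone) is not specified in the record.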
Quantization
GPTQ-lite
bits: 6
scope: model weights
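
"GPTQ-lite" is not a standard name; the sketch below shows only the baseline it presumably builds on, per-channel symmetric round-to-nearest at 6 bits, without GPTQ's Hessian-based error compensation.

```python
import numpy as np

def quantize_6bit(W):
    """Per-output-channel symmetric round-to-nearest quantization.
    Returns int codes in [-32, 31] plus one float scale per row."""
    qmax = 2 ** 5 - 1                                   # 31 for 6-bit signed
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from codes and scales."""
    return q.astype(np.float32) * scale
```

At 6 bits the worst-case per-weight error is half a quantization step, i.e. scale/2 for that row.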
Compression
zstd
level: null
Test-Time Training
score-first TTT
parameters: {"epochs":4,"optimizer":"AdamW"}
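
"Score-first" is read here as: score each evaluation segment with the current weights before adapting on it, so the reported loss never benefits from training on that same segment. That reading, and the use of plain gradient steps in place of the reported AdamW optimizer, are interpretive assumptions.

```python
def score_first_ttt(segments, loss_and_grad, theta, lr=1e-3, epochs=4):
    """Test-time training, 'score-first' variant: for each segment,
    record the loss under the current weights first, then take `epochs`
    gradient passes over that segment before moving on."""
    scores = []
    for seg in segments:
        loss, _ = loss_and_grad(theta, seg)
        scores.append(loss)                  # score first...
        for _ in range(epochs):              # ...then adapt on the segment
            _, grad = loss_and_grad(theta, seg)
            theta = theta - lr * grad
    return scores, theta

# toy usage: fit a scalar to segments of constant value 1.0
lg = lambda th, seg: ((th - seg) ** 2, 2 * (th - seg))
scores, theta = score_first_ttt([1.0, 1.0], lg, theta=0.0, lr=0.25)
```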
LR Schedule
cosine decay
parameters: {"warmdown":true}
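
A cosine schedule is a one-liner; "warmdown: true" is taken here to mean the rate decays all the way to its floor by the end of training (an assumption, since the parameter is otherwise unspecified):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at the final
    step; base_lr and min_lr are placeholder values."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```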
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
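
The 1/sqrt(layer+1) rule damps each block's normalized output in proportion to depth, so deeper layers contribute progressively smaller residual updates. A minimal sketch (exactly where the scale is applied inside the Hedge Mixer stack is an assumption):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, no learned gain or bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scaled_ln(x, layer):
    """LayerNorm output scaled by 1/sqrt(layer+1) for block index `layer`
    (0-based), per the submission's regularization rule."""
    return layer_norm(x) / np.sqrt(layer + 1)
```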

Novel Contributions

  • Systematic combinatorial search over hyperparameters using autoresearch-multi
  • Increasing XSA_LAST_N from 4 to 6
  • Increasing BIGRAM_VOCAB_SIZE from 2048 to 4096
  • Combining XSA_LAST_N=6 with BigramHash vocab size 4096, yielding a superadditive improvement (the combined gain exceeds the sum of the two individual gains)
  • Hedge Mixer stack with BigramHash embeddings and XSA on the last 6 layers