PR #1069

closed

Non-record: 1.1190 BPB — Independent PR #549 Reproduction (10min 8×H100)

by manfromnowhere143

val_bpb

1.1190

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15,948,863 bytes

Training Techniques

Architecture

LeakyReLU

Uses LeakyReLU squared activation in the MLP.

parameters: {"slope":0.5}
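As a minimal sketch of the activation named above, assuming "LeakyReLU squared" means a LeakyReLU with negative slope 0.5 followed by an elementwise square (the summary does not say whether the PR preserves the sign on the negative side), a scalar version looks like this. The function name is illustrative, not taken from the PR.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y


# Negative inputs are scaled by the slope before squaring, so they contribute
# slope**2 = 0.25 times what the same positive magnitude would.
values = [leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)]
print(values)  # [1.0, 0.0, 9.0]
```

Under this reading the output is always non-negative, unlike a plain LeakyReLU.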

XSA

Uses XSA4 attention/sequence mechanism.

parameters: null

Partial RoPE

Applies rotary position embeddings to only part of the head dimension.

parameters: {"partial":"16/64"}
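The "16/64" setting can be read as rotating 16 of the 64 dimensions in each head and passing the other 48 through unchanged. The sketch below illustrates that idea for a single head vector. Which 16 dimensions are rotated, the half-split pairing, and the base of 10000 are assumptions for illustration; the summary does not specify them.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec by position-dependent angles.

    Entries from rot_dims onward are returned unchanged.
    """
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        # One frequency per rotated pair, as in standard RoPE.
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64           # one attention head with 64 dimensions
rotated = partial_rope(head, pos=5)
print(rotated[16:] == head[16:])  # the 48 unrotated dimensions are untouched
```

Because each pair is only rotated, the norm of the rotated slice is preserved.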

SmearGate

Adds SmearGate to the model.

parameters: null

BigramHash

Adds bigram hash embeddings/features.

parameters: null
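Since no parameters are listed, the following is only a generic sketch of how a hashed bigram feature is commonly built: each (previous token, current token) pair is hashed into a fixed number of buckets, and the bucket index selects a row of an extra embedding table. The bucket count, the multiplier, and the padding id are all hypothetical and not taken from the PR.

```python
def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int = 4096) -> int:
    """Map a (previous, current) token pair to a bucket index (hypothetical hash)."""
    return (prev_id * 1_000_003 + cur_id) % num_buckets


def bigram_bucket_ids(token_ids, num_buckets: int = 4096, pad_id: int = 0):
    """Return one bucket index per position; the first position pairs with pad_id."""
    out = []
    prev = pad_id
    for tok in token_ids:
        out.append(bigram_bucket(prev, tok, num_buckets))
        prev = tok
    return out


buckets = bigram_bucket_ids([5, 9, 5, 9])
print(buckets[1] == buckets[3])  # the bigram (5, 9) always lands in the same bucket
```

Hashing keeps the table size fixed regardless of vocabulary size, at the cost of collisions between unrelated bigrams.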

VE128

Uses value embeddings / value residual style features.

parameters: null

Regularization

LN scale

parameters: null

Weight Averaging

EMA

parameters: {"decay":0.997}
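With a decay of 0.997, an exponential moving average keeps 99.7% of the running average and blends in 0.3% of the current weights at each update, which corresponds to an averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 updates. A minimal sketch on a flat list of weights follows. The function name and the per-step update schedule are assumptions.

```python
def ema_update(avg, weights, decay=0.997):
    """Return the new EMA of the weights after one update."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


avg = [0.0, 0.0]
for _ in range(3):
    # In training, `weights` would be the live model parameters after each step.
    avg = ema_update(avg, [1.0, -1.0])
print(avg)
```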

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: null

Quantization

GPTQ-lite

bits: 6

scope: weights

GPTQ-lite

bits: 8

scope: weights
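To make the two bit widths concrete, the sketch below shows plain symmetric round-to-nearest quantization of one weight row. This is not GPTQ and not necessarily what "GPTQ-lite" does in the PR, since GPTQ-style methods also compensate for rounding error. It only illustrates the integer range available at 6 bits (−31 to 31) versus 8 bits (−127 to 127). The per-row scale and the function names are assumptions.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to signed integers."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.1]
for bits in (6, 8):
    q, scale = quantize_row(row, bits)
    err = max(abs(a - b) for a, b in zip(row, dequantize_row(q, scale)))
    print(bits, q, err)
```

The worst-case rounding error per weight is half the scale, so moving from 6 to 8 bits shrinks it by about a factor of four for the same row.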

Test-Time Training

score-first TTT

parameters: {"steps":3,"learning_rate":0.0001}
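Going by the name, "score-first" suggests that each piece of evaluation data is scored with the current weights before the model adapts on it, so no token is scored by a model that has already trained on it. The toy sketch below shows that ordering with the listed settings (3 steps, learning rate 0.0001) on a one-parameter model. The model, the loss, the chunking, and plain gradient descent are stand-ins, not the PR's implementation.

```python
def score_first_ttt(chunks, w=0.0, steps=3, lr=1e-4):
    """Toy model y ≈ w * x with squared error, adapted chunk by chunk.

    Returns (mean loss recorded at scoring time, final weight).
    """
    total_loss, total_n = 0.0, 0
    for chunk in chunks:
        # 1) Score the chunk with the current weight, before any update on it.
        total_loss += sum((w * x - y) ** 2 for x, y in chunk)
        total_n += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(steps):
            grad = sum(2.0 * (w * x - y) * x for x, y in chunk) / len(chunk)
            w -= lr * grad
    return total_loss / total_n, w


chunk = [(1.0, 2.0), (2.0, 4.0)]
loss_one, _ = score_first_ttt([chunk])
loss_two, w_final = score_first_ttt([chunk, chunk])
print(loss_one, loss_two, w_final)
```

In this sketch the first chunk is always scored with the unadapted weight. Only later chunks benefit from the updates.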

Novel Contributions

Independent reproduction and slight improvement of PR #549's stack
11-layer 512-d model with LeakyReLU², XSA4, Partial RoPE, LN Scale, EMA, Parallel Muon, GPTQ-lite, SmearGate, BigramHash, ValueEmbedding, and score-first TTT
Achieved 1.1190 BPB under standard competition constraints
Reported 7,166 steps in 600 seconds on 8×H100 SXM
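For readers unfamiliar with the metric, bits per byte is conventionally the total cross-entropy converted from nats to bits and divided by the number of raw bytes scored. The sketch below shows that conversion together with the average step time implied by the reported 7,166 steps in 600 seconds. The function name is illustrative, and the exact accounting used by the competition harness is assumed rather than confirmed here.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy in nats into bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


# Average wall-clock time per optimizer step from the reported run.
step_ms = 600 / 7166 * 1000
print(round(step_ms, 1))  # about 83.7 ms per step
```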