PR #1392

open

Add record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli (1.1020 BPB)

by Its-Just-Crump

val_bpb

1.1020

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~15.88 MB

Training Techniques

Architecture

SentencePiece 4096

Uses a 4096-token SentencePiece tokenizer instead of the baseline SP1024.

parameters: {"vocab_size":4096}

MLP4x

Widens the MLP hidden layer to 4x the model dimension and uses a LeakyReLU-squared activation.

parameters: {"multiplier":4}
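The activation can be sketched as below. This is a minimal illustration rather than the PR's code: the name `leaky_relu_squared` is hypothetical, and the negative slope of 0.5 is an assumption, since the summary does not state it.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU, then square the result.

    negative_slope is an assumed value; the PR summary does not give one.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_dim(model_dim, multiplier=4):
    """Hidden width of the MLP for a given model dimension."""
    return multiplier * model_dim
```

Note that squaring discards the sign, so negative inputs still produce a small positive output instead of zero.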

depth recurrence

Re-executes layers 4 and 5 starting at step 3000 to add logical depth without adding parameters.

parameters: {"layers":[4,5],"start_step":3000}
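One way to picture the schedule is as a list of layer indices to execute. The sketch below assumes 11 layers (taken from the XSA entry) and assumes the recurred layers are replayed immediately after their first pass; the summary does not say where the second pass is placed, and `layer_schedule` is a hypothetical helper.

```python
def layer_schedule(num_layers, recur_layers, step, start_step):
    """Return the order in which layer indices run at a given training step.

    Before start_step every layer runs once. From start_step onward the
    layers in recur_layers run a second time, right after the last of them
    (an assumed placement), reusing the same weights.
    """
    order = list(range(num_layers))
    if step < start_step or not recur_layers:
        return order
    insert_at = order.index(max(recur_layers)) + 1
    return order[:insert_at] + list(recur_layers) + order[insert_at:]
```

With these assumptions the forward pass grows from 11 to 13 layer applications at step 3000 while the parameter count stays the same.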

Parallel Residuals

Merges MLP and attention outputs with learned lane_merge and resid_mix_mlp parameters from layer 7 onward.

parameters: {"start_layer":7}
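The difference from a standard sequential block can be sketched with plain callables. This is only an illustration under loud assumptions: the scalar `merge` weight is a hypothetical stand-in for the learned lane_merge and resid_mix_mlp parameters, whose exact form the summary does not describe.

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP reads the output of the attention sub-layer."""
    h = x + attn(x)
    return h + mlp(h)


def parallel_block(x, attn, mlp, merge=0.5):
    """Parallel block: attention and MLP both read the same input x.

    merge is a hypothetical scalar standing in for the learned merge
    parameters; the real parameterization is not given in the summary.
    """
    return x + merge * attn(x) + (1.0 - merge) * mlp(x)


def uses_parallel_residual(layer, start_layer=7):
    """Layers at or above start_layer use the parallel form."""
    return layer >= start_layer
```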

QK-Gain

Initializes query and key projections with 5x scale.

parameters: {"gain":5}

RoPE

Uses partial rotary positional embeddings, rotating 16 of 64 dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
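A minimal sketch of partial rotation on a single vector follows. It assumes the rotated dimensions are the first 16, that dimension i is paired with dimension i + 8, and that the frequency base is 10000; none of these details are stated in the summary, and `partial_rope` is a hypothetical name.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec; leave the rest untouched.

    Pairing (i, i + rotary_dims // 2) and base=10000 are assumptions.
    """
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

The remaining 48 dimensions carry no positional signal, so they can encode position-independent content.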

VE128

Applies VE128 in layers 9-10.

parameters: {"layers":[9,10]}

SmearGate

Uses a position-mixing gate.

parameters: null

U-Net skip connections

Adds encoder-decoder skip connections.

parameters: null

XSA

Uses XSA on all 11 layers.

parameters: {"layers":11}

Optimizer

Muon

weight_decay: 0.09

momentum: null

other_params: {"parallel":true,"muon_eq_r":true}

AdamW

weight_decay: 0.09

momentum: null

other_params: null

Quantization

GPTQ

bits: 6

scope: all
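To show what a 6-bit weight grid looks like, the sketch below uses plain round-to-nearest quantization with one scale per row. This is deliberately not GPTQ: GPTQ additionally uses calibration data to compensate rounding error, which is omitted here. The symmetric range [-31, 31] and per-row scaling are assumptions.

```python
def quantize_row_int6(row):
    """Round-to-nearest symmetric 6-bit quantization of one weight row.

    Simplified stand-in for GPTQ; returns integer codes and the row scale.
    """
    qmax = 31  # assumed symmetric range [-31, 31]
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```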

Weight Averaging

EMA + Tight SWA

parameters: {"ema_decay":0.997,"swa_every":50}
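The two averaging rules can be sketched on flat lists of weights. The summary does not define what makes the SWA "tight", so the sketch does not model which steps are eligible; it only shows an EMA update with decay 0.997 and a running mean over snapshots taken every 50 steps. `ema_update` and `SnapshotAverage` are hypothetical names.

```python
def ema_update(ema, weights, decay=0.997):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SnapshotAverage:
    """Running mean of weight snapshots taken every `every` steps."""

    def __init__(self, every=50):
        self.every = every
        self.count = 0
        self.mean = None

    def maybe_add(self, step, weights):
        """Fold in a snapshot if step is a positive multiple of `every`."""
        if step <= 0 or step % self.every != 0:
            return False
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]
        return True
```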

Compression

Brotli

level: null

Evaluation

sliding window eval

parameters: {"stride":64}
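Sliding-window evaluation can be sketched as a window planner that scores every token exactly once. The context length is not given in the summary, so `context_len` is a hypothetical parameter, and the rule of scoring only the tokens not covered by the previous window is an assumption about how the stride is used.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Plan evaluation windows as (start, end, score_from) tuples.

    Each window feeds tokens [start, end) to the model but only counts the
    loss on [score_from, end), so no token is scored twice.
    """
    windows = []
    scored_until = 0
    start = 0
    while scored_until < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_until))
        scored_until = end
        start += stride
    return windows
```

After the first window, each scored token sees at least `context_len - stride` tokens of preceding context, which is why a small stride tends to lower the measured BPB at the cost of more forward passes.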

LR Schedule

warmdown

parameters: {"warmdown_steps":4000}
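A warmdown schedule can be sketched as a learning-rate multiplier. The sketch assumes a step-based linear decay to zero over the final 4000 steps; the summary gives neither the decay shape nor the total step count, so `total_steps` is a hypothetical parameter.

```python
def lr_scale(step, total_steps, warmdown_steps=4000):
    """Learning-rate multiplier: 1.0, then linear decay to 0 at total_steps.

    Linear shape and step-based timing are assumptions.
    """
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    return max(total_steps - step, 0) / warmdown_steps
```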

Regularization

weight decay

parameters: {"adam_wd":0.09,"muon_wd":0.09}

LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
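The scale formula evaluates as follows, assuming `layer` is a zero-based index (the summary does not say) and using the hypothetical name `ln_scale`.

```python
import math


def ln_scale(layer):
    """Per-layer scale 1/sqrt(layer + 1); layer is assumed zero-based."""
    return 1.0 / math.sqrt(layer + 1)
```

Under that assumption the first layer is unscaled, and the last of 11 layers is scaled by 1/sqrt(11), roughly 0.30.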

Novel Contributions

SP4096-native architecture replacing SP1024 + BigramHash
Depth recurrence on layers 4-5 starting at step 3000
Parallel residuals with learned merge from layer 7 onward
QK-Gain 5.0 initialization
MuonEq-R optimizer variant
GPTQ calibration with 128 batches
Brotli compression for SP4096 int6 weights
Removal of BigramHash, TrigramHash, and TTT from the prior stack