PR #1276
openRecord: MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0
by BiggerDABOSSView on GitHub
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~16 MB
Training Techniques
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"variant":"MuonEq-R","parallel":true}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"batch_seqs":32,"grad_clip":1}
Architecture
XSA
Cross-sequence attention extended to all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding component
parameters: {"vocab_size":1536}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16}
VE128
Value residual enhancement module
parameters: {"layers":[9,10],"dimension":128}
MLP3x
Three-times widened MLP with LeakyReLU activation
parameters: {"activation":"LeakyReLU","activation_slope":0.5}
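A minimal numpy sketch of the MLP3x block as described above: hidden width three times the model width, with LeakyReLU at the record's 0.5 negative slope. The weight initialization and dimensions here are illustrative, not taken from the PR.

```python
import numpy as np

def leaky_relu(x, slope=0.5):
    """LeakyReLU with the record's negative slope of 0.5."""
    return np.where(x >= 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    """3x-widened MLP block: d_model -> 3*d_model -> d_model."""
    return leaky_relu(x @ w_in) @ w_out

# Toy dimensions for illustration only.
d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, 3 * d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((3 * d_model, d_model)) / np.sqrt(3 * d_model)
y = mlp3x(x, w_in, w_out)
```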
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
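The weight-averaging pair above can be sketched as follows: an EMA of the weights with decay 0.997 plus an SWA running mean over checkpoints taken every 50 steps. The flat numpy parameter vector and the constant-step "optimizer" are stand-ins for the real training loop.

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    """Exponential moving average of weights (decay from the record)."""
    return decay * ema + (1.0 - decay) * params

class TightSWA:
    """Running mean over snapshots taken every `every` optimizer steps."""
    def __init__(self, every=50):
        self.every = every
        self.avg = None
        self.count = 0

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:
            self.avg += (params - self.avg) / self.count

params = np.zeros(3)
ema = params.copy()
swa = TightSWA(every=50)
for step in range(1, 201):
    params = params + 0.01  # stand-in for a real optimizer step
    ema = ema_update(ema, params)
    swa.maybe_update(step, params)
```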
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
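A sketch of the stride-64 sliding-window schedule implied by the parameters above: the first window scores all of its seq_len tokens, and each later window scores only its final `stride` new tokens, reusing the preceding seq_len − stride tokens as context. This is the conventional scheme; the PR's exact loop may differ.

```python
def scored_spans(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, score_start, end) spans for sliding-window eval.
    Every token is scored exactly once; later windows get up to
    seq_len - stride tokens of reused context."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        step = seq_len if score_start == 0 else stride
        end = min(score_start + step, n_tokens)
        ctx_start = max(0, end - seq_len)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans

spans = scored_spans(4200, seq_len=2048, stride=64)
```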
Sequence Length
sequence_length
train_length: 32768
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
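The warmdown schedule above can be sketched as constant LR followed by a linear decay to zero over the final 3500 steps; the total step count below is illustrative.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```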
Other
other
Context-only SLOT test-time delta optimization on past tokens only, reinitialized each sliding window
parameters: {"delta_shape":[1,1,512],"optimizer":"AdamW","learning_rate":0.005,"steps":8}
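A minimal sketch of the context-only SLOT step described above: a fresh additive delta of shape (1, 1, 512) is fit with AdamW (lr 0.005, 8 steps) on past tokens only, then discarded when the window slides. The quadratic loss and hand-rolled AdamW here are stand-ins for the model's LM loss and the real optimizer.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=0.005, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

def slot_delta(past_hidden, target, steps=8, lr=0.005):
    """Fit an additive delta of shape (1, 1, d) on past tokens only.
    A fresh (zero) delta is used for each sliding window; the quadratic
    loss below is a toy stand-in for the LM loss on past tokens."""
    d = past_hidden.shape[-1]
    delta = np.zeros((1, 1, d))
    m, v = np.zeros_like(delta), np.zeros_like(delta)
    for t in range(1, steps + 1):
        resid = (past_hidden + delta) - target
        g = 2.0 * resid.mean(axis=(0, 1), keepdims=True)
        delta, m, v = adamw_step(delta, g, m, v, t, lr=lr)
    return delta

rng = np.random.default_rng(0)
h = rng.standard_normal((1, 32, 512))
tgt = h + 0.1  # toy target: hidden states shifted by a constant
delta = slot_delta(h, tgt)
```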
Novel Contributions
- MuonEq-R row-normalization before Newton-Schulz orthogonalization
- Context-only SLOT optimized on past tokens during sliding-window evaluation
- XSA extended from last 4 layers to all 11 layers
- QK gain increased to 5.0
- Combined stack targeting about 1.110 val_bpb
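The first contribution can be sketched as follows, with the caveat that the exact MuonEq-R formulation lives in the PR: each row of the update matrix is normalized to unit norm before Newton-Schulz orthogonalization. The cubic Newton-Schulz iteration shown here is a simplification; Muon proper uses a tuned quintic polynomial.

```python
import numpy as np

def newton_schulz_orth(G, steps=10):
    """Approximately orthogonalize G via cubic Newton-Schulz iteration.
    Frobenius normalization keeps all singular values in (0, 1], where
    the map X -> 1.5*X - 0.5*(X X^T) X converges toward 1."""
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X

def muon_eq_r_update(G, steps=10):
    """MuonEq-R sketch (assumption): normalize each row of the update
    matrix to unit norm before Newton-Schulz orthogonalization."""
    rows = np.linalg.norm(G, axis=1, keepdims=True)
    return newton_schulz_orth(G / (rows + 1e-7), steps)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
U = muon_eq_r_update(G)
```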