PR #1709

open

Non-record: XSA F.normalize fix + byte-shuffle/brotli + Muon WD as compression knob

by Bananakin1

val_bpb

1.1470

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.96 MB

Training Techniques

Architecture

XSA

Cross-subtraction attention with a corrected normalization step: an erroneous F.rms_norm call is replaced with F.normalize, so the projection is removed along a unit vector.

parameters: {"layers":11}
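
The effect of the normalization fix can be sketched in plain Python. This is a minimal illustration, not the PR's code: it assumes XSA subtracts from an output vector `y` its projection onto a direction `v`, uses lists in place of tensors, ignores the eps terms, and all function names are hypothetical. An RMS-normalized vector has length sqrt(head_dim), so the subtracted projection comes out head_dim times too large; a unit vector gives the intended projection removal.

```python
import math

def unit_normalize(v):
    """Scale v to unit L2 length (the role F.normalize plays, eps ignored)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rms_normalize(v):
    """Scale v to unit root-mean-square (rms_norm without weight or eps)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / rms for x in v]

def remove_projection(y, direction):
    """Subtract (y . direction) * direction from y."""
    dot = sum(a * b for a, b in zip(y, direction))
    return [a - dot * d for a, d in zip(y, direction)]

head_dim = 4  # hypothetical toy size
v = [1.0, 2.0, -1.0, 0.5]
y = [0.3, -0.7, 1.2, 2.0]

fixed = remove_projection(y, unit_normalize(v))
buggy = remove_projection(y, rms_normalize(v))

# With the unit vector, what remains of y is orthogonal to v.
residual_fixed = sum(a * b for a, b in zip(fixed, v))

# Compare how much each variant subtracted from y, element by element.
subtracted_fixed = [a - b for a, b in zip(y, fixed)]
subtracted_buggy = [a - b for a, b in zip(y, buggy)]
ratio = [b / f for b, f in zip(subtracted_buggy, subtracted_fixed)]
```

Here `residual_fixed` is zero up to rounding, and every entry of `ratio` equals `head_dim`, which is the over-subtraction the fix removes.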

BigramHash

Bigram hash embedding component used in the model.

parameters: {"buckets":3072,"dimensions":112}
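
A rough sketch of how a bigram hash lookup could pick a bucket per position. The hash function, its multiplier, and the handling of the first position are assumptions made for illustration; the PR summary gives only the bucket count and embedding width.

```python
NUM_BUCKETS = 3072  # from the parameters above
EMBED_DIM = 112     # from the parameters above; width of each bucket's row

def bigram_bucket(prev_token, token, num_buckets=NUM_BUCKETS):
    """Map a (previous, current) token pair to a bucket index.

    The multiplier is an arbitrary constant chosen for this sketch.
    """
    return (prev_token * 1000003 + token) % num_buckets

def bigram_bucket_ids(tokens, num_buckets=NUM_BUCKETS):
    """One bucket index per position.

    Position 0 has no predecessor, so it is assigned bucket 0 (assumption).
    """
    ids = [0] * len(tokens)
    for i in range(1, len(tokens)):
        ids[i] = bigram_bucket(tokens[i - 1], tokens[i], num_buckets)
    return ids

ids = bigram_bucket_ids([17, 4052, 99, 17, 4052])
```

Each index would then select one `EMBED_DIM`-wide row of a `NUM_BUCKETS x EMBED_DIM` table. Distinct bigrams can collide in the same bucket, which is the trade made for a small table.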

SmearGate

SmearGate module included in the architecture.

parameters: null

weight tying

Tied input and output embeddings.

parameters: null

LeakyReLU

LeakyReLU squared activation.

parameters: {"negative_slope":0.5}
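
A scalar sketch of the activation, assuming "LeakyReLU squared" means LeakyReLU followed by an elementwise square (the function name is hypothetical):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU with the given slope, then square the result."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Under this reading the output is never negative: an input of -2.0 maps to -1.0 after the leaky step and to 1.0 after squaring.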

U-Net skip connections

U-Net style skip connections in the transformer stack.

parameters: null

RoPE

Rotary positional embeddings applied to only a subset of dimensions (partial RoPE).

parameters: {"dimensions":16,"base":10000}
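
A minimal sketch of partial rotary embeddings for a single vector, assuming the 16 rotated dimensions are the leading ones and are paired half-split (entry i with entry i + 8). Both are assumptions; interleaved pairing is equally common, and the head size used in any call is hypothetical.

```python
import math

def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims entries of vec by position-dependent
    angles and pass the remaining entries through unchanged."""
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        # Rotation frequency decreases geometrically with the pair index.
        theta = position * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.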

Quantization

GPTQ-lite

bits: 6

scope: all
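
To show what a 6-bit budget means for a weight row, here is a plain round-to-nearest symmetric quantizer. It is deliberately not GPTQ or the PR's GPTQ-lite procedure, whose details are not given in this summary; the per-row scale and the function names are assumptions.

```python
def quantize_row(row, bits=6):
    """Round-to-nearest symmetric quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

row = [0.5, -1.25, 0.03, 0.9]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
```

Each code fits in 6 bits, and the reconstruction error per weight is at most half the scale.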

Optimizer

Muon

weight_decay: 0.085

momentum: 0.99

other_params: {"matrix_lr":0.025}

Evaluation

sliding window eval

parameters: {"stride":64}
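
A sketch of how stride-64 sliding-window evaluation can partition a token stream so that every token is scored exactly once. The window length of 1024 is an assumption borrowed from the training length (eval_length is null in this summary), and the function name is hypothetical.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens [start, end) as context and only the tokens in
    [score_from, end) contribute to the loss for that window.
    """
    if num_tokens <= 0:
        return []
    spans = []
    scored_upto = 0
    end = min(window, num_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= num_tokens:
            break
        end = min(end + stride, num_tokens)
    return spans
```

After the first window, each step scores at most 64 new tokens, each with up to 960 tokens of preceding context. A smaller stride buys more context per scored token at the cost of more forward passes.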

LR Schedule

warmdown

parameters: {"warmdown_steps":2500}
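
A sketch of a warmdown schedule, assuming a constant learning rate followed by a linear decay to zero over the final 2500 steps. The linear shape and the `total_steps` argument are assumptions; the summary gives only the warmdown length.

```python
def lr_scale(step, total_steps, warmdown_steps=2500):
    """Multiplier on the base learning rate: 1.0 until warmdown begins,
    then a linear ramp down to 0.0 at total_steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

With a hypothetical 10000-step run, `lr_scale(8750, 10000)` is 0.5, so a base matrix_lr of 0.025 would be applied as 0.0125 at that step.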

Compression

brotli

level: 11

lzma

level: 9
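
A sketch of the byte-shuffle step with stride 2 followed by compression. It assumes the shuffle groups bytes by their offset modulo the stride, and that the LZMA fallback applies when the third-party `brotli` package is unavailable; the PR may trigger the fallback differently. Function names are hypothetical.

```python
import lzma

def byte_shuffle(data, stride=2):
    """Group bytes by offset modulo stride: all offset-0 bytes first,
    then all offset-1 bytes, and so on."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data, stride=2):
    """Invert byte_shuffle for the same stride."""
    n = len(data)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        count = len(range(i, n, stride))  # bytes that had offset i
        out[i::stride] = data[pos:pos + count]
        pos += count
    return bytes(out)

def compress_artifact(data, stride=2):
    """Shuffle, then compress with brotli quality 11 if available,
    otherwise with LZMA preset 9. Returns (codec_name, compressed_bytes)."""
    shuffled = byte_shuffle(data, stride)
    try:
        import brotli  # third-party; may not be installed
    except ImportError:
        return "lzma", lzma.compress(shuffled, preset=9)
    return "brotli", brotli.compress(shuffled, quality=11)
```

For two-byte values, a stride of 2 places all first bytes together and all second bytes together, which tends to give the compressor longer runs of similar bytes. Whether that helps on a given checkpoint is an empirical question.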

Sequence Length

sequence_length

train_length: 1024

eval_length: null

Regularization

weight decay

parameters: {"value":0.085}

Novel Contributions

Verified bug fix for XSA normalization: replacing F.rms_norm with F.normalize, because an RMS-normalized vector has length sqrt(head_dim) and the subtracted projection was therefore too large by a factor of head_dim.
Byte-shuffle with stride 2 followed by brotli at quality 11 for artifact compression, with LZMA preset 9 as a fallback.
Hypothesis that Muon weight decay acts as a compression-aware knob affecting artifact size.
Single-point comparison showing brotli+shuffle produced a smaller artifact than LZMA-9 on one checkpoint.