PR #1709
open
Non-record: XSA F.normalize fix + byte-shuffle/brotli + Muon WD as compression knob
by Bananakin1
val_bpb
1.1470
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.96 MB
Training Techniques
Architecture
XSA
Cross-subtraction attention with a corrected normalization step: the erroneous F.rms_norm call is replaced with F.normalize, so the projection is removed along a unit vector.
parameters: {"layers":11}
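The effect of the normalization fix can be illustrated with a minimal pure-Python sketch. The helpers `rms_norm` and `l2_normalize` are hypothetical stand-ins for `F.rms_norm` (without a learned weight) and `F.normalize`, and the vectors `v` and `y` are arbitrary illustrative data, not values from the PR. The sketch only shows why an RMS-normalized direction has norm sqrt(head_dim) and therefore scales the subtracted projection by head_dim; it does not reproduce the PR's attention code.

```python
import math

def rms_norm(v, eps=1e-12):
    # Stand-in for F.rms_norm without a weight: divide by the root mean square.
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def l2_normalize(v, eps=1e-12):
    # Stand-in for F.normalize: divide by the Euclidean norm.
    norm = max(math.sqrt(sum(x * x for x in v)), eps)
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_projection(y, direction):
    # Computes y - (y . d) d, which removes the component along d only if |d| = 1.
    coeff = dot(y, direction)
    return [a - coeff * b for a, b in zip(y, direction)]

head_dim = 64                                        # illustrative head size (assumption)
v = [float((i % 7) - 3) for i in range(head_dim)]    # arbitrary direction vector
y = [x + 1.0 for x in v]                             # arbitrary vector to clean up

fixed = remove_projection(y, l2_normalize(v))        # unit direction
buggy = remove_projection(y, rms_norm(v))            # direction with norm sqrt(head_dim)

rms_direction_norm = math.sqrt(dot(rms_norm(v), rms_norm(v)))  # about sqrt(head_dim)
residual_fixed = dot(fixed, v)                       # about 0: projection fully removed
residual_buggy = dot(buggy, v)                       # about (1 - head_dim) * dot(y, v)
```

With the unit-vector direction the result is orthogonal to `v`; with the RMS-normalized direction the subtraction overshoots by a factor of `head_dim`, leaving a large component pointing the opposite way.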
BigramHash
Hashed bigram embedding with 3072 buckets and 112 dimensions.
parameters: {"buckets":3072,"dimensions":112}
SmearGate
SmearGate module included in the architecture.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
LeakyReLU
LeakyReLU squared activation.
parameters: {"negative_slope":0.5}
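As a scalar sketch of the activation, assuming "LeakyReLU squared" means applying LeakyReLU with the listed negative slope and then squaring the result (the function name is hypothetical):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU first: pass positives through, scale negatives by the slope.
    y = x if x >= 0.0 else negative_slope * x
    # Then square; under this assumption the output is always non-negative.
    return y * y
```

Under this reading a negative input still contributes, but at a quarter of the magnitude of the mirrored positive input when the slope is 0.5.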
U-Net skip connections
U-Net style skip connections in the transformer stack.
parameters: null
RoPE
Rotary positional embeddings applied to only a subset of dimensions (partial RoPE).
parameters: {"dimensions":16,"base":10000}
Quantization
GPTQ-lite
bits: 6
scope: all
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"matrix_lr":0.025}
Evaluation
sliding window eval
parameters: {"stride":64}
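One common way to run a sliding-window evaluation is sketched below. It assumes the window length equals the training length of 1024 (the eval length is not given in this record) and that each window scores only the tokens not already scored by an earlier window. The function name and the triple layout are hypothetical.

```python
def sliding_windows(num_tokens: int, window: int = 1024, stride: int = 64):
    """Return (start, end, score_from) triples covering every token exactly once.

    The model sees tokens[start:end] as context, and only tokens[score_from:end]
    contribute to the loss, so later tokens get close to a full window of context.
    """
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end == num_tokens:
            break
        start += stride  # advance the context by one stride
    return spans
```

For example, `sliding_windows(1100)` scores the first 1024 tokens in one window, then 64 tokens, then the remaining 12. A smaller stride gives each scored token more context at the cost of more forward passes.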
LR Schedule
warmdown
parameters: {"warmdown_steps":2500}
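A warmdown schedule is often implemented as a multiplier on the base learning rate. The sketch below assumes a linear decay to zero over the final 2500 steps; the decay shape and the total step count are not given in this record, so `total_steps` is a hypothetical argument.

```python
def warmdown_scale(step: int, total_steps: int, warmdown_steps: int = 2500) -> float:
    """Learning-rate multiplier: 1.0 until the warmdown begins, then linear decay to 0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)

# Example with a hypothetical 10000-step run and the matrix_lr listed above.
matrix_lr = 0.025
lr_midway_through_warmdown = matrix_lr * warmdown_scale(8750, total_steps=10000)
```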
Compression
brotli
level: 11
lzma
level: 9
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Regularization
weight decay
parameters: {"value":0.085}
Novel Contributions
- Verified bug fix for XSA normalization: replacing F.rms_norm with F.normalize. An RMS-normalized vector has norm sqrt(head_dim) rather than 1, so the old code over-subtracted the projection by a factor of head_dim.
- Byte-shuffle with stride 2 followed by brotli q=11 for artifact compression, with LZMA-9 fallback.
- Hypothesis that Muon weight decay acts as a compression-aware knob affecting artifact size.
- Single-point comparison showing brotli+shuffle produced a smaller artifact than LZMA-9 on one checkpoint.
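The byte-shuffle step in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the stride-2 shuffle groups bytes by their position modulo 2 (for example, separating the low and high bytes of 16-bit values), and that the LZMA fallback applies when the third-party `brotli` package is unavailable. All function names are hypothetical.

```python
import lzma

try:
    import brotli  # third-party package; optional in this sketch
except ImportError:
    brotli = None

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    # Group bytes by position modulo stride: b"abcdef" -> b"ace" + b"bdf".
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    # Inverse of byte_shuffle: scatter each group back to its original positions.
    n = len(data)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        count = len(range(i, n, stride))
        out[i::stride] = data[pos:pos + count]
        pos += count
    return bytes(out)

def compress_artifact(data: bytes):
    # Shuffle first, then compress with brotli quality 11, or LZMA preset 9 as fallback.
    shuffled = byte_shuffle(data, 2)
    if brotli is not None:
        return "brotli", brotli.compress(shuffled, quality=11)
    return "lzma", lzma.compress(shuffled, preset=9)

def decompress_artifact(codec: str, blob: bytes) -> bytes:
    raw = brotli.decompress(blob) if codec == "brotli" else lzma.decompress(blob)
    return byte_unshuffle(raw, 2)
```

Storing the codec name alongside the blob lets the loader pick the matching decompressor. Whether shuffling helps depends on the data, so it is worth measuring both variants per checkpoint.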