PR #349

open

Record: 11L XSA + EMA + Int5-MLP (val_bpb=1.1399)

val_bpb: 1.1399
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to the last 4 of 11 layers.
parameters: {"layers":4,"total_layers":11}
SmearGate
Custom gating mechanism used in the architecture.
parameters: null
BigramHash
BigramHash feature module with 2048 buckets and 128-dim embeddings.
parameters: {"buckets":2048,"dim":128}
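The PR specifies BigramHash only by its bucket count and embedding width. A minimal sketch of what such a module could look like, assuming bigrams of adjacent token ids are hashed into 2048 buckets (the choice of hash function here is an assumption; any stable hash works):

```python
import hashlib

BUCKETS = 2048   # from parameters
DIM = 128        # embedding width, from parameters

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Hash a (prev, cur) token-id bigram into one of BUCKETS buckets."""
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % BUCKETS

def bigram_buckets(token_ids: list[int]) -> list[int]:
    """Bucket index per position; position 0 pairs with a BOS id of 0 (assumed)."""
    out = []
    prev = 0
    for t in token_ids:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```

Each bucket index would then select a row of a (2048, 128) embedding table that is added to the regular token embedding.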
U-Net skip connections
Skip connections inspired by U-Net added to the Transformer.
parameters: null
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
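With 8 query heads sharing 4 KV heads, grouped-query attention pairs every two query heads with one KV head. A sketch of the head-to-KV-head mapping implied by those parameters:

```python
HEADS = 8       # query heads, from parameters
KV_HEADS = 4    # shared key/value heads, from parameters
GROUP = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that query head q_head attends with."""
    return q_head // GROUP
```

This halves the KV-cache size relative to full multi-head attention while keeping 8-way query diversity.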
Weight Averaging
EMA
Exponential moving average of the weights, kept in float32 on the GPU and updated every step.
parameters: {"decay":0.997,"update_frequency":"every step","device":"GPU","dtype":"float32"}
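Because the EMA shadow lives in float32 on the same device as the weights, the per-step update is a single in-place blend with no CPU transfer. A framework-agnostic sketch with the stated decay:

```python
DECAY = 0.997  # from parameters

def ema_update(ema: list[float], weights: list[float], decay: float = DECAY) -> None:
    """In-place EMA update: ema <- decay * ema + (1 - decay) * w."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w
```

In a real training loop this would be a fused tensor op (e.g. a lerp) over each parameter, run under no-grad after the optimizer step.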
Quantization
mixed int5/int6/int8
bits: null
scope: int5 for MLP weights, int6 for attention weights, int8 for embeddings; FP16 kept for small tensors
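One quantizer parameterized by bit width covers all three integer cases (int5 MLP, int6 attention, int8 embeddings). A sketch assuming symmetric per-tensor absmax scaling; the actual scaling scheme is not stated in the PR:

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Quantize to signed `bits`-bit integers with an absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 15 for int5, 31 for int6, 127 for int8
    absmax = max((abs(w) for w in weights), default=0.0)
    scale = (absmax / qmax) if absmax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from quantized values."""
    return [x * scale for x in q]
```

The int5/int6 values would then be bit-packed before the zstd pass, since no standard dtype holds them natively.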
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings","tied_embed_lr":0.035}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
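A sketch of the schedule implied by the stated hyperparameters: 20 linear warmup steps, then a flat phase, then a 3000-step cosine warmdown to zero. The total step count and the flat middle phase are assumptions; the PR only fixes the warmup and warmdown lengths:

```python
import math

WARMUP = 20      # from parameters
WARMDOWN = 3000  # from parameters

def lr_multiplier(step: int, total_steps: int) -> float:
    """LR scale: linear warmup, flat middle, cosine warmdown to 0."""
    if step < WARMUP:
        return (step + 1) / WARMUP
    if step < total_steps - WARMDOWN:
        return 1.0
    t = (step - (total_steps - WARMDOWN)) / WARMDOWN  # 0 -> 1 over warmdown
    return 0.5 * (1.0 + math.cos(math.pi * t))
```

The multiplier would scale both the Muon matrix LR (0.025) and the AdamW embedding LR (0.035).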
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"pruning_ratio":0.08}
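A sketch of 8% magnitude pruning: zero the smallest 8% of weights by absolute value. Whether the threshold is per-tensor or global is not stated, so per-tensor is assumed here:

```python
def magnitude_prune(weights: list[float], ratio: float = 0.08) -> list[float]:
    """Zero out the `ratio` fraction of weights with smallest |w|.

    Ties at the threshold may prune slightly more than the target count.
    """
    k = int(len(weights) * ratio)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

The resulting runs of zeros in the quantized weights are what make the zstd-22 pass effective at squeezing the artifact under 16 MB.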
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • 11-layer Transformer with XSA applied to the last 4 layers
  • Continuous GPU float32 EMA updated every step without CPU transfers
  • Mixed int5 MLP / int6 attention / int8 embedding quantization
  • 8% magnitude pruning combined with zstd-22 compression
  • Sliding-window evaluation with stride 64
  • Muon optimizer with cosine warmdown schedule