| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1365 | Transformer | Muon | 15,759,319 |
## Training Techniques
### Architecture
**XSA**: used in the last 4 layers of the model. Parameters: `{"layers": 4}`
**Partial RoPE**: applies rotary positional embeddings to only part of the dimensions (16 of 64). Parameters: `{"dimensions": "16/64"}`
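A minimal sketch of partial RoPE under the stated 16/64 split: only the first 16 dimensions of a 64-dimensional vector are rotated, the rest pass through unchanged. Function and variable names here are illustrative, not taken from the submission.

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` entries of vector `x` (length 64 here)
    by position-dependent angles; remaining dimensions pass through unchanged."""
    half = rope_dims // 2
    out = list(x)
    for i in range(half):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

At position 0 all angles are zero, so the vector is returned unchanged; at later positions only the first 16 dimensions move.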
**SmearGate**: adds a SmearGate to the architecture. Parameters: none
**BigramHash**: a BigramHash component with a 10,240-entry vocabulary/hash table. Parameters: `{"size": 10240}`
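The submission does not specify the hash function, but the idea of a bigram hash is to map each (previous token, current token) pair into a fixed-size table of 10,240 buckets, each of which can index an auxiliary embedding. A hypothetical sketch:

```python
def bigram_hash_ids(tokens, size=10240, prime=1000003):
    """Map each (previous, current) token pair to a bucket in a `size`-entry
    hash table; the first position pairs with a BOS sentinel (-1).
    The multiplier/mix is illustrative, not the submission's actual hash."""
    ids = []
    prev = -1
    for t in tokens:
        ids.append(((prev * prime) ^ t) % size)
        prev = t
    return ids
```

Each resulting id would select one row of a 10,240-row embedding table that is added to (or concatenated with) the normal token embedding.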
**MLP3x**: a 3x-wider MLP block. Parameters: `{"layers": 3}`
### Weight Averaging

**EMA** (exponential moving average of weights). Parameters: `{"decay": 0.997}`
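EMA weight averaging keeps a shadow copy of the parameters updated as `shadow = decay * shadow + (1 - decay) * current` after each step, and the shadow copy is what gets evaluated/shipped. A minimal sketch with scalar parameters (a real implementation would update tensors in place; the class name is illustrative):

```python
class EmaWeights:
    """Exponential moving average of model parameters (decay 0.997 here)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)  # copy: name -> value

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            # shadow <- d * shadow + (1 - d) * current
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value
```

With decay 0.997, the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.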
### Regularization

**LN Scale**. Parameters: none
### Quantization

**Mixed int5/int6**: int5 for the MLP weights, int6 for the attention weights.
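The submission's exact quantization scheme (grouping, rounding, symmetric vs. asymmetric) is not given; a plain symmetric per-tensor scheme, parameterized by bit width so it covers both the 5-bit MLP case and the 6-bit attention case, looks like this:

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integer codes.
    Returns (codes, scale) such that value ~= code * scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 15 for int5, 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid scale 0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

Going from 6 to 5 bits halves the number of representable levels, so applying int5 only to the (more redundant) MLP weights while keeping attention at int6 is a reasonable size/quality trade-off.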
### Compression

**zstd** at level 22.
### Evaluation

**Sliding window eval**. Parameters: `{"stride": 64}`
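In the usual strided sliding-window evaluation, each new window advances by the stride and only the final `stride` tokens are scored, so every token is evaluated exactly once with near-maximal left context. A sketch of the window bookkeeping (the helper name and the exact policy for the first window are assumptions, not from the submission):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, n_scored) spans: score the last `n_scored` tokens of
    each window so every token is evaluated exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        # First window scores everything it covers; later windows score
        # only the `stride` fresh tokens at their right edge.
        n_scored = min(stride, n_tokens - pos) if pos else min(window, n_tokens)
        end = pos + n_scored
        start = max(0, end - window)
        spans.append((start, end, n_scored))
        pos = end
    return spans
```

A smaller stride gives each scored token more context (and a lower, more honest bpb) at the cost of proportionally more forward passes.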
### Optimizer

**Muon**: weight_decay 0.04, momentum 0.99.

**AdamW**: hyperparameters not specified.
### LR Schedule

**Warmdown**. Parameters: `{"warmdown_steps": 3000}`
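A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps (3,000 here). A minimal sketch, assuming no warmup phase since none is listed:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to zero over the final
    `warmdown_steps` training steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

For a 10,000-step run the rate stays at `base_lr` through step 6,999, then falls linearly, reaching zero at the final step.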
### Sequence Length

Train length 2048; eval length not specified.
## Novel Contributions
- 10-layer 512d Transformer with XSA in the last 4 layers
- EMA with decay 0.997
- Partial RoPE applied to 16/64 dimensions
- LN Scale
- SmearGate and BigramHash(10240, 128)
- Mixed int5 MLP / int6 attention quantization
- 3.2% pruning
- zstd-22 artifact compression
- Sliding window evaluation with stride 64