PR #564
openRecord: 11L Tight SWA + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1270)
by sadeghja1070
val_bpb
1.1270
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5 MB
Training Techniques
Weight Averaging
SWA
parameters: {"name":"Tight SWA","scale_threshold":0.2,"frequency_steps":50,"checkpoint_count":12,"description":"SWA checkpoint collection restricted to scale<0.2 (last ~600 steps), every 50 steps, eliminating SWA quality penalty while maintaining quantization-friendly weight averaging."}
Architecture
XSA
Exclusive Self Attention applied on the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary Positional Embeddings applied to 16 of 64 head dimensions, with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
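A minimal sketch of partial RoPE as described above: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The NTK scaling factor `NTK_ALPHA` and the base-stretching formula are assumptions (the PR gives only the 16/64 split):

```python
import numpy as np

ROT_DIMS = 16      # rotated dims per head (from PR params)
HEAD_DIM = 64      # total dims per head (from PR params)
BASE = 10000.0
NTK_ALPHA = 2.0    # assumed NTK-aware scaling factor, not given in the PR

def partial_rope(x, positions):
    """Rotate the first ROT_DIMS of HEAD_DIM dims; pass the rest through.
    x: (seq, HEAD_DIM), positions: (seq,)."""
    half = ROT_DIMS // 2
    # NTK-aware scaling: stretch the base so low frequencies extrapolate
    # beyond the training context (common formulation; an assumption here).
    base = BASE * NTK_ALPHA ** (ROT_DIMS / (ROT_DIMS - 2))
    inv_freq = base ** (-np.arange(half) * 2.0 / ROT_DIMS)
    theta = positions[:, None] * inv_freq[None, :]        # (seq, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROT_DIMS:]], axis=-1)
```

Rotation preserves the norm of each rotated pair, so the untouched 48 dimensions and the overall scale of the rotated block are unchanged.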
LN Scale
LayerNorm scale factor set to 1/sqrt(layer_idx+1)
parameters: null
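The LN Scale rule above (gain fixed to 1/sqrt(layer_idx+1)) might look like this sketch; whether a learned gain sits on top of the fixed factor is not stated in the PR, so this version uses the fixed factor alone:

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx + 1),
    damping the contribution of deeper layers (sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    return normed / np.sqrt(layer_idx + 1)
```

Layer 0 keeps the standard unit-variance output; layer 3, for instance, is scaled by exactly 1/2.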
SmearGate
SmearGate applied as an architectural component
parameters: null
BigramHash
Bigram hashing with 2048 buckets and dimension 128
parameters: {"buckets":2048,"dimension":128}
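The BigramHash component above (2048 buckets, dimension 128) can be sketched as a hashed embedding over (previous, current) token pairs. The hash multiplier, the init scale, and the handling of position 0 are assumptions:

```python
import numpy as np

BUCKETS = 2048   # from PR params
DIM = 128        # from PR params

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)) * 0.02  # assumed init scale

def bigram_hash(prev_tok, cur_tok):
    """Hash a (prev, cur) token pair into one of BUCKETS buckets.
    The odd multiplier 1000003 is an arbitrary choice (assumption)."""
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def bigram_features(tokens):
    """Per-position bigram embedding, one vector per token position.
    Position 0 has no previous token; bucket 0 is used as a stand-in."""
    ids = [bigram_hash(p, c) for p, c in zip(tokens[:-1], tokens[1:])]
    return bigram_table[[0] + ids]
```

These features would typically be added to the token-embedding stream before the first transformer block.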
MLP3x
MLP expansion factor 3 with relu-squared activation
parameters: {"expansion_factor":3,"activation":"relu-squared"}
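The MLP3x block above, with a 3x hidden expansion and the relu-squared activation, reduces to a few lines (bias-free, as is common in this family of models; that detail is an assumption):

```python
import numpy as np

EXPANSION = 3  # from PR params: hidden width = 3 * d_model

def relu2(x):
    """relu-squared activation: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """Two-layer MLP with 3x expansion and relu-squared (sketch)."""
    return relu2(x @ w_in) @ w_out

d = 8  # toy model width for illustration
rng = np.random.default_rng(1)
w_in = rng.standard_normal((d, EXPANSION * d)) / np.sqrt(d)
w_out = rng.standard_normal((EXPANSION * d, d)) / np.sqrt(EXPANSION * d)
```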
weight tying
Tied input and output embedding matrices
parameters: null
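Weight tying means the output head reuses the input embedding table rather than keeping a separate unembedding matrix, halving the embedding parameter count. A toy-sized sketch:

```python
import numpy as np

VOCAB, D = 32, 8  # toy sizes for illustration
rng = np.random.default_rng(2)
E = rng.standard_normal((VOCAB, D)) * 0.02  # the single shared embedding table

def embed(tokens):
    """Input side: look up token embeddings."""
    return E[tokens]

def logits(hidden):
    """Output side: project against the SAME table, transposed (tied head)."""
    return hidden @ E.T
```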
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"learning_rate_matrix":0.025,"learning_rate_scalar":0.025,"learning_rate_embedding":0.035,"gradient_clip":0.3,"adamw_for_embeddings":true}
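The momentum warmup in other_params (0.92 to 0.99 over 1500 steps) can be sketched as a small schedule function; the linear shape is an assumption, since the PR gives only the endpoints and the warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp from `start` to `final` over
    `warmup_steps`, then hold. Endpoints and length are from the PR;
    the linear interpolation is an assumption."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```

Per the same block, embeddings (and scalars) are handled by a separate AdamW optimizer rather than Muon.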
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
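Sliding-window evaluation with stride 64 means each 2048-token window scores only its last 64 tokens (so every scored token sees close to the full context), except the first window, which scores everything. A sketch of the span bookkeeping, with the scoring convention as an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.
    Tokens in [score_from, end) are scored; [start, score_from) is context.
    The first window scores all of its tokens (assumed convention)."""
    spans = [(0, min(window, n_tokens), 0)]
    pos = spans[0][1]
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = end - window if end >= window else 0
        spans.append((start, end, pos))
        pos = end
    return spans
```

Every token is scored exactly once, and no window exceeds the 2048-token eval length.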
Initialization
OrthoInit
Orthogonal initialization with projection scaling by 1/sqrt(2*num_layers)
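The OrthoInit scheme above can be sketched via a QR decomposition; the PR applies the 1/sqrt(2*num_layers) factor to projection layers, and `NUM_LAYERS = 11` is taken from the "11L" in the title:

```python
import numpy as np

NUM_LAYERS = 11  # from the PR title ("11L")

def ortho_init(rows, cols, num_layers=NUM_LAYERS, seed=0):
    """Orthogonal init scaled by 1/sqrt(2 * num_layers) (sketch).
    QR of a Gaussian matrix gives orthonormal columns; the sign fix
    makes the distribution uniform over orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity
    return q / np.sqrt(2 * num_layers)
```

With 11 layers, the Gram matrix of the result is the identity divided by 22, keeping residual-branch variance roughly constant with depth.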
Regularization
weight decay
parameters: {"value":0.04}
Other
other
U-Net skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
Novel Contributions
- Tight SWA: restricting SWA checkpoint collection to LR scale < 0.2 (the last ~600 steps), sampling every 50 steps, eliminating the SWA quality penalty while keeping weight averaging quantization-friendly.
- Use of Partial RoPE on 16/64 dimensions with NTK-aware scaling.
- Applying Exclusive Self Attention (XSA) on last 4 layers.
- LayerNorm scale factor set to 1/sqrt(layer_idx+1).
- Combination of SmearGate and BigramHash (2048 buckets, dim=128) in architecture.
- Int6 per-row quantization for MLP and attention weights combined with Int8 per-row for embeddings.
- Orthogonal initialization with projection scaling by 1/sqrt(2*num_layers).
- Use of Muon optimizer with momentum warmup and separate AdamW for embeddings and scalars.
- U-Net style skip connections with 5 encoder and 6 decoder layers.
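The Int6/Int8 per-row quantization listed among the contributions can be sketched as symmetric per-row scaling, where each row's max-magnitude entry maps to the integer extreme. Int6 would apply to MLP/attention weights and Int8 to embeddings; the symmetric round-to-nearest scheme itself is an assumption:

```python
import numpy as np

def quantize_per_row(w, bits):
    """Symmetric per-row quantization: one scale per row so the row's
    max-|value| entry maps to qmax (31 for int6, 127 for int8). Sketch."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from codes and per-row scales."""
    return q.astype(np.float64) * scale
```

Round-to-nearest bounds the per-entry reconstruction error by half a quantization step, i.e. scale/2 per row.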