PR #315
Status: closed
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
by jfprincz
val_bpb: 1.1248
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6 MB
Training Techniques
Architecture
Partial RoPE
Apply rotary position embeddings to only part of the head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64}
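A minimal NumPy sketch of the idea: rotate only the first 16 of 64 head dimensions and pass the rest through position-free. The frequency base of 10000 is the usual RoPE default, not something stated in the PR.

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    # Rotate only the first `rope_dims` of each head dimension;
    # the remaining dims carry no positional signal (PR: 16 of 64).
    half = rope_dims // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```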
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
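The PR does not spell out the XSA definition. One plausible reading of "Exclusive Self Attention" is a causal mask where each token is excluded from attending to itself; a sketch of that mask, under that assumption:

```python
import numpy as np

def xsa_mask(seq_len):
    # Assumed reading of "Exclusive Self Attention": causal attention
    # with the diagonal masked out, so a token never attends to itself.
    # Token 0 keeps itself so its softmax row is not fully masked.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-1)
    mask[0, 0] = True
    return mask
```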
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
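The scale rule is simple to state in code: the norm output of layer `layer_idx` is damped by `1/sqrt(layer_idx + 1)`, so layer 0 is unscaled and layer 3 is halved. A sketch using RMSNorm:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def scaled_rmsnorm(x, layer_idx):
    # Damp the norm output by 1/sqrt(layer_idx + 1), keeping
    # residual-stream growth roughly flat with depth.
    return rmsnorm(x) / np.sqrt(layer_idx + 1)
```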
Weight Averaging
EMA
parameters: {"decay":0.997}
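Standard EMA weight averaging, with the PR's decay of 0.997; the shadow copy is what gets evaluated:

```python
import numpy as np

class WeightEMA:
    # Shadow copy of the weights, updated each step as
    #   shadow <- decay * shadow + (1 - decay) * current
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {name: w.copy() for name, w in params.items()}

    def update(self, params):
        d = self.decay
        for name, w in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * w
```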
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6; embeddings int8
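A sketch of the quantization step, assuming symmetric per-tensor scales (the PR does not state its grouping, and per-channel scales are a common refinement). Per the scope above, MLP/attention weights would use `bits=6` and embeddings `bits=8`:

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Symmetric round-to-nearest quantization with a single scale.
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

int6 values still travel in int8 storage here; packing them to 6 bits before zstd is a separate (unspecified) step.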
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
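A sketch of the usual sliding-window bpb protocol this implies: slide a full-context window by `stride` tokens and score only the tokens not already covered, so every token is scored exactly once with near-maximal left context. The PR's values would be `context=2048, stride=64`.

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # Returns (ctx_start, score_from, end) triples: each window sees up
    # to `context` tokens but is scored only on the new ones.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, max(scored, start), end))
        scored = end
        start += stride
    return spans
```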
Initialization
OrthoInit
Orthogonal initialization with muP scaling on large matrices.
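A sketch of orthogonal init via QR of a Gaussian, with a sign fix so the result is Haar-uniform. The `1/sqrt(fan_in)` factor is one common muP-flavored choice and is an assumption here; the PR only says "muP scaling on large matrices".

```python
import numpy as np

def ortho_init(fan_out, fan_in, rng, mup_scale=True):
    # Assumes fan_out >= fan_in so Q has orthonormal columns.
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix -> Haar-uniform distribution
    # Assumed muP-style scaling; exact rule is not stated in the PR.
    return q / np.sqrt(fan_in) if mup_scale else q
```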
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
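These parameters describe a trapezoidal schedule: linear warmup over 1500 steps, a constant plateau, then a linear warmdown to zero over the final 3000 steps. A sketch of the multiplier (`total_steps` is whatever the run uses; it is not stated above):

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    # Trapezoid: ramp up, hold at 1.0, ramp down to 0 at the end.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters
    return 1.0
```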
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"adam_weight_decay":0.04}
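The `muon_momentum_warmup_*` entries suggest a linear momentum ramp from 0.92 to the final 0.99 over the first 1500 steps; a sketch of that schedule:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp Muon's momentum over the warmup, then hold.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```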
Novel Contributions
- Partial RoPE applied to only 16 of 64 head dimensions
- LayerNorm/RMSNorm output scaling by 1/sqrt(layer_idx+1)
- 11-layer Transformer with XSA on the last 4 layers
- EMA weight averaging with decay 0.997
- Mixed int6/int8 quantization with zstd compression
- Caveat (not a contribution): a late-QAT flag was present but had no effect, because torch.compile constant-folded it away