PR #482
Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)
Status: closed
by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB
Training Techniques
Architecture
BigramLogitHead
Exact bigram logit lookup table initialized from corpus counts, giving the model a strong Markov prior before any training; applied before the logit softcap.
parameters: {"size":"1024x1024"}
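A minimal sketch of how such a table can be count-initialized (the 1024x1024 size comes from the parameters above; the add-k smoothing value is an assumption, not taken from the PR):

```python
import numpy as np

def count_init_bigram_logits(tokens, vocab_size=1024, smoothing=1.0):
    """Build a bigram logit table from corpus transition counts.

    The table acts as an additive Markov prior on next-token logits
    (applied before the logit softcap). `smoothing` is add-k smoothing;
    its value here is a guess.
    """
    counts = np.full((vocab_size, vocab_size), smoothing, dtype=np.float64)
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Normalize rows to transition probabilities, then take logs.
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(probs)

# Usage: logits = hidden_logits + bigram_table[prev_token]
```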
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
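The PR does not spell out the XSA mechanism; one plausible reading of "exclusive" self attention is a strictly causal mask in which a token attends only to earlier positions, never to itself. A hypothetical mask under that assumption:

```python
import numpy as np

def xsa_mask(seq_len):
    """Strictly lower-triangular attention mask: position i may attend
    to j < i but not to itself. Whether this is what XSA means here is
    an assumption; the PR only names the technique. Note position 0
    attends to nothing, so a real implementation needs a fallback
    (e.g. an attention sink or a zero value).
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-1)
```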
Partial RoPE
Rotary position embeddings applied to only a subset of the head dimensions (16 of 64); the remaining dimensions receive no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
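A sketch of partial RoPE under the parameters above, rotating only the first 16 of 64 head dimensions (the first-half/second-half pairing convention and the base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head dim.

    x: (..., seq, head_dim). Dims [0:rope_dims] are rotated in pairs
    (i, i + rope_dims // 2); the remaining dims pass through unchanged.
    """
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```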
Quantization
int4
bits: 4
scope: bigram logit table
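A sketch of symmetric int4 quantization for the bigram logit table; the per-row scale granularity is an assumption (the PR only states the table is stored in int4):

```python
import numpy as np

def quantize_int4(table):
    """Symmetric int4 quantization with one fp scale per row.

    Values are mapped into [-8, 7]; with symmetric scaling only
    [-7, 7] is actually used.
    """
    scale = np.abs(table).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # guard all-zero rows
    q = np.clip(np.round(table / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale
```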
Regularization
layerwise LN scale
Per-layer scaling of LayerNorm outputs by 1/sqrt(layer+1).
parameters: {"scale":"1/sqrt(layer+1)"}
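A minimal sketch of the layerwise LN scale; 0-indexed layers are assumed, and exactly where the scale multiplies (norm output vs. residual branch) is not specified in the PR:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def scaled_ln(x, layer):
    """LayerNorm output scaled by 1/sqrt(layer+1), so deeper layers
    contribute progressively less to the residual stream."""
    return layer_norm(x) / np.sqrt(layer + 1)
```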
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
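Muon's core step approximately orthogonalizes each 2-D weight update (the momentum-averaged gradient) with a quintic Newton-Schulz iteration before applying it; a minimal numpy sketch of that step, using the coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix via the quintic
    Newton-Schulz iteration used by Muon. In the optimizer this runs
    on the momentum buffer of each weight matrix, and the result is
    used as the update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize first
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T  # iterate on the short-fat orientation
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x
```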
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
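The SWA entry above (start at 40% of training, fold in a checkpoint every 50 steps, 22 checkpoints total) can be sketched as a streaming running mean; the dict-of-arrays weight format is an illustration, not the PR's actual code:

```python
import numpy as np

class SWA:
    """Streaming weight average: from `start_step` on, fold the current
    weights into a running mean every `every` steps."""
    def __init__(self, start_step, every=50):
        self.start_step, self.every = start_step, every
        self.avg, self.n = None, 0

    def update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in weights.items()}
        else:
            for k, v in weights.items():
                # incremental mean: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n
```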
Evaluation
sliding window eval
parameters: {"stride":64}
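A sketch of sliding-window evaluation with stride 64: each window supplies up to a full context of left tokens, but only the newly covered targets are scored. The `score_fn` callable is a stand-in for the model, not part of the PR:

```python
import numpy as np

def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    """Average NLL over `tokens`, re-scoring each position with up to
    `window` tokens of left context. `score_fn(ctx)` returns one NLL
    per target position (len(ctx) - 1 values). The first window counts
    all its targets; later windows count only the last `stride`."""
    total, count = 0.0, 0
    for i, end in enumerate(range(window, len(tokens) + 1, stride)):
        nll = np.asarray(score_fn(tokens[end - window:end]))
        take = nll if i == 0 else nll[-stride:]
        total += take.sum()
        count += len(take)
    return total / count
```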
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":2800,"warmup_steps":20}
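The schedule above (20 warmup steps, 2800 warmdown steps) is typically a trapezoid: linear warmup, a constant plateau, then a linear decay to zero. That flat-then-linear shape and the 5000-step total in the test are assumptions; the PR only gives the two step counts:

```python
def warmdown_lr(step, total_steps, warmup_steps=20, warmdown_steps=2800):
    """LR multiplier: linear warmup over `warmup_steps`, constant at
    1.0, then linear decay to 0 over the final `warmdown_steps`."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```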
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
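A sketch of orthogonal initialization via QR, with an extra muP-style scale on output projections; the exact muP factor (1/sqrt(fan_in)) is an assumption:

```python
import numpy as np

def ortho_init(rng, fan_out, fan_in, is_output_proj=False, scale=1.0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q.
    Output projections get an additional 1/sqrt(fan_in) scale
    (muP-style; the precise factor is a guess)."""
    a = rng.normal(size=(fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity
    w = q if fan_out >= fan_in else q.T
    if is_output_proj:
        scale = scale / np.sqrt(fan_in)
    return scale * w
```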
Compression
zstd
level: 22
Other
other
Int4 nibble packing/unpacking for signed values, storing two int4 values per byte to reduce bigram table size.
parameters: {"packing":"two values per byte"}
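The packing described above can be sketched as follows; the nibble order (even indices in the low nibble) is a guess, since the PR only states two values per byte:

```python
import numpy as np

def pack_int4(values):
    """Pack signed int4 values (range [-8, 7]) two per byte:
    even indices in the low nibble, odd indices in the high nibble."""
    u = (np.asarray(values, dtype=np.int8) & 0x0F).astype(np.uint8)
    if u.size % 2:
        u = np.concatenate([u, np.zeros(1, dtype=np.uint8)])  # pad
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, n):
    """Invert pack_int4, recovering the first `n` signed values."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    out[out >= 8] -= 16  # sign-extend: nibbles >= 8 are negative
    return out[:n]
```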
Novel Contributions
- Count-initialized exact bigram logit head using corpus bigram transition probabilities as a strong Markov prior
- Int4 nibble packing for the bigram logit table to halve storage cost
- XSA on the last 4 layers
- Partial RoPE on 16 of 64 dimensions
- Layer-wise LN scaling by 1/sqrt(layer+1)