PR #485
openRecord: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)
by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB
Training Techniques
Architecture
BigramLogitHead
Count-initialized exact bigram lookup table used as a logit bias head before softcap, initialized from corpus transition probabilities.
parameters: {"vocab_size":1024,"clipping":"[-4, 4]","smoothing_alpha":0.25}
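A count-initialized table like this can be built directly from the corpus token stream. The sketch below is a minimal reconstruction from the stated parameters (alpha=0.25 smoothing, clipping to [-4, 4]); the PR's exact head wiring (how the bias enters before softcap) is not shown, and the function name is hypothetical.

```python
import numpy as np

def build_bigram_logit_table(tokens, vocab_size=1024, alpha=0.25, clip=4.0):
    """Count token transitions, smooth with alpha, and return clipped
    log-probabilities to use as a per-previous-token logit bias."""
    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    # Accumulate exact bigram counts from the corpus token stream.
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Additive smoothing with alpha, then row-normalize into probabilities.
    smoothed = counts + alpha
    probs = smoothed / smoothed.sum(axis=1, keepdims=True)
    # Clipped log-probs serve as the logit bias: logits += table[prev_token].
    return np.clip(np.log(probs), -clip, clip)

tokens = np.array([1, 2, 1, 2, 3])
table = build_bigram_logit_table(tokens, vocab_size=8)
```

At inference the previous token indexes a row of the table, which is added to the model's logits before the softcap.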
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
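With 16 of 64 head dimensions rotated, the remaining 48 pass through without positional signal. A minimal sketch, assuming the standard RoPE pairing and base frequency (the PR's exact dimension layout may differ):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of each
    head's dimensions; leave the rest untouched. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # Standard RoPE inverse frequencies for the rotated sub-dimensions.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Pass-through dimensions carry no positional information.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.randn(32, 64)
q_rot = partial_rope(q)
```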
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
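With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal single-sequence sketch (no causal mask or batching; shapes and helper name are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (n_heads, T, d); k and v are
    (n_kv_heads, T, d) and are shared across query-head groups."""
    group = n_heads // n_kv_heads
    # Repeat each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = np.random.randn(8, 16, 8)
k = np.random.randn(4, 16, 8)
v = np.random.randn(4, 16, 8)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves KV-cache size while keeping the full query-head count.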
MLP3x
MLP hidden size expanded to 3x the model dimension.
parameters: {"hidden_size":1536}
Quantization
int4
bits: 4
scope: bigram logit table
int5
bits: 5
scope: MLP weights
int6
bits: 6
scope: attention weights
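The int4 scope (the bigram logit table) pairs with the nibble packing listed under Novel Contributions: two signed int4 values per byte halves storage. A sketch assuming a low-nibble/high-nibble layout (the PR's actual byte layout is not stated):

```python
import numpy as np

def pack_int4(values):
    """Pack signed int4 values (range [-8, 7]) two per byte:
    even indices in the low nibble, odd indices in the high nibble."""
    assert len(values) % 2 == 0
    u = (np.asarray(values, dtype=np.int8) & 0x0F).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: split nibbles and sign-extend."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(2 * len(packed), dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    # Nibbles >= 8 encode negative values in two's complement.
    out[out >= 8] -= 16
    return out

vals = np.array([-4, 3, 0, -1], dtype=np.int8)
packed = pack_int4(vals)
round_trip = unpack_int4(packed)
```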
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
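The stated 1/sqrt(layer+1) rule damps the gain of deeper layers' norms. A one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale(layer_idx):
    """Per-layer LayerNorm gain: 1/sqrt(layer+1), so deeper layers
    contribute smaller residual updates (layers assumed 0-indexed)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(4)]
```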
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
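From these parameters, averaging begins at 40% of training and takes a checkpoint every 50 steps. A sketch of the schedule plus the running-mean update; the 1750-iteration total below is hypothetical, chosen only so the schedule yields the 22 checkpoints listed:

```python
def swa_schedule(total_iters, start_frac=0.4, every=50):
    """Iterations at which a checkpoint joins the weight average."""
    start = int(total_iters * start_frac)
    return list(range(start, total_iters + 1, every))

def swa_update(avg, w, n):
    """Fold checkpoint n (1-indexed) into the running average:
    avg <- avg + (w - avg) / n."""
    return [a + (x - a) / n for a, x in zip(avg, w)]

iters = swa_schedule(1750)  # hypothetical run length
```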
Evaluation
sliding window eval
parameters: {"stride":64}
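A stride-64 sliding evaluation scores only the newest 64 tokens of each window while re-encoding up to a full context of left tokens, so every scored token gets near-maximal context. A sketch of the window indexing only, assuming the 2048-token train length as the eval context:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Return (context_start, score_start) pairs: each window scores
    tokens [score_start, score_start + stride) with left context
    beginning at context_start."""
    pairs = []
    pos = 0
    while pos < n_tokens:
        # Keep the window within `context` tokens ending at pos + stride.
        start = max(0, pos + stride - context)
        pairs.append((start, pos))
        pos += stride
    return pairs

wins = sliding_windows(4096)
```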
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
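Orthogonal matrices can be drawn via QR decomposition of a Gaussian matrix. The sketch below applies an extra 1/sqrt(fan_in) factor to output projections as a muP-style scale; the record does not state the exact factor, so treat that choice as an assumption:

```python
import numpy as np

def ortho_init(fan_out, fan_in, mup_output=False, rng=None):
    """Orthogonal weight init via QR; optionally apply an assumed
    muP-style 1/sqrt(fan_in) scale for output projections."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((fan_out, fan_in))
    # QR of the taller orientation yields orthonormal columns.
    if fan_out < fan_in:
        q, _ = np.linalg.qr(a.T)
        w = q.T
    else:
        q, _ = np.linalg.qr(a)
        w = q
    if mup_output:
        w = w / np.sqrt(fan_in)  # assumed muP output scale
    return w

w = ortho_init(64, 64)
```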
LR Schedule
warmdown
parameters: {"warmdown_iters":2800,"warmup_steps":20}
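These parameters describe a trapezoidal schedule: 20 steps of linear warmup, a flat phase, then a linear warmdown to zero over the final 2800 iterations. A sketch of the LR multiplier; the 5000-iteration total in the usage line is hypothetical:

```python
def lr_multiplier(step, total_iters, warmup_steps=20, warmdown_iters=2800):
    """Trapezoidal LR schedule: linear warmup, flat middle,
    linear warmdown to 0 over the last warmdown_iters steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_iters - warmdown_iters:
        return max(0.0, (total_iters - step) / warmdown_iters)
    return 1.0

mults = [lr_multiplier(s, 5000) for s in (0, 19, 2000, 4999)]
```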
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Count-initialized exact bigram logit head initialized from corpus transition probabilities
- Int4 nibble packing for signed int4 values to halve bigram table storage
- XSA on the last 4 layers
- Partial RoPE on 16 of 64 dimensions
- Layerwise LN scaling by 1/sqrt(layer+1)