PR #477

closed

Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)

by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB

Training Techniques

Architecture
BigramLogitHead
A 1024x1024 count-initialized exact bigram lookup table whose entries are added as logit biases before the softcap.
parameters: {"size":"1024x1024"}
XSA
Exclusive Self Attention applied to the last 4 layers to remove the self-value component from the attention output.
parameters: {"layers":4}
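The PR does not spell out the XSA mechanism; a minimal sketch, assuming it works by zeroing each position's diagonal (self) attention weight after the causal softmax and renormalizing the row, for a single head:

```python
import numpy as np

def xsa(q, k, v):
    """Exclusive self-attention sketch: standard causal attention, but each
    position's own value is excluded from its output (assumed mechanism:
    zero the diagonal attention weight, then renormalize the row).
    q, k, v: (seq_len, head_dim) for one head."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    np.fill_diagonal(w, 0.0)                      # drop the self-value term
    denom = w.sum(axis=1, keepdims=True)
    w = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
    return w @ v                                  # position 0 outputs zeros
```

Under this reading, position 1 (which causally attends only to positions 0 and 1) ends up outputting exactly v[0] once its self term is removed.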
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
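With 16 of 64 head dimensions rotated, a sketch of partial RoPE for a single head (the half-split rotation layout and base 10000 are standard RoPE conventions, assumed here rather than stated in the PR):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    of each head; the remaining dims pass through unrotated.
    x: (seq_len, head_dim) for a single head."""
    T, d = x.shape
    rot, rest = x[:, :rot_dims], x[:, rot_dims:]
    half = rot_dims // 2
    freqs = 1.0 / base ** (np.arange(half) / half)
    angles = np.outer(np.arange(T), freqs)        # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)
```

Position 0 is left unchanged (rotation angle zero), and the 48 unrotated dimensions carry position-independent content.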
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
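With 8 query heads over 4 KV heads, each KV head is shared by 2 consecutive query heads. A sketch of the KV expansion step (head layout is an assumption; the PR only gives the counts):

```python
import numpy as np

def expand_kv(kv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: replicate each KV head so that groups of
    n_heads // n_kv_heads consecutive query heads share one KV head.
    kv: (n_kv_heads, seq_len, head_dim)."""
    return np.repeat(kv, n_heads // n_kv_heads, axis=0)
```

Only the expansion is sketched; at inference time the unexpanded cache is what halves KV memory.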
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"warmdown":2800,"warmup":20,"grad_clip":0.3}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
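SWA here averages 22 checkpoints taken every 50 steps from 40% of training onward. A minimal running-average sketch (plain Python dicts of float lists stand in for parameter tensors):

```python
def swa_update(avg, params, n_seen):
    """Incrementally fold one checkpoint into a running weight average.
    avg: current average (or None before the first checkpoint);
    params: the new checkpoint; n_seen: checkpoints already averaged."""
    if avg is None:
        return {k: list(v) for k, v in params.items()}
    for k, v in params.items():
        for i, x in enumerate(v):
            avg[k][i] += (x - avg[k][i]) / (n_seen + 1)
    return avg
```

The incremental form avoids holding all 22 checkpoints in memory at once.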
Quantization
mixed int5/int6
bits: 5/6 (mixed)
scope: MLP weights and attention weights
int4
bits: 4
scope: bigram logit table
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
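With stride 64, each evaluation window scores only its final 64 tokens so that every token keeps near-full left context. A sketch of the window schedule (window size 2048 matches train_length; the actual eval window is not stated in the PR):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval:
    each window covers tokens [start, end) but only tokens
    [score_from, end) contribute to the loss, so every token is scored
    exactly once with up to (window - stride) tokens of left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.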
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
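A minimal orthogonal-init sketch via QR decomposition. The exact muP output-projection factor is not stated in the PR, so the scale is left as a parameter (1/sqrt(fan_in) is one common width-aware choice):

```python
import numpy as np

def ortho_init(fan_out, fan_in, scale=1.0, rng=None):
    """Orthogonal weight init via QR decomposition; pass a reduced
    muP-style `scale` for output projections (exact factor assumed,
    not specified in the PR)."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, _ = np.linalg.qr(a)     # orthonormal columns
    if fan_out < fan_in:
        q = q.T                # rectangular case: orthonormal rows instead
    return scale * q
```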
Other
other
Count-initialized exact bigram logit head computed from corpus transition probabilities with additive smoothing and clipping.
parameters: {"smoothing_alpha":0.25,"clip_range":"[-4, 4]","tokens_used":16000000}
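A sketch of the count-initialized bigram table using the stated smoothing_alpha=0.25 and clip range [-4, 4]. Normalizing the log-probabilities against the uniform baseline (so the table acts as a zero-mean-ish bias) is an assumption; the PR only states smoothing and clipping:

```python
import numpy as np

def count_init_bigram(tokens, vocab_size, alpha=0.25, clip=4.0):
    """Build a bigram logit table from corpus transition counts.
    Entries are additively smoothed log-probabilities, shifted by the
    uniform baseline log(1/V) (assumed), clipped to [-clip, clip], and
    intended as additive biases on the model's output logits."""
    tokens = np.asarray(tokens)
    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)   # transition counts
    probs = (counts + alpha) / (counts.sum(axis=1, keepdims=True)
                                + alpha * vocab_size)
    logits = np.log(probs * vocab_size)                 # relative to uniform
    return np.clip(logits, -clip, clip)
```

Frequent transitions end up with positive biases and unseen ones with negative biases, bounded by the clip so the head cannot overwhelm the learned logits.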
other
Custom int4 nibble packing/unpacking for signed values to reduce storage of the bigram table.
parameters: {"values_per_byte":2}
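A sketch of signed int4 nibble packing at 2 values per byte, as stated. Putting the even-indexed value in the low nibble is an assumption; the PR does not specify nibble order:

```python
def pack_int4(values):
    """Pack a list of signed int4 values (range [-8, 7]) two per byte:
    low nibble holds the even-indexed value, high nibble the odd one
    (nibble order assumed)."""
    if len(values) % 2:
        values = values + [0]                     # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: sign-extend each nibble, return n values."""
    def sign(x):
        return x - 16 if x >= 8 else x            # sign-extend 4 -> int
    vals = []
    for b in data:
        vals.append(sign(b & 0xF))
        vals.append(sign(b >> 4))
    return vals[:n]
```

Halving the bigram table's storage this way is what makes the 1024x1024 head cheap enough for the 15.38 MB artifact budget.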

Novel Contributions

  • Count-initialized exact bigram logit head derived from corpus transition probabilities
  • Custom int4 nibble packing for the bigram logit table
  • Combination of count-init bigram head with XSA, Partial RoPE, and LN Scale