PR #477

closed

Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)

by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB

Training Techniques

Architecture
BigramLogitHead
A 1024x1024 count-initialized exact bigram lookup table whose entries are added as logit biases before the softcap.
parameters: {"size":"1024x1024"}
XSA
Exclusive Self Attention applied to the last 4 layers to remove the self-value component from the attention output.
parameters: {"layers":4}
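The PR does not spell out the XSA mechanism; a minimal sketch, assuming it works by zeroing each position's diagonal (self) attention weight after the causal softmax and renormalizing the row, for a single head:

```python
import numpy as np

def xsa(q, k, v):
    """Exclusive self-attention sketch: standard causal attention, but each
    position's own value is excluded from its output (assumed mechanism:
    zero the diagonal attention weight, then renormalize the row).
    q, k, v: (seq_len, head_dim) for one head."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    np.fill_diagonal(w, 0.0)                      # drop the self-value term
    denom = w.sum(axis=1, keepdims=True)
    w = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
    return w @ v                                  # position 0 outputs zeros
```

Under this reading, position 1 (which causally attends only to positions 0 and 1) ends up outputting exactly v[0] once its self term is removed.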
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
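With 16 of 64 head dimensions rotated, a sketch of partial RoPE for a single head (the half-split rotation layout and base 10000 are standard RoPE conventions, assumed here rather than stated in the PR):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    of each head; the remaining dims pass through unrotated.
    x: (seq_len, head_dim) for a single head."""
    T, d = x.shape
    rot, rest = x[:, :rot_dims], x[:, rot_dims:]
    half = rot_dims // 2
    freqs = 1.0 / base ** (np.arange(half) / half)
    angles = np.outer(np.arange(T), freqs)        # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)
```

Position 0 is left unchanged (rotation angle zero), and the 48 unrotated dimensions carry position-independent content.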
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
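With 8 query heads over 4 KV heads, each KV head is shared by 2 consecutive query heads. A sketch of the KV expansion step (head layout is an assumption; the PR only gives the counts):

```python
import numpy as np

def expand_kv(kv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: replicate each KV head so that groups of
    n_heads // n_kv_heads consecutive query heads share one KV head.
    kv: (n_kv_heads, seq_len, head_dim)."""
    return np.repeat(kv, n_heads // n_kv_heads, axis=0)
```

Only the expansion is sketched; at inference time the unexpanded cache is what halves KV memory.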
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"warmdown":2800,"warmup":20,"grad_clip":0.3}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
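SWA here averages 22 checkpoints taken every 50 steps from 40% of training onward. A minimal running-average sketch (plain Python dicts of float lists stand in for parameter tensors):

```python
def swa_update(avg, params, n_seen):
    """Incrementally fold one checkpoint into a running weight average.
    avg: current average (or None before the first checkpoint);
    params: the new checkpoint; n_seen: checkpoints already averaged."""
    if avg is None:
        return {k: list(v) for k, v in params.items()}
    for k, v in params.items():
        for i, x in enumerate(v):
            avg[k][i] += (x - avg[k][i]) / (n_seen + 1)
    return avg
```

The incremental form avoids holding all 22 checkpoints in memory at once.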
Quantization
mixed int5/int6
bits: 5/6 (mixed)
scope: MLP weights and attention weights
int4
bits: 4
scope: bigram logit table
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
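With stride 64, each evaluation window scores only its final 64 tokens so that every token keeps near-full left context. A sketch of the window schedule (window size 2048 matches train_length; the actual eval window is not stated in the PR):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval:
    each window covers tokens [start, end) but only tokens
    [score_from, end) contribute to the loss, so every token is scored
    exactly once with up to (window - stride) tokens of left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.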
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
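A minimal orthogonal-init sketch via QR decomposition. The exact muP output-projection factor is not stated in the PR, so the scale is left as a parameter (1/sqrt(fan_in) is one common width-aware choice):

```python
import numpy as np

def ortho_init(fan_out, fan_in, scale=1.0, rng=None):
    """Orthogonal weight init via QR decomposition; pass a reduced
    muP-style `scale` for output projections (exact factor assumed,
    not specified in the PR)."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, _ = np.linalg.qr(a)     # orthonormal columns
    if fan_out < fan_in:
        q = q.T                # rectangular case: orthonormal rows instead
    return scale * q
```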
Other
other
Count-initialized exact bigram logit head computed from corpus transition probabilities with additive smoothing and clipping.
parameters: {"smoothing_alpha":0.25,"clip_range":"[-4, 4]","tokens_used":16000000}
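A sketch of the count-initialized bigram table using the stated smoothing_alpha=0.25 and clip range [-4, 4]. Normalizing the log-probabilities against the uniform baseline (so the table acts as a zero-mean-ish bias) is an assumption; the PR only states smoothing and clipping:

```python
import numpy as np

def count_init_bigram(tokens, vocab_size, alpha=0.25, clip=4.0):
    """Build a bigram logit table from corpus transition counts.
    Entries are additively smoothed log-probabilities, shifted by the
    uniform baseline log(1/V) (assumed), clipped to [-clip, clip], and
    intended as additive biases on the model's output logits."""
    tokens = np.asarray(tokens)
    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)   # transition counts
    probs = (counts + alpha) / (counts.sum(axis=1, keepdims=True)
                                + alpha * vocab_size)
    logits = np.log(probs * vocab_size)                 # relative to uniform
    return np.clip(logits, -clip, clip)
```

Frequent transitions end up with positive biases and unseen ones with negative biases, bounded by the clip so the head cannot overwhelm the learned logits.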
other
Custom int4 nibble packing/unpacking for signed values to reduce storage of the bigram table.
parameters: {"values_per_byte":2}
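A sketch of signed int4 nibble packing at 2 values per byte, as stated. Putting the even-indexed value in the low nibble is an assumption; the PR does not specify nibble order:

```python
def pack_int4(values):
    """Pack a list of signed int4 values (range [-8, 7]) two per byte:
    low nibble holds the even-indexed value, high nibble the odd one
    (nibble order assumed)."""
    if len(values) % 2:
        values = values + [0]                     # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: sign-extend each nibble, return n values."""
    def sign(x):
        return x - 16 if x >= 8 else x            # sign-extend 4 -> int
    vals = []
    for b in data:
        vals.append(sign(b & 0xF))
        vals.append(sign(b >> 4))
    return vals[:n]
```

Halving the bigram table's storage this way is what makes the 1024x1024 head cheap enough for the 15.38 MB artifact budget.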

Novel Contributions

  • Count-initialized exact bigram logit head derived from corpus transition probabilities
  • Custom int4 nibble packing for the bigram logit table
  • Combination of count-init bigram head with XSA, Partial RoPE, and LN Scale