PR #485
openRecord: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)
by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB
Training Techniques
Architecture
BigramLogitHead
Count-initialized exact bigram lookup table used as a logit bias head before softcap, initialized from corpus transition probabilities.
parameters: {"vocab_size":1024,"clipping":"[-4, 4]","smoothing_alpha":0.25}
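A count-initialized table like this can be built directly from the corpus token stream. The sketch below is a minimal reconstruction from the stated parameters (alpha=0.25 smoothing, clipping to [-4, 4]); the PR's exact head wiring (how the bias enters before softcap) is not shown, and the function name is hypothetical.

```python
import numpy as np

def build_bigram_logit_table(tokens, vocab_size=1024, alpha=0.25, clip=4.0):
    """Count token transitions, smooth with alpha, and return clipped
    log-probabilities to use as a per-previous-token logit bias."""
    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    # Accumulate exact bigram counts from the corpus token stream.
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Additive smoothing with alpha, then row-normalize into probabilities.
    smoothed = counts + alpha
    probs = smoothed / smoothed.sum(axis=1, keepdims=True)
    # Clipped log-probs serve as the logit bias: logits += table[prev_token].
    return np.clip(np.log(probs), -clip, clip)

tokens = np.array([1, 2, 1, 2, 3])
table = build_bigram_logit_table(tokens, vocab_size=8)
```

At inference the previous token indexes a row of the table, which is added to the model's logits before the softcap.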
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
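With 16 of 64 head dimensions rotated, the remaining 48 pass through without positional signal. A minimal sketch, assuming the standard RoPE pairing and base frequency (the PR's exact dimension layout may differ):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims of each
    head's dimensions; leave the rest untouched. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # Standard RoPE inverse frequencies for the rotated sub-dimensions.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Pass-through dimensions carry no positional information.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.randn(32, 64)
q_rot = partial_rope(q)
```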
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
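With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal single-sequence sketch (no causal mask or batching; shapes and helper name are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (n_heads, T, d); k and v are
    (n_kv_heads, T, d) and are shared across query-head groups."""
    group = n_heads // n_kv_heads
    # Repeat each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = np.random.randn(8, 16, 8)
k = np.random.randn(4, 16, 8)
v = np.random.randn(4, 16, 8)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves KV-cache size while keeping the full query-head count.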
MLP3x
MLP hidden size expanded to 3x the model dimension.
parameters: {"hidden_size":1536}
Quantization
int4
bits: 4
scope: bigram logit table
int5
bits: 5
scope: MLP weights
int6
bits: 6
scope: attention weights
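The int4 scope (the bigram logit table) pairs with the nibble packing listed under Novel Contributions: two signed int4 values per byte halves storage. A sketch assuming a low-nibble/high-nibble layout (the PR's actual byte layout is not stated):

```python
import numpy as np

def pack_int4(values):
    """Pack signed int4 values (range [-8, 7]) two per byte:
    even indices in the low nibble, odd indices in the high nibble."""
    assert len(values) % 2 == 0
    u = (np.asarray(values, dtype=np.int8) & 0x0F).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: split nibbles and sign-extend."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(2 * len(packed), dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    # Nibbles >= 8 encode negative values in two's complement.
    out[out >= 8] -= 16
    return out

vals = np.array([-4, 3, 0, -1], dtype=np.int8)
packed = pack_int4(vals)
round_trip = unpack_int4(packed)
```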
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
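The stated 1/sqrt(layer+1) rule damps the gain of deeper layers' norms. A one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale(layer_idx):
    """Per-layer LayerNorm gain: 1/sqrt(layer+1), so deeper layers
    contribute smaller residual updates (layers assumed 0-indexed)."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(4)]
```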
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
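From these parameters, averaging begins at 40% of training and takes a checkpoint every 50 steps. A sketch of the schedule plus the running-mean update; the 1750-iteration total below is hypothetical, chosen only so the schedule yields the 22 checkpoints listed:

```python
def swa_schedule(total_iters, start_frac=0.4, every=50):
    """Iterations at which a checkpoint joins the weight average."""
    start = int(total_iters * start_frac)
    return list(range(start, total_iters + 1, every))

def swa_update(avg, w, n):
    """Fold checkpoint n (1-indexed) into the running average:
    avg <- avg + (w - avg) / n."""
    return [a + (x - a) / n for a, x in zip(avg, w)]

iters = swa_schedule(1750)  # hypothetical run length
```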
Evaluation
sliding window eval
parameters: {"stride":64}
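A stride-64 sliding evaluation scores only the newest 64 tokens of each window while re-encoding up to a full context of left tokens, so every scored token gets near-maximal context. A sketch of the window indexing only, assuming the 2048-token train length as the eval context:

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Return (context_start, score_start) pairs: each window scores
    tokens [score_start, score_start + stride) with left context
    beginning at context_start."""
    pairs = []
    pos = 0
    while pos < n_tokens:
        # Keep the window within `context` tokens ending at pos + stride.
        start = max(0, pos + stride - context)
        pairs.append((start, pos))
        pos += stride
    return pairs

wins = sliding_windows(4096)
```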
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
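Orthogonal matrices can be drawn via QR decomposition of a Gaussian matrix. The sketch below applies an extra 1/sqrt(fan_in) factor to output projections as a muP-style scale; the record does not state the exact factor, so treat that choice as an assumption:

```python
import numpy as np

def ortho_init(fan_out, fan_in, mup_output=False, rng=None):
    """Orthogonal weight init via QR; optionally apply an assumed
    muP-style 1/sqrt(fan_in) scale for output projections."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((fan_out, fan_in))
    # QR of the taller orientation yields orthonormal columns.
    if fan_out < fan_in:
        q, _ = np.linalg.qr(a.T)
        w = q.T
    else:
        q, _ = np.linalg.qr(a)
        w = q
    if mup_output:
        w = w / np.sqrt(fan_in)  # assumed muP output scale
    return w

w = ortho_init(64, 64)
```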
LR Schedule
warmdown
parameters: {"warmdown_iters":2800,"warmup_steps":20}
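These parameters describe a trapezoidal schedule: 20 steps of linear warmup, a flat phase, then a linear warmdown to zero over the final 2800 iterations. A sketch of the LR multiplier; the 5000-iteration total in the usage line is hypothetical:

```python
def lr_multiplier(step, total_iters, warmup_steps=20, warmdown_iters=2800):
    """Trapezoidal LR schedule: linear warmup, flat middle,
    linear warmdown to 0 over the last warmdown_iters steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_iters - warmdown_iters:
        return max(0.0, (total_iters - step) / warmdown_iters)
    return 1.0

mults = [lr_multiplier(s, 5000) for s in (0, 19, 2000, 4999)]
```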
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Count-initialized exact bigram logit head initialized from corpus transition probabilities
- Int4 nibble packing for signed int4 values to halve bigram table storage
- XSA on the last 4 layers
- Partial RoPE on 16 of 64 dimensions
- Layerwise LN scaling by 1/sqrt(layer+1)