PR #482
Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)
Status: closed
by harsha-gouru
val_bpb: 1.1522
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.38 MB
Training Techniques
Architecture
BigramLogitHead
Exact bigram logit lookup table initialized from corpus counts, giving the model a strong Markov prior before any training; applied before the logit softcap.
parameters: {"size":"1024x1024"}
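A minimal sketch of how such a table can be count-initialized (the 1024x1024 size comes from the parameters above; the add-k smoothing value is an assumption, not taken from the PR):

```python
import numpy as np

def count_init_bigram_logits(tokens, vocab_size=1024, smoothing=1.0):
    """Build a bigram logit table from corpus transition counts.

    The table acts as an additive Markov prior on next-token logits
    (applied before the logit softcap). `smoothing` is add-k smoothing;
    its value here is a guess.
    """
    counts = np.full((vocab_size, vocab_size), smoothing, dtype=np.float64)
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Normalize rows to transition probabilities, then take logs.
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(probs)

# Usage: logits = hidden_logits + bigram_table[prev_token]
```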
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
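The PR does not spell out the XSA mechanism; one plausible reading of "exclusive" self attention is a strictly causal mask in which a token attends only to earlier positions, never to itself. A hypothetical mask under that assumption:

```python
import numpy as np

def xsa_mask(seq_len):
    """Strictly lower-triangular attention mask: position i may attend
    to j < i but not to itself. Whether this is what XSA means here is
    an assumption; the PR only names the technique. Note position 0
    attends to nothing, so a real implementation needs a fallback
    (e.g. an attention sink or a zero value).
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-1)
```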
Partial RoPE
Rotary position embeddings applied to only a subset of the head dimensions (16 of 64); the remaining dimensions receive no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
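A sketch of partial RoPE under the parameters above, rotating only the first 16 of 64 head dimensions (the first-half/second-half pairing convention and the base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head dim.

    x: (..., seq, head_dim). Dims [0:rope_dims] are rotated in pairs
    (i, i + rope_dims // 2); the remaining dims pass through unchanged.
    """
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```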
Quantization
int4
bits: 4
scope: bigram logit table
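A sketch of symmetric int4 quantization for the bigram logit table; the per-row scale granularity is an assumption (the PR only states the table is stored in int4):

```python
import numpy as np

def quantize_int4(table):
    """Symmetric int4 quantization with one fp scale per row.

    Values are mapped into [-8, 7]; with symmetric scaling only
    [-7, 7] is actually used.
    """
    scale = np.abs(table).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # guard all-zero rows
    q = np.clip(np.round(table / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale
```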
Regularization
layerwise LN scale
Per-layer scaling of LayerNorm outputs by 1/sqrt(layer+1).
parameters: {"scale":"1/sqrt(layer+1)"}
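A minimal sketch of the layerwise LN scale; 0-indexed layers are assumed, and exactly where the scale multiplies (norm output vs. residual branch) is not specified in the PR:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def scaled_ln(x, layer):
    """LayerNorm output scaled by 1/sqrt(layer+1), so deeper layers
    contribute progressively less to the residual stream."""
    return layer_norm(x) / np.sqrt(layer + 1)
```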
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
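Muon's core step approximately orthogonalizes each 2-D weight update (the momentum-averaged gradient) with a quintic Newton-Schulz iteration before applying it; a minimal numpy sketch of that step, using the coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix via the quintic
    Newton-Schulz iteration used by Muon. In the optimizer this runs
    on the momentum buffer of each weight matrix, and the result is
    used as the update direction."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize first
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T  # iterate on the short-fat orientation
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x
```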
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":22}
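The SWA entry above (start at 40% of training, fold in a checkpoint every 50 steps, 22 checkpoints total) can be sketched as a streaming running mean; the dict-of-arrays weight format is an illustration, not the PR's actual code:

```python
import numpy as np

class SWA:
    """Streaming weight average: from `start_step` on, fold the current
    weights into a running mean every `every` steps."""
    def __init__(self, start_step, every=50):
        self.start_step, self.every = start_step, every
        self.avg, self.n = None, 0

    def update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in weights.items()}
        else:
            for k, v in weights.items():
                # incremental mean: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n
```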
Evaluation
sliding window eval
parameters: {"stride":64}
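A sketch of sliding-window evaluation with stride 64: each window supplies up to a full context of left tokens, but only the newly covered targets are scored. The `score_fn` callable is a stand-in for the model, not part of the PR:

```python
import numpy as np

def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    """Average NLL over `tokens`, re-scoring each position with up to
    `window` tokens of left context. `score_fn(ctx)` returns one NLL
    per target position (len(ctx) - 1 values). The first window counts
    all its targets; later windows count only the last `stride`."""
    total, count = 0.0, 0
    for i, end in enumerate(range(window, len(tokens) + 1, stride)):
        nll = np.asarray(score_fn(tokens[end - window:end]))
        take = nll if i == 0 else nll[-stride:]
        total += take.sum()
        count += len(take)
    return total / count
```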
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":2800,"warmup_steps":20}
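The schedule above (20 warmup steps, 2800 warmdown steps) is typically a trapezoid: linear warmup, a constant plateau, then a linear decay to zero. That flat-then-linear shape and the 5000-step total in the test are assumptions; the PR only gives the two step counts:

```python
def warmdown_lr(step, total_steps, warmup_steps=20, warmdown_steps=2800):
    """LR multiplier: linear warmup over `warmup_steps`, constant at
    1.0, then linear decay to 0 over the final `warmdown_steps`."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```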
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
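A sketch of orthogonal initialization via QR, with an extra muP-style scale on output projections; the exact muP factor (1/sqrt(fan_in)) is an assumption:

```python
import numpy as np

def ortho_init(rng, fan_out, fan_in, is_output_proj=False, scale=1.0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q.
    Output projections get an additional 1/sqrt(fan_in) scale
    (muP-style; the precise factor is a guess)."""
    a = rng.normal(size=(fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity
    w = q if fan_out >= fan_in else q.T
    if is_output_proj:
        scale = scale / np.sqrt(fan_in)
    return scale * w
```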
Compression
zstd
level: 22
Other
other
Int4 nibble packing/unpacking for signed values, storing two int4 values per byte to reduce bigram table size.
parameters: {"packing":"two values per byte"}
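The packing described above can be sketched as follows; the nibble order (even indices in the low nibble) is a guess, since the PR only states two values per byte:

```python
import numpy as np

def pack_int4(values):
    """Pack signed int4 values (range [-8, 7]) two per byte:
    even indices in the low nibble, odd indices in the high nibble."""
    u = (np.asarray(values, dtype=np.int8) & 0x0F).astype(np.uint8)
    if u.size % 2:
        u = np.concatenate([u, np.zeros(1, dtype=np.uint8)])  # pad
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, n):
    """Invert pack_int4, recovering the first `n` signed values."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    out[out >= 8] -= 16  # sign-extend: nibbles >= 8 are negative
    return out[:n]
```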
Novel Contributions
- Count-initialized exact bigram logit head using corpus bigram transition probabilities as a strong Markov prior
- Int4 nibble packing for the bigram logit table to halve storage cost
- XSA on the last 4 layers
- Partial RoPE on 16 of 64 dimensions
- Layer-wise LN scaling by 1/sqrt(layer+1)