PR #1917

open

Record: SP4096 5L MLP6 BigramHash XSA5 — val_bpb 1.1636

by Blitzo125
val_bpb: 1.1636
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,753,699 bytes

Training Techniques

Architecture
BigramHash
Adds bigram hash features to token embeddings.
parameters: {"dim":256,"vocab_size":4096}
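A minimal sketch of the idea: each position looks up an extra embedding keyed by a hash of the (previous token, current token) pair, and that vector is added to the usual token embedding. The hash function, bucket count, and padding id below are illustrative assumptions, not taken from the PR.

```python
import numpy as np

def bigram_hash_features(tokens, table, vocab_size=4096):
    """Look up a hashed-bigram embedding for each position.

    `table` is a (num_buckets, dim) array of learnable bigram embeddings;
    the returned features would be added to the token embeddings.
    """
    num_buckets, dim = table.shape
    feats = np.zeros((len(tokens), dim))
    prev = 0  # assumed padding id for the first position
    for i, tok in enumerate(tokens):
        h = (prev * vocab_size + tok) % num_buckets  # simple bigram hash
        feats[i] = table[h]
        prev = tok
    return feats

rng = np.random.default_rng(0)
table = rng.normal(size=(1 << 16, 256))       # bucket count is illustrative
feats = bigram_hash_features([5, 9, 5, 9], table)
```

Positions that see the same bigram (here, token 9 following token 5 at positions 1 and 3) receive the same extra feature vector, which is what lets the table act as a cheap n-gram memory.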
XSA
Applies cross-sequence attention on all layers.
parameters: {"layers":5}
weight tying
Tied embeddings are used.
parameters: null
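Weight tying amounts to reusing one (vocab, dim) embedding table for both the input lookup and the output projection, halving the embedding parameter count. A minimal sketch:

```python
import numpy as np

class TiedLM:
    """Sketch of weight tying: the output projection reuses the
    token-embedding matrix, so the model stores one (vocab, dim) table."""

    def __init__(self, vocab_size=4096, dim=512, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab_size, dim)) * dim ** -0.5

    def logits(self, hidden):            # hidden: (seq, dim)
        return hidden @ self.embed.T     # same table, transposed

m = TiedLM()
out = m.logits(np.ones((3, 512)))
```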
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dims":32}
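With partial RoPE, only the first 32 channels of each head are rotated; the rest pass through unchanged. The sketch below uses the half-split pairing and base 10000, which are common conventions but assumptions here; only `dims: 32` comes from the record.

```python
import numpy as np

def partial_rope(x, rot_dims=32, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` channels
    of `x` (seq, dim) and leave the remaining channels untouched."""
    seq, dim = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                    # (seq, 1)
    inv_freq = base ** (-np.arange(half) / half)     # (half,)
    ang = pos * inv_freq                             # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

y = partial_rope(np.ones((4, 64)))
```

Position 0 is a fixed point of the rotation (angle zero), and channels past 32 are identical to the input, which is easy to check.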
KV head count
Uses grouped key/value heads.
parameters: {"num_heads":8,"num_kv_heads":4}
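With 8 query heads over 4 KV heads, each KV head serves 8 // 4 = 2 query heads, shrinking the KV cache by half. A minimal grouped-KV attention sketch (single sequence, no masking, shapes illustrative):

```python
import numpy as np

def grouped_kv_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: query head h attends with KV head
    h // (num_heads // num_kv_heads), so KV heads are shared."""
    group = num_heads // num_kv_heads
    head_dim = q.shape[2]
    out = np.empty_like(q)                      # (num_heads, seq, head_dim)
    for h in range(num_heads):
        kv = h // group                         # map query head -> KV head
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)           # row-wise softmax
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))
k = rng.normal(size=(4, 5, 64))
v = rng.normal(size=(4, 5, 64))
o = grouped_kv_attention(q, k, v)
```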
MLP6
Uses a 6x MLP expansion in a 5-layer model.
parameters: {"layers":5,"mlp_mult":6,"model_dim":512}
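The MLP parameter budget implied by these numbers is easy to work out, assuming a plain two-matrix MLP (dim → 6·dim → dim; a gated variant would differ):

```python
def mlp_params(model_dim=512, mlp_mult=6, layers=5):
    """Total MLP parameters for the record's shape: each block's MLP is
    an up-projection (dim x 6*dim) plus a down-projection (6*dim x dim)."""
    per_layer = 2 * model_dim * model_dim * mlp_mult
    return layers * per_layer

total = mlp_params()   # 5 * 2 * 512 * 3072
```

That gives 15,728,640 MLP weights across the 5 layers, consistent with an artifact of ~15.75 MB of int8 weights once embeddings and attention are added.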
Quantization
int8
bits: 8
scope: all
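The record specifies int8 with scope "all" but not the scaling scheme; symmetric per-tensor absmax quantization is one common choice and is sketched below.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale by the tensor's
    absolute maximum so values map into [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.0], [0.25, 1.0]], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Round-trip error is bounded by half a quantization step (s / 2) per element.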
Compression
brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":250,"warmdown_iters":1400}
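A warmup/warmdown (trapezoidal) schedule with these counts can be sketched as below; the peak LR and total step count are illustrative assumptions, as the record only gives the two interval lengths.

```python
def lr_schedule(step, max_lr=1.0, warmup_steps=250, warmdown_iters=1400,
                total_steps=1650):
    """Linear warmup for `warmup_steps`, hold at `max_lr`, then linear
    warmdown to zero over the final `warmdown_iters` steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return max_lr * steps_left / warmdown_iters
    return max_lr
```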
Regularization
logit softcap
parameters: {"value":30}
Initialization
kaiming init
Kaiming initialization is required for BigramHash embeddings.
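Kaiming (He) normal initialization draws weights with std = sqrt(2 / fan_in); fan-in mode with the ReLU gain is the standard form and is assumed below.

```python
import numpy as np

def kaiming_normal(fan_in, fan_out, seed=0):
    """Kaiming (He) normal init: std = sqrt(2 / fan_in), which keeps
    activation variance stable through ReLU-family nonlinearities."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

w = kaiming_normal(256, 256)
```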

Novel Contributions

  • 4096-vocabulary SentencePiece tokenizer
  • Wider 5-layer model with 6x MLP expansion
  • BigramHash embeddings with kaiming initialization
  • Cross-sequence attention on all 5 layers
  • Brotli-11 compression for int8 weights