val_bpb: 1.1636
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,753,699 bytes
Training Techniques
Architecture
BigramHash
Adds bigram hash features to token embeddings.
parameters: {"dimensions":256,"vocab_size":4096}
XSA
Cross-sequence attention applied to all layers.
parameters: {"layers":5}
weight tying
The input embedding matrix and the output (unembedding) projection share one weight matrix.
parameters: null
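A minimal sketch of the standard weight-tying scheme, assuming the usual setup where the unembedding reuses the input embedding matrix. The dimensions (4096 vocabulary, 512 model width) are taken from the other entries; the class itself is illustrative.

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, vocab_size=4096, model_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.lm_head = nn.Linear(model_dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one shared parameter tensor

    def forward(self, hidden):       # hidden: (batch, seq, model_dim)
        return self.lm_head(hidden)  # logits over the 4096-token vocabulary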
KV head count
Grouped-query attention: groups of query heads share a smaller set of key/value heads.
parameters: {"num_heads":8,"num_kv_heads":4}
Partial RoPE
Applies RoPE to only part of the head dimension.
parameters: {"dims":32}
MLP6
Uses 6x MLP expansion in a 5-layer model.
parameters: {"layers":5,"mlp_mult":6,"model_dim":512}
Quantization
int8
bits: 8
scope: all
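A hedged sketch of post-training int8 quantization over all weights, using simple symmetric per-tensor scaling. The actual scaling scheme (per-tensor vs. per-channel, and how scales are stored in the 15,753,699-byte artifact) is not stated in the source.

```python
import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(512, 512)
q, s = quantize_int8(w)
print((dequantize_int8(q, s) - w).abs().max())  # worst-case quantization error
```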
Compression
brotli
level: 11
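A small sketch of packing the int8 weight bytes with Brotli at quality 11, the maximum level listed above. The serialization layout of the real artifact is unknown; this only shows the round trip. (Random bytes compress poorly; trained weights typically compress better.)

```python
import brotli
import numpy as np

weights_int8 = np.random.randint(-127, 128, size=(512, 512), dtype=np.int8)
raw = weights_int8.tobytes()
packed = brotli.compress(raw, quality=11)
print(len(raw), "->", len(packed), "bytes")

restored = np.frombuffer(brotli.decompress(packed), dtype=np.int8).reshape(512, 512)
assert (restored == weights_int8).all()
```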
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":250,"warmdown_iters":1400}
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.055
momentum: null
other_params: {"beta2":0.98,"matrix_lr":0.04,"scalar_lr":0.03,"tied_embed_lr":0.03}
Novel Contributions
- 4096-entry SentencePiece tokenizer for more efficient tokenization
- 5-layer architecture with a wide 6x MLP expansion, tuned for a short training budget
- BigramHash embeddings with Kaiming initialization
- Cross-sequence attention applied to all 5 layers
- Brotli (quality 11) compression of int8-quantized weights