val_bpb: 1.2012
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14.3 MB
Training Techniques

Quantization: STE QAT
- bits: 6
- scope: all
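A minimal sketch of how int6 quantization-aware training with a straight-through estimator (STE) is commonly implemented in PyTorch; the function name and per-tensor scaling scheme are illustrative assumptions, not the submission's actual code.

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization. The forward pass uses the
    # int6 grid; the backward pass treats quantization as the identity
    # (straight-through estimator).
    qmax = 2 ** (bits - 1) - 1                    # 31 for 6-bit signed
    scale = w.abs().max().clamp(min=1e-8) / qmax  # map |w|max onto qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Detach the rounding error so gradients flow to w unchanged.
    return w + (w_q - w).detach()
```

With scope set to all, every weight tensor would pass through such a function on each training forward; at export, the rounded integer values are what get stored (and later zstd-compressed).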
Architecture: untied embeddings
Uses separate input and output embedding matrices instead of weight tying.
- parameters: {"tie_embeddings": 0}
Optimizer: Muon (NorMuon variant)
- weight_decay: null
- momentum: null
- tuned_learning_rates: {"input_embeddings": 0.6, "output_head": 0.008}
Compression: zstd
- level: 22
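A minimal sketch of the artifact-compression step using the zstandard Python bindings; the file names are placeholders, not paths from the record.

```python
import zstandard as zstd

def compress_artifact(src: str, dst: str) -> None:
    # Level 22 is zstd's maximum: slowest to compress, smallest output.
    # Int6-quantized weights compress well because each value spans a
    # narrow range of bit patterns.
    cctx = zstd.ZstdCompressor(level=22)
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        f_out.write(cctx.compress(f_in.read()))

compress_artifact("model_int6.bin", "model_int6.bin.zst")  # placeholder names
```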
Sequence Length
- train_length: 1024
- eval_length: null
Other
Uses a larger SP4096 SentencePiece BPE tokenizer trained on FineWeb to lower tokens per byte and improve text compression.
- parameters: {"vocab_size": 4096, "tokens_per_byte": 0.306}
Novel Contributions
- SP4096 tokenizer with improved text compression over sp1024
- Int6 STE QAT combined with zstd-22 artifact compression
- NorMuon optimizer with tuned learning rates
- Untied embeddings to improve BPB