PR #78

open

Record: 8192 Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb

by mtybadger
val_bpb: 1.1858
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14,796,836 bytes

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: weights int6, embeddings int8
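The selective scheme (int6 weight matrices, int8 embeddings) can be illustrated with plain symmetric per-tensor quantization. This is a generic sketch, not the PR's actual code; the scaling and rounding details are assumptions:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization to signed `bits`-bit integers,
    stored in an int8 container (int6 values occupy [-32, 31])."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0                       # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Selective assignment per the PR: int6 for weights, int8 for embeddings.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)
embeddings = rng.standard_normal((8192, 256)).astype(np.float32)

qw, sw = quantize_symmetric(weights, bits=6)
qe, se = quantize_symmetric(embeddings, bits=8)
```

Keeping embeddings at int8 is the usual motivation for a split like this: embedding rows are looked up directly rather than averaged over many multiply-accumulates, so they tolerate less quantization noise than weight matrices.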
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: null
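NorMuon builds on Muon's orthogonalized update. Below is a minimal sketch of the Newton-Schulz orthogonalization at Muon's core (coefficients from the widely used modded-nanogpt implementation), plus a simple row-wise RMS rescaling as an illustrative stand-in for NorMuon's neuron-wise normalization. The PR lists all optimizer hyperparameters as null, so everything here is an assumption, not the submission's configuration:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix via the quintic
    Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius norm 1 bounds the spectrum
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x

def per_neuron_normalize(update: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Illustrative stand-in for NorMuon's neuron-wise step-size control:
    rescale each output row to unit RMS (an assumption, not the published
    algorithm)."""
    rms = np.sqrt((update ** 2).mean(axis=1, keepdims=True)) + eps
    return update / rms
```

The orthogonalization pushes all singular values of the update toward 1, so every direction in the layer gets a similar step size regardless of the raw gradient's conditioning.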
Architecture
vocab size
Increased the tokenizer/model vocabulary from 1024 to 8192, which required reducing the layer count to 8 to stay within the size constraints.
parameters: {"vocab_size":8192,"layers":8}
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Compression
zlib
level: null
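The artifact is zlib-compressed at an unspecified level. Int6 values stored in int8 containers use only 64 of 256 byte values, so even a generic byte-level compressor recovers some of the wasted container bits. A sketch of the round trip (`Z_BEST_COMPRESSION` is an example choice, not the PR's setting):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Simulated int6 payload: values in [-32, 31] held in int8 bytes.
q = rng.integers(-32, 32, size=100_000, dtype=np.int8)

raw = q.tobytes()
packed = zlib.compress(raw, level=zlib.Z_BEST_COMPRESSION)
restored = zlib.decompress(packed)
```

Even on this worst-case uniform payload, zlib approaches the 6-bits-in-8 entropy bound (~75% of the raw size); real weight distributions are far from uniform, so actual savings should be larger.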

Novel Contributions

  • Expanded vocabulary size from 1024 to 8192 using a newly trained tokenizer
  • Replaced Muon with NorMuon optimizer
  • Applied selective quantization with int6 weights and int8 embeddings
  • Reduced model depth to accommodate the larger vocabulary