PR #1723

open

Add Nairi submission: 9L 512D vocab1024

val_bpb

0.5116

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

weight tying

Tied input and output embeddings.

parameters: null
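
A rough sketch of what tying saves, assuming "512D vocab1024" in the PR title means a model width of 512 and a model vocabulary of 1024. The names embedding_param_counts and TiedEmbedding are illustrative and not taken from the submission.

```python
# Assumed dimensions, read off the PR title ("512D vocab1024").
VOCAB_SIZE = 1024
MODEL_DIM = 512


def embedding_param_counts(vocab_size, model_dim):
    """Return (untied, tied) parameter counts for input embedding plus output head."""
    table = vocab_size * model_dim
    return 2 * table, table


class TiedEmbedding:
    """Toy tied embedding: one table serves both token lookup and output logits."""

    def __init__(self, table):
        self.table = table  # vocab_size rows, each of length model_dim

    def embed(self, token_id):
        return self.table[token_id]

    def logits(self, hidden):
        # The output projection reuses the embedding rows (one dot product per token).
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


untied, tied = embedding_param_counts(VOCAB_SIZE, MODEL_DIM)
print(untied - tied)  # 524288 parameters saved by tying
```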

U-Net skip connections

Encoder-decoder style skip connections to improve gradient flow.

parameters: null
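
The listing does not say how the skips are wired. One common arrangement, assumed here, treats the first half of the stack as the encoder and the remainder as the decoder, with each decoder layer consuming the most recent unused encoder output (last in, first out). The function name unet_skip_pairs is hypothetical. The layer count of 9 comes from "9L" in the PR title.

```python
def unet_skip_pairs(num_layers):
    """Return (decoder_layer, encoder_layer) index pairs for LIFO skip connections.

    Assumption: the first num_layers // 2 layers act as the encoder and the
    rest as the decoder. The submission may split the stack differently.
    """
    num_encoder = num_layers // 2
    num_decoder = num_layers - num_encoder
    stack = list(range(num_encoder))  # encoder outputs, in the order produced
    pairs = []
    for d in range(num_decoder):
        decoder_index = num_encoder + d
        if stack:  # extra decoder layers receive no skip input
            pairs.append((decoder_index, stack.pop()))
    return pairs


print(unet_skip_pairs(9))  # [(4, 3), (5, 2), (6, 1), (7, 0)]
```

Under this assumption the ninth layer (index 8) has no encoder partner and runs without a skip input.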

KV head count

Uses grouped-query attention: 8 query heads share 4 KV heads, so each KV head serves 2 query heads.

parameters: {"heads":8,"kv_heads":4}
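
A small sketch of the head grouping implied by heads=8 and kv_heads=4. It assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on) and a model width of 512 taken from the PR title. Both function names are illustrative.

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it reads from (contiguous groups)."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size


def kv_projection_params(model_dim, num_heads, num_kv_heads):
    """Weight count of the K and V projections combined (no biases)."""
    head_dim = model_dim // num_heads
    return 2 * model_dim * num_kv_heads * head_dim


print([kv_head_for_query_head(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print(kv_projection_params(512, 8, 4))  # 262144
print(kv_projection_params(512, 8, 8))  # 524288 with one KV head per query head
```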

Optimizer

Muon

weight_decay: null

momentum: 0.95

other_params: {"matrix_lr":0.04}
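
For orientation, a simplified pure-Python sketch of one Muon-style step on a small matrix, using the momentum (0.95) and matrix_lr (0.04) listed above. It accumulates momentum, approximately orthogonalizes the buffer with a quintic Newton-Schulz iteration, and applies the result as the update. The coefficients are those commonly used in public Muon implementations. Nesterov momentum and shape-dependent scaling are left out, and none of the helper names come from the submission.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Push the singular values of g toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # Frobenius-normalize first
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        n = len(xxt)
        poly = [[b * xxt[i][j] + c * xxt2[i][j] for j in range(n)] for i in range(n)]
        px = matmul(poly, x)
        x = [[a * x[i][j] + px[i][j] for j in range(len(x[0]))] for i in range(len(x))]
    return x


def muon_step(weight, grad, buf, lr=0.04, momentum=0.95):
    """One simplified step: momentum buffer, orthogonalize, then descend."""
    buf = [[momentum * m + g for m, g in zip(mr, gr)] for mr, gr in zip(buf, grad)]
    update = newton_schulz(buf)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, buf


w = [[1.0, 0.0], [0.0, 1.0]]
g = [[3.0, 0.0], [0.0, 1.0]]
w, buf = muon_step(w, g, [[0.0, 0.0], [0.0, 0.0]])
print(w)
```

The iteration does not converge to exact orthogonality. It only brings singular values into a band around 1, so gradient directions of very different magnitude (3 and 1 in the toy gradient) end up with updates of similar size.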

Quantization

int8

bits: 8

scope: all

Compression

zlib

level: null
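
A minimal sketch of an int8-then-zlib pipeline. The listing gives no detail beyond "int8, scope: all" and an unspecified zlib level, so this sketch assumes symmetric per-tensor quantization with a single float32 scale and zlib's default level. The function names and the byte layout are hypothetical.

```python
import struct
import zlib


def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def pack(q, scale, level=-1):
    """Serialize as float32 scale + int8 payload, then zlib-compress.

    level=-1 selects zlib's default; the submission's level is not stated.
    """
    raw = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, level)


def unpack(blob):
    """Decompress and dequantize back to floats."""
    raw = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<f", raw, 0)
    q = struct.unpack_from(f"<{len(raw) - 4}b", raw, 4)
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = unpack(pack(q, scale))
print(max(abs(a - b) for a, b in zip(weights, restored)))  # bounded by about scale / 2
```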

Sequence Length

sequence_length

train_length: 1024

eval_length: null

Other

other

Uses a model vocabulary (1024 entries) smaller than the tokenizer vocabulary (8192 entries), which shrinks the embedding table and improves compression efficiency.

parameters: {"model_vocab":1024,"tokenizer_vocab":8192}