PR #1723 (open)

Add Nairi submission: 9L 512D vocab1024

val_bpb: 0.5116
Architecture: Transformer
Optimizer: Muon
Artifact Size:
Training Techniques

Architecture

  • weight tying: Tied input and output embeddings. (parameters: null)
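The PR does not include code, but weight tying is standard: one matrix serves as both the input embedding and the output projection. A minimal numpy sketch, using the 1024-token model vocab and 512-d width from the submission title (the init scale and names are assumptions):

```python
import numpy as np

# One shared matrix for both the input and output side of the model.
V, D = 1024, 512                     # model vocab and width from the title
rng = np.random.default_rng(0)
E = rng.standard_normal((V, D)).astype(np.float32) * 0.02

def embed(token_ids):
    # Input side: look up rows of E.
    return E[token_ids]

def lm_logits(hidden):
    # Output side: project back to the vocab with the same matrix, transposed.
    return hidden @ E.T

h = embed(np.array([3, 7]))          # shape (2, 512)
z = lm_logits(h)                     # shape (2, 1024)
```

Tying removes one V x D parameter matrix, which matters for an artifact-size-constrained submission.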
  • U-Net skip connections: Encoder-decoder style skip connections to improve gradient flow. (parameters: null)
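The skip pattern can be sketched as a stack: the first half of the blocks pushes its activations, the second half pops and adds them back. This is an illustration only; the actual split of the submission's 9 layers is not stated, so a symmetric 4+4 arrangement is assumed here:

```python
import numpy as np

def block(x, w):
    # Stand-in for a transformer block; any residual map serves the sketch.
    return x + np.tanh(x @ w)

def unet_forward(x, enc_ws, dec_ws):
    skips = []
    for w in enc_ws:            # first half: remember each block's output
        x = block(x, w)
        skips.append(x)
    for w in dec_ws:            # second half: add them back in reverse order,
        x = x + skips.pop()     # giving gradients a short path to early layers
        x = block(x, w)
    return x

rng = np.random.default_rng(0)
ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(8)]
y = unet_forward(rng.standard_normal((4, 8)), ws[:4], ws[4:])
```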
  • KV head count: Uses grouped-query style attention with fewer KV heads than attention heads. (parameters: {"heads":8,"kv_heads":4})
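With heads=8 and kv_heads=4 (the reported parameters), each pair of query heads shares one KV head, halving the KV projection size. A simplified single-sequence, unmasked sketch:

```python
import numpy as np

def grouped_query_attention(q, k, v, heads=8, kv_heads=4):
    # heads=8, kv_heads=4 match the submission; no causal mask for brevity.
    T, hd = q.shape[0], q.shape[1] // heads
    q = q.reshape(T, heads, hd)
    k = k.reshape(T, kv_heads, hd)
    v = v.reshape(T, kv_heads, hd)
    group = heads // kv_heads           # 2 query heads share each KV head
    k = np.repeat(k, group, axis=1)     # expand KV heads up to (T, heads, hd)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over key positions
    out = np.einsum("hts,shd->thd", w, v)
    return out.reshape(T, heads * hd)

rng = np.random.default_rng(0)
out = grouped_query_attention(
    rng.standard_normal((6, 128)),      # queries: 8 heads x 16 dims
    rng.standard_normal((6, 64)),       # keys:    4 KV heads x 16 dims
    rng.standard_normal((6, 64)),       # values:  4 KV heads x 16 dims
)
```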
Optimizer

  • Muon (weight_decay: null, momentum: 0.95, other_params: {"matrix_lr":0.04})
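Muon applies momentum SGD to matrix parameters and orthogonalizes each update with a Newton-Schulz iteration. A sketch of one step, using the submission's momentum=0.95 and matrix_lr=0.04; the quintic coefficients are the commonly published ones and this is not the author's exact code:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize the update matrix, as in Muon.
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.04, momentum=0.95):
    # lr (matrix_lr) and momentum are the values reported in the PR.
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 32))
w, buf = muon_step(w, rng.standard_normal((16, 32)), np.zeros_like(w))
```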
Quantization

  • int8 (bits: 8, scope: all)
Compression

  • zlib (level: null)
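The artifact pipeline, quantize weights to int8 and zlib-compress the bytes, can be sketched as below. Per-tensor symmetric scaling is an assumption ("scope: all" only says every tensor is quantized), and since the zlib level is null the library default is used:

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization (scaling scheme assumed).
    scale = np.abs(w).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(w):
    # zlib level is unspecified in the PR, so the default level applies.
    q, scale = quantize_int8(w)
    return zlib.compress(q.tobytes()), scale

def unpack(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
blob, scale = pack(w)
w_hat = unpack(blob, scale, w.shape)  # lossy round trip, error <= scale / 2
```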
Sequence Length

  • sequence_length (train_length: 1024, eval_length: null)
Other

  • other: Uses a smaller model vocabulary size than the tokenizer vocabulary for compression efficiency. (parameters: {"model_vocab":1024,"tokenizer_vocab":8192})
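The PR does not say how the 8192 tokenizer ids map onto 1024 model ids, so no mapping is sketched here; the arithmetic below only quantifies the embedding-parameter savings, assuming the tied 512-d embedding implied by the submission title:

```python
# Embedding parameter count with the full vs. reduced vocabulary
# (model_vocab and tokenizer_vocab from the submission's parameters).
D = 512
model_vocab, tokenizer_vocab = 1024, 8192
full = tokenizer_vocab * D    # 4,194,304 params at full tokenizer vocab
small = model_vocab * D       # 524,288 params at the reduced model vocab
saved = full - small          # 3,670,016 parameters avoided
```

With tied embeddings the saving is taken once, but it also shrinks the int8 artifact that gets zlib-compressed.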

Novel Contributions

  • Encoder-decoder skip connections for improved gradient flow
  • Muon optimizer for fast convergence
  • Tied embeddings
  • Model vocabulary size reduced to 1024 for compression efficiency
  • Int8 quantization with zlib-compressed artifact