| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1365 | Transformer | Muon | 15,759,319 |
## Training Techniques
### Architecture
**XSA**: used in the last 4 layers of the model. Parameters: `{"layers": 4}`
**Partial RoPE**: applies rotary positional embeddings to only part of the dimensions (16 of 64). Parameters: `{"dimensions": "16/64"}`
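A minimal sketch of partial RoPE under the stated 16/64 split: only the first 16 dimensions of a 64-dimensional vector are rotated, the rest pass through unchanged. Function and variable names here are illustrative, not taken from the submission.

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` entries of vector `x` (length 64 here)
    by position-dependent angles; remaining dimensions pass through unchanged."""
    half = rope_dims // 2
    out = list(x)
    for i in range(half):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

At position 0 all angles are zero, so the vector is returned unchanged; at later positions only the first 16 dimensions move.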
**SmearGate**: adds a SmearGate to the architecture. Parameters: none
**BigramHash**: a BigramHash component with a 10,240-entry vocabulary/hash table. Parameters: `{"size": 10240}`
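The submission does not specify the hash function, but the idea of a bigram hash is to map each (previous token, current token) pair into a fixed-size table of 10,240 buckets, each of which can index an auxiliary embedding. A hypothetical sketch:

```python
def bigram_hash_ids(tokens, size=10240, prime=1000003):
    """Map each (previous, current) token pair to a bucket in a `size`-entry
    hash table; the first position pairs with a BOS sentinel (-1).
    The multiplier/mix is illustrative, not the submission's actual hash."""
    ids = []
    prev = -1
    for t in tokens:
        ids.append(((prev * prime) ^ t) % size)
        prev = t
    return ids
```

Each resulting id would select one row of a 10,240-row embedding table that is added to (or concatenated with) the normal token embedding.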
**MLP3x**: a 3x-wider MLP block. Parameters: `{"layers": 3}`
### Weight Averaging

**EMA** (exponential moving average of weights). Parameters: `{"decay": 0.997}`
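EMA weight averaging keeps a shadow copy of the parameters updated as `shadow = decay * shadow + (1 - decay) * current` after each step, and the shadow copy is what gets evaluated/shipped. A minimal sketch with scalar parameters (a real implementation would update tensors in place; the class name is illustrative):

```python
class EmaWeights:
    """Exponential moving average of model parameters (decay 0.997 here)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)  # copy: name -> value

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            # shadow <- d * shadow + (1 - d) * current
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value
```

With decay 0.997, the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.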
### Regularization

**LN Scale**. Parameters: none
### Quantization

**Mixed int5/int6**: int5 for the MLP weights, int6 for the attention weights.
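The submission's exact quantization scheme (grouping, rounding, symmetric vs. asymmetric) is not given; a plain symmetric per-tensor scheme, parameterized by bit width so it covers both the 5-bit MLP case and the 6-bit attention case, looks like this:

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integer codes.
    Returns (codes, scale) such that value ~= code * scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 15 for int5, 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid scale 0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

Going from 6 to 5 bits halves the number of representable levels, so applying int5 only to the (more redundant) MLP weights while keeping attention at int6 is a reasonable size/quality trade-off.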
### Compression

**zstd** at level 22.
### Evaluation

**Sliding window eval**. Parameters: `{"stride": 64}`
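In the usual strided sliding-window evaluation, each new window advances by the stride and only the final `stride` tokens are scored, so every token is evaluated exactly once with near-maximal left context. A sketch of the window bookkeeping (the helper name and the exact policy for the first window are assumptions, not from the submission):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, n_scored) spans: score the last `n_scored` tokens of
    each window so every token is evaluated exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        # First window scores everything it covers; later windows score
        # only the `stride` fresh tokens at their right edge.
        n_scored = min(stride, n_tokens - pos) if pos else min(window, n_tokens)
        end = pos + n_scored
        start = max(0, end - window)
        spans.append((start, end, n_scored))
        pos = end
    return spans
```

A smaller stride gives each scored token more context (and a lower, more honest bpb) at the cost of proportionally more forward passes.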
### Optimizer

**Muon**: weight_decay 0.04, momentum 0.99.

**AdamW**: hyperparameters not specified.
### LR Schedule

**Warmdown**. Parameters: `{"warmdown_steps": 3000}`
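A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps (3,000 here). A minimal sketch, assuming no warmup phase since none is listed:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to zero over the final
    `warmdown_steps` training steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

For a 10,000-step run the rate stays at `base_lr` through step 6,999, then falls linearly, reaching zero at the final step.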
### Sequence Length

Train length 2048; eval length not specified.
## Novel Contributions
- 10-layer 512d Transformer with XSA in the last 4 layers
- EMA with decay 0.997
- Partial RoPE applied to 16/64 dimensions
- LN Scale
- SmearGate and BigramHash(10240, 128)
- Mixed int5 MLP / int6 attention quantization
- 3.2% pruning
- zstd-22 artifact compression
- Sliding window evaluation with stride 64