val_bpb: 1.2037
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12,499,612 bytes
Training Techniques
Quantization
- mixed int5/int6 (scope: INT5 for MLP weights, INT6 for attention weights)
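The report does not describe the quantizer itself; below is a minimal sketch of symmetric per-tensor quantization at the listed bit widths (5 bits for MLP weights, 6 for attention), assuming a plain scale-and-round scheme.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to `bits` signed levels (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for INT5, 31 for INT6
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Hypothetical usage mirroring the stated scope.
q_mlp, s_mlp = quantize_symmetric(torch.randn(512, 1536), bits=5)    # MLP weight
q_attn, s_attn = quantize_symmetric(torch.randn(512, 512), bits=6)   # attention weight
```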
Architecture
- SmearGate: custom gating mechanism used in the model.
- BigramHash: bigram hashing component with 2048 dimensions.
- OrthoInit: orthogonal initialization.
- MLP3x: transformer MLP expanded to 3x the hidden size (10 layers, dim 512, mlp_multiplier 3).
- tied embeddings: FP16 tied embedding weights.
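A minimal sketch of weight tying between the token embedding and the output head; the vocabulary size and the idea that the FP16 cast happens at export time are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50257, 512                     # vocab size is illustrative
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight                    # tie: one shared parameter for input and output

# Illustrative export step: the single shared matrix is stored in FP16.
state = {"tok_embeddings.weight": embed.weight.detach().half()}
```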
Optimizer
- Muon: weight_decay 0.04, decoupled weight decay.
- AdamW: weight_decay 0.04 (scope: embeddings and scalars).
Weight Averaging
- EMA: decay 0.999, updated every 10 steps.
Compression
- zstd, level 22.
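A minimal sketch of compressing the exported artifact with zstd at level 22 using the Python zstandard bindings; the file names are placeholders.

```python
import zstandard as zstd

compressor = zstd.ZstdCompressor(level=22)       # level 22 is zstd's maximum standard level

# Placeholder paths: compress the exported artifact to a .zst file.
with open("model.bin", "rb") as src, open("model.bin.zst", "wb") as dst:
    dst.write(compressor.compress(src.read()))
```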
Evaluation
- sliding window eval: stride 64, sequence length 2048.
Initialization
- OrthoInit: orthogonal initialization used for model weights.
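A minimal sketch applying PyTorch's orthogonal initializer to 2-D weight matrices; which parameters it was actually applied to is not stated in the report.

```python
import torch.nn as nn

def init_orthogonal_(model: nn.Module):
    """Apply orthogonal initialization to every weight tensor with 2+ dimensions."""
    for p in model.parameters():
        if p.ndim >= 2:
            nn.init.orthogonal_(p)

layer = nn.Linear(512, 512)
init_orthogonal_(layer)
```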
Sequence Length
- train length: not specified
- eval length: 2048
LR Schedule
- warmdown: 3000 warmdown steps.
Regularization
- weight decay: 0.04
- pruning: 3% sparsity
Novel Contributions
- 10-layer transformer with mixed INT5/INT6 quantization
- SmearGate + BigramHash + OrthoInit integration
- Muon optimizer with decoupled weight decay
- EMA weight averaging
- 3% magnitude pruning before export
- Sliding window evaluation with stride 64
- RoPE base 50K (see the sketch after this list)
- Late-K passthrough for the last 2 layers
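The contributions list mentions RoPE with base 50K but gives no further detail; below is a minimal sketch of the standard rotary-embedding angle table with base theta = 50,000 and an assumed head dimension of 64.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 50_000.0):
    """Standard RoPE angle table with base theta = 50,000 (illustrative)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)    # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_frequencies(head_dim=64, seq_len=2048)   # head_dim is an assumption
```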