val_bpb: 1.3572
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.47 MB
Training Techniques
Architecture
Partial RoPE
Applied rotary position embeddings to only a subset of attention head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64,"fraction":0.25}
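A minimal NumPy sketch of partial RoPE as described above: only the first 16 of 64 head dimensions are rotated, and the rest pass through position-free. The rotation layout and frequency base are the common RoPE conventions, assumed rather than taken from the run.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` dims of each head.

    x: (seq_len, head_dim) array for one attention head.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    pos = np.arange(seq_len)[:, None]            # (seq, 1)
    freqs = base ** (-np.arange(half) / half)    # (half,)
    angles = pos * freqs                         # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Remaining head_dim - rot_dims dimensions stay position-free.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Note that position 0 is left unchanged (zero rotation angle), which makes the untouched-dimension behavior easy to verify.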
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
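A sketch of the GQA sharing pattern with the 8-query-head / 4-KV-head split from the parameters above: each KV head is broadcast to its group of query heads before standard scaled dot-product attention. Shapes and the softmax details are illustrative.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention.

    q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads | n_heads.
    """
    group = q.shape[0] // k.shape[0]
    # Each KV head serves a contiguous group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```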
SmearGate
Included a SmearGate module in the model architecture.
parameters: null
BigramHash
Used BigramHashEmbedding for token representation.
parameters: {"bigrams":10240,"dim":128}
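One plausible reading of a bigram-hash embedding, sized to the 10240-entry, 128-dim table above: each position hashes its (previous, current) token pair into a fixed table. The specific hash function and how the result is mixed with the unigram embedding are assumptions, not the run's exact scheme.

```python
import numpy as np

def bigram_hash_embed(token_ids, table):
    """Look up an embedding per position keyed by the (prev, cur) bigram.

    table: (n_bigrams, dim) array, e.g. (10240, 128).
    """
    n_bigrams = table.shape[0]
    prev = np.concatenate([[0], token_ids[:-1]])    # pad the first position
    idx = (prev * 1000003 + token_ids) % n_bigrams  # illustrative hash
    return table[idx]
```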
MLP3x
Expanded the MLP hidden dimension to 3x the model dimension (vs. the conventional 4x).
parameters: {"multiplier":3}
ReLU²
Used ReLU squared activation in the MLP.
parameters: null
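The two MLP entries above combine naturally into one block: a 3x hidden expansion with a ReLU-squared activation. A minimal sketch (weight scales and the small `d_model` are illustrative):

```python
import numpy as np

def mlp_relu2(x, w_in, w_out):
    """MLP block: hidden width 3x the model width, ReLU-squared activation."""
    h = np.maximum(x @ w_in, 0.0) ** 2   # relu(x)^2
    return h @ w_out

d_model = 8
w_in = np.random.randn(d_model, 3 * d_model) * 0.1   # 3x expansion
w_out = np.random.randn(3 * d_model, d_model) * 0.1
```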
Sequence Length
train_length: 1024
eval_length: null
Weight Averaging
SWA
parameters: {"start_phase":"warmdown","checkpoint_interval":25}
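A sketch of the SWA setup above: a running average of the weights, updated every 25 steps once the warmdown phase begins. Whether the run used a running mean or averaged saved checkpoints after the fact is an assumption; the incremental-mean form below is the common choice.

```python
import numpy as np  # params below are dicts of NumPy arrays

class SWAAverager:
    """Running average of model weights during warmdown, every `interval` steps."""

    def __init__(self, interval=25):
        self.interval = interval
        self.avg = None
        self.count = 0

    def maybe_update(self, step, params, in_warmdown):
        if not in_warmdown or step % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k in self.avg:
                # Incremental mean: avg += (x - avg) / n
                self.avg[k] += (params[k] - self.avg[k]) / self.count
```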
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
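The warmdown schedule above admits a simple reading: hold the base learning rate constant, then decay linearly to zero over the final 1000 steps. The linear shape and the decay-to-zero endpoint are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=1000):
    """Constant LR, then linear decay to zero over the last warmdown_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```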
Regularization
magnitude pruning
parameters: {"sparsity":0.1}
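Magnitude pruning at 10% sparsity, as listed above, zeroes the smallest-magnitude 10% of weights. A minimal unstructured, per-tensor sketch (the run's granularity, e.g. per-layer vs. global, is not specified):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.1):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```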
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"AdamW"}
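Muon's core step orthogonalizes each 2-D gradient (or momentum) matrix via a Newton–Schulz iteration before the update; per the record above, embeddings and scalars fall back to AdamW. A sketch using the commonly published quintic coefficients, which are an assumption about this run, not confirmed values:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # common quintic coefficients (assumed)
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # work with a wide matrix
    for _ in range(steps):
        aat = x @ x.T
        x = a * x + (b * aat + c * aat @ aat) @ x
    return x.T if transposed else x
```

After a few iterations the singular values cluster near 1, so the update direction depends on the gradient's row/column spaces rather than its magnitudes.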
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
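A generic symmetric quantizer sketch for the entry above. The run's exact int5/int6 scheme, and which layers get 5 vs. 6 bits, are unspecified; this shows the standard per-tensor round-to-nearest form at a given bit width.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to `bits` bits (generic sketch)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```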
Novel Contributions
- Partial RoPE applied to 16 of 64 attention dimensions
- Longer training sequence length of 1024
- Reduced tokens per batch to fit more optimization steps under the wallclock limit
- Warmdown and SWA tuning for the run
- Aggressive magnitude pruning before quantization
- Reproducibility records with three seeds