val_bpb: 1.2036
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
tied embeddings
Input and output embedding weights are shared (tied); see the sketch below.
parameters: null
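A minimal PyTorch sketch of the tying, assuming a standard decoder-only LM; module names are illustrative:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Decoder-only LM skeleton with tied input/output embeddings."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie: the LM head reuses the embedding weight, so both paths
        # train one shared (vocab_size, d_model) parameter.
        self.lm_head.weight = self.embed.weight
```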
RoPE
Applies NTK-aware RoPE base scaling to support longer-context training and evaluation; a sketch follows.
parameters: null
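A sketch of the usual NTK-aware formulation, in which the RoPE base is stretched by the context-extension factor raised to dim/(dim-2); function and argument names are illustrative:

```python
import torch

def rope_freqs(dim: int, max_len: int, base: float = 10000.0,
               scale: float = 1.0) -> torch.Tensor:
    """Rotary angles with NTK-aware base rescaling.

    `scale` is the context-extension factor (target_len / trained_len).
    Stretching the base makes low frequencies interpolate while high
    frequencies stay close to the original values.
    """
    if scale > 1.0:
        base = base * scale ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_len).float()
    return torch.outer(t, inv_freq)  # (max_len, dim/2) rotation angles
```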
KV head count
Uses grouped-query attention (GQA): 8 query heads share 4 key/value heads (sketch below).
parameters: {"num_heads":8,"num_kv_heads":4}
phase-transition resid_mix
Applies phase-transition residual mixing to the residual stream; an illustrative sketch follows.
parameters: null
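The entry records no parameters, so the following is only a hypothetical reading: "phase-transition" is taken as a sigmoid schedule that moves a residual mixing coefficient sharply from 0 toward 1 during training. All names and constants here are invented for illustration:

```python
import torch
import torch.nn as nn

class ResidMix(nn.Module):
    """Hypothetical residual mix: y = (1 - a) * x + a * block(x).

    'Phase-transition' is read as a sigmoid schedule on the mixing
    coefficient `a`, which transitions sharply around `center` steps.
    Illustrative only; the entry does not define the actual rule.
    """
    def __init__(self, block: nn.Module, center: int = 1000,
                 sharpness: float = 0.01):
        super().__init__()
        self.block, self.center, self.sharpness = block, center, sharpness
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.sharpness * (self.step.float() - self.center))
        if self.training:
            self.step += 1  # advance the schedule once per training call
        return (1.0 - a) * x + a * self.block(x)
```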
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.04}
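A hedged sketch of how these hyperparameters could be wired up, assuming a modded-nanogpt-style split in which 2-D hidden matrices go to Muon while scalars and the tied embedding go to AdamW; the Muon import path and constructor signature are assumptions, not a documented API:

```python
import torch
from torch import nn

try:
    from muon import Muon          # hypothetical import path
except ImportError:
    Muon = torch.optim.SGD         # stand-in so the sketch runs

def build_optimizers(model: nn.Module):
    matrix, scalar, embed = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:
            embed.append(p)        # tied embedding -> tied_embed_lr
        elif p.ndim >= 2:
            matrix.append(p)       # hidden matrices -> Muon at matrix_lr
        else:
            scalar.append(p)       # biases / norm gains -> scalar_lr
    muon = Muon(matrix, lr=0.02, momentum=0.99, weight_decay=0.02)
    adamw = torch.optim.AdamW(
        [{"params": scalar, "lr": 0.02},   # scalar_lr
         {"params": embed, "lr": 0.04}],   # tied_embed_lr
        weight_decay=0.02)
    return muon, adamw
```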
Evaluation
sliding window eval
parameters: {"stride":64}
Initialization
overtone init
Uses overtone embedding initialization.
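No parameters are recorded and the method is not specified here; as a purely speculative reading, "overtone" is taken to mean that each embedding row is a random mixture of harmonic (integer-multiple) sinusoids across the feature axis. Every name and constant below is invented for illustration:

```python
import math
import torch

def overtone_embedding_init(vocab_size: int, d_model: int,
                            n_harmonics: int = 8,
                            scale: float = 0.02) -> torch.Tensor:
    """Hypothetical 'overtone' init: random mixtures of harmonic sinusoids."""
    pos = torch.linspace(0, 2 * math.pi, d_model)
    harmonics = torch.stack([torch.sin((k + 1) * pos)
                             for k in range(n_harmonics)])
    coeffs = torch.randn(vocab_size, n_harmonics) / math.sqrt(n_harmonics)
    return scale * (coeffs @ harmonics)  # (vocab_size, d_model)
```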
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
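A small sketch of packing a flat token stream into training rows of the recorded length; the extra token per row supplies the shifted next-token targets:

```python
import torch

def pack_sequences(stream: torch.Tensor, seq_len: int = 4096) -> torch.Tensor:
    """Chop a 1-D token stream into (N, seq_len + 1) rows; each row
    yields seq_len inputs and seq_len shifted targets."""
    n = (stream.size(0) - 1) // seq_len
    x = stream[: n * seq_len + 1]
    return x.unfold(0, seq_len + 1, seq_len)  # rows overlap by one token
```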
LR Schedule
extended warmup
parameters: {"warmup_steps":1500}
Quantization
mixed-bit lowbit export
bits: null
scope: selected block weights
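The bit width is unrecorded (null), so the sketch below uses a placeholder of 4 bits with symmetric per-tensor quantization for the selected block weights, keeping everything else, including the tied embeddings, in FP16; all names are illustrative:

```python
import torch

def export_mixed_bit(state_dict: dict, lowbit_keys: set[str],
                     bits: int = 4) -> dict:
    """Hypothetical mixed-bit export: selected block weights are
    quantized to `bits` (placeholder value; int8 container),
    everything else is written in FP16."""
    out, qmax = {}, 2 ** (bits - 1) - 1
    for name, w in state_dict.items():
        if name in lowbit_keys and w.ndim >= 2:
            scale = w.abs().max().clamp(min=1e-8) / qmax
            q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
            out[name] = {"q": q.to(torch.int8), "scale": scale, "bits": bits}
        else:
            out[name] = w.half()   # FP16 export path (incl. tied embeddings)
    return out
```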
Novel Contributions
- Long-context training with sequence length 4096
- Sliding-window evaluation with stride 64
- FP16 tied embedding export
- Overtone embedding initialization
- Phase-transition residual mixing
- NTK-aware RoPE scaling
- Lower learning rates with higher Muon momentum and extended warmup
- Optional mixed-bit lowbit export for deeper models