| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1787 | Transformer | Muon | 15.56 MB |
Training Techniques
Architecture
Transformer depth / tied embeddings / KV head count
10-layer transformer with 512-dimensional hidden size, 8 attention heads, 4 KV heads, and tied embeddings.
parameters: {"layers":10,"dimensions":512,"heads":8,"kv_heads":4}
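For concreteness, the parameter count implied by this configuration can be estimated in a few lines. The vocabulary size and the 4x MLP expansion below are assumptions for illustration only; the card specifies just layers, dimensions, heads, and kv_heads.

```python
# Rough parameter-count estimate for the card's architecture.
# vocab size and 4x MLP expansion are ASSUMPTIONS, not stated in the card.
cfg = {"layers": 10, "dimensions": 512, "heads": 8, "kv_heads": 4}
vocab = 32768                          # assumed for illustration
d = cfg["dimensions"]
head_dim = d // cfg["heads"]           # 64
kv_dim = cfg["kv_heads"] * head_dim    # 256 (grouped-query attention)

attn = d * d + 2 * d * kv_dim + d * d  # Q, K, V, O projections
mlp = 2 * d * (4 * d)                  # up and down projections, 4x expansion
per_layer = attn + mlp
total = cfg["layers"] * per_layer + vocab * d  # tied embeddings counted once

print(per_layer, total)
```

With these assumptions the attention block contributes 786,432 parameters per layer; actual totals depend on the real vocabulary and MLP width.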
weight tying
Input and output embeddings are tied; the shared matrix is kept in FP16 so that int8 quantization error does not compound through both the embedding lookup and the LM head.
parameters: null
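A minimal sketch of what weight tying means operationally: one matrix serves as both the input embedding table and (transposed) the LM head. Sizes here are illustrative, not the card's.

```python
import numpy as np

# Weight tying: the same matrix embeds tokens on the way in and produces
# logits on the way out. It is stored in FP16 here, mirroring the card's
# choice to keep the twice-used matrix out of int8 quantization.
vocab, d = 1000, 512                    # illustrative sizes
emb = (np.random.randn(vocab, d) * 0.02).astype(np.float16)

tokens = np.array([1, 5, 9])
h = emb[tokens].astype(np.float32)      # input embedding lookup
logits = h @ emb.T.astype(np.float32)   # LM head reuses the same matrix
print(logits.shape)                     # (3, 1000)
```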
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: 0.98
other_params: {"matrix_lr":0.03,"scalar_lr":0.03}
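A minimal NumPy sketch of a Muon-style step, assuming the published Muon recipe: momentum SGD on matrix parameters, with each update orthogonalized by a quintic Newton-Schulz iteration before being applied. The learning rate (matrix_lr = 0.03) and momentum (0.98) come from the card; the coefficients are those of the public Muon implementation, and everything else is illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize g (push all singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # Frobenius-normalize first
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # work with the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.03, momentum=0.98):
    """One Muon update on a matrix parameter: momentum accumulation,
    then an orthogonalized step. lr and momentum match the card."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orthogonalize(buf)
    return w, buf
```

The separate scalar_lr = 0.03 would apply to non-matrix parameters, which Muon hands to a conventional optimizer.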
LR Schedule
warmdown
parameters: {"warmdown_steps":15000,"always_decaying":true}
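One plausible reading of `always_decaying: true` is that instead of holding the learning rate constant and only decaying over the final warmdown_steps, the rate decays linearly over the whole run. A sketch under that assumption, with both behaviors for contrast:

```python
def warmdown_lr(step, total_steps, base_lr=0.03,
                warmdown_steps=15000, always_decaying=True):
    """Linear warmdown schedule. With always_decaying=True (our reading of
    the card's setting), the LR decays over the entire run, steadily
    shrinking update sizes; otherwise it is constant until the final
    warmdown_steps, then decays linearly to zero."""
    if always_decaying:
        return base_lr * (1 - step / total_steps)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```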
Regularization
gradient clipping
parameters: {"grad_clip_norm":1}
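With grad_clip_norm = 1, the global L2 norm of all gradients is capped at 1 before each step. A minimal sketch:

```python
import numpy as np

def clip_global_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm
    (max_norm = 1 matches the card's grad_clip_norm)."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total
```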
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q projections","V projections","LM head"],"chunk_size":256}
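A sketch of a rank-8 LoRA adapter on a single frozen projection, as used in test-time training here: the effective weight is W plus a low-rank update, and only the two small factors would be trained on each 256-token chunk at inference. The rank and target matrices come from the card; shapes and the alpha scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16  # alpha is an assumed scaling

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base projection
A = rng.standard_normal((r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, zero init

def lora_forward(x):
    """y = x W^T plus the low-rank correction. With B initialized to zero,
    the adapter is a no-op until test-time training updates it."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((256, d_in))            # one chunk_size=256 chunk
y = lora_forward(x)
print(y.shape)
```

At TTT time, a few gradient steps on A and B per chunk adapt the frozen Q/V projections and LM head to the local context.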
Initialization
spectral init / residual mixing
Overtone spectral embedding initialization with phase-transition residual mixing.
Quantization
int8
bits: 8
scope: per-row weights
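Per-row int8 quantization gives each weight row its own scale, so the error is bounded by that row's dynamic range rather than the whole tensor's. A minimal sketch:

```python
import numpy as np

def quantize_per_row(w):
    """Symmetric int8 quantization with one max-abs scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)    # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_per_row(w)
err = np.abs(dequantize(q, s) - w).max()        # bounded by scale / 2 per row
```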
Compression
zlib
level: null
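The quantized int8 bytes are then compressed with zlib (level unspecified in the card; zlib's default is 6). Since quantized weights have low byte-level entropy, the entropy-coding stage typically shaves off a further fraction. A roundtrip sketch with synthetic int8 data:

```python
import zlib
import numpy as np

# Synthetic int8 "weights" with a Gaussian-ish distribution, standing in
# for quantized model weights; sizes and sigma are illustrative.
rng = np.random.default_rng(0)
q = np.clip(rng.normal(0, 20, size=1 << 16), -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)                       # default compression level
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
print(len(raw), len(packed))                      # packed is smaller
```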
Novel Contributions
- 10-layer transformer with tuned hyperparameters for the 10-minute budget
- Sequence length increased to 2048 for richer context
- Always-decaying warmdown schedule that keeps shrinking weight magnitudes throughout training, reducing the quantization penalty
- Test-time training with batched LoRA adapters on Q, V projections and LM head
- Overtone spectral embedding initialization with phase-transition residual mixing
- Int8 per-row quantization combined with zlib compression
- FP16 tied embeddings to reduce quantization error compounding