val_bpb: 1.4106
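The headline metric is validation bits per byte. As a reference, bpb is usually derived from the mean token-level cross-entropy; the sketch below is a generic conversion (the function name and example numbers are assumptions, not values from this report):

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte of raw text."""
    total_bits = mean_ce_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Sanity check: ln(2) nats over one token of one byte is exactly 1 bit/byte.
print(bits_per_byte(math.log(2), 1, 1))
```

Because the denominator is bytes rather than tokens, bpb is comparable across tokenizers, which is why it is a common speedrun metric.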
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11,124,153 bytes
Training Techniques
Evaluation: sliding window eval (parameters: {"stride": 64})
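Sliding-window evaluation scores every token with near-maximal left context by advancing the window in small strides and counting only the newly covered tokens. A minimal sketch, assuming a per-token loss function and a window size of 128 (only the stride of 64 is specified in this report):

```python
def sliding_window_eval(score_fn, tokens, window=128, stride=64):
    """Mean per-token loss under strided sliding-window evaluation.

    `score_fn(context)` must return one loss per token of `context`.
    Each window overlaps the previous one; only the tokens not yet
    scored (at most `stride` of them) are counted, so every token is
    scored exactly once.
    """
    total_loss, total_count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        trg_len = end - prev_end              # tokens not yet scored
        losses = score_fn(tokens[begin:end])  # per-token losses for this window
        total_loss += sum(losses[-trg_len:])
        total_count += trg_len
        prev_end = end
        if end == len(tokens):
            break
    return total_loss / total_count
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes; stride 64 is a middle ground.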
Quantization: fp16 (bits: 16, scope: tied embeddings)
Architecture: tied embeddings (FP16 tied embedding export with tok_emb kept in fp16); weight tying (tied embedding setup as part of the PR60-style stack)
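With weight tying, the LM head reuses the token embedding matrix, so the exported artifact only needs one copy of it, and keeping that copy in fp16 halves its footprint. A minimal sketch using NumPy serialization (the function names and `.npy` format are assumptions; the report does not specify the export path):

```python
import io
import numpy as np

def export_tied_fp16(tok_emb: np.ndarray) -> bytes:
    """Serialize the shared embedding matrix once, cast to fp16."""
    buf = io.BytesIO()
    np.save(buf, tok_emb.astype(np.float16))
    return buf.getvalue()

def load_tied(blob: bytes):
    """Restore the matrix and tie the LM head to it (same object, no copy)."""
    tok_emb = np.load(io.BytesIO(blob)).astype(np.float32)
    lm_head = tok_emb  # weight tying: head and embedding share storage
    return tok_emb, lm_head
```

For a vocab-sized matrix this single fp16 tensor typically dominates a small model's artifact size, which is why the export keeps tok_emb in fp16 rather than fp32.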
Transformer: 10-layer transformer model (parameters: {"layers": 10})
Optimizer: Muon (weight_decay: null, momentum: null, other_params: {"decoupled_weight_decay": true})
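"Decoupled" weight decay means the decay is applied directly to the weights, scaled only by the learning rate, rather than being folded into the gradient that the optimizer transforms. A minimal single-parameter sketch (the function name is hypothetical, and the actual Muon update is elided behind `update`):

```python
def decoupled_step(param: float, update: float, lr: float, wd: float) -> float:
    """One step with decoupled weight decay (AdamW-style).

    The decay shrinks the weight independently of the optimizer's
    update, so it is unaffected by Muon's momentum/orthogonalization.
    """
    param = param * (1.0 - lr * wd)  # decoupled decay, scaled by lr only
    return param - lr * update       # optimizer-derived update (e.g. Muon)
```

Coupling the decay into the gradient would instead pass it through the optimizer's rescaling, changing its effective strength per layer; decoupling keeps it uniform.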
Initialization: spectral init (Overtone spectral embedding initialization); resid mix (Phase-transition residual mixing)
Regularization: weight decay (parameters: {"type": "decoupled"})
Compression: zlib (level: null)
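The reported artifact size is for the zlib-compressed export. A minimal sketch of the round trip (the level shown is an assumption, since the report leaves it null):

```python
import zlib

def compress_artifact(raw: bytes, level: int = 6) -> bytes:
    """Compress serialized weight bytes with zlib; level 6 is the default."""
    return zlib.compress(raw, level)

def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original serialized bytes."""
    return zlib.decompress(blob)
```

On fp16 weight tensors the zlib gain is modest compared to text, but it is lossless, so the decompressed checkpoint is bit-identical.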
Novel Contributions
- Non-record local reproduction of the PR60-style stack on 1x A100 hardware
- Sliding-window final evaluation with stride 64
- FP16 tied embedding export with tok_emb kept in fp16
- 10-layer transformer configuration
- Decoupled Muon weight decay
- Overtone spectral embedding initialization
- Phase-transition residual mixing
- Demonstrates a compute-scaling negative result under a strict 10-minute train cap