PR #244
open
Non-record: leader-core valid-eval parity run + 1xH100 proxy screens
by simon-marcus
val_bpb
1.2064
Architecture
Transformer
Optimizer
Muon
Artifact Size
15294320 bytes
Training Techniques
Architecture
weight tying
Uses tied embeddings / tied token embedding settings in the leader-core merge candidate.
parameters: null
Quantization
int8
bits: 8
scope: model weights / export
int8
bits: 8
scope: token embeddings
Compression
zlib
level: null
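A minimal sketch of how an int8 export followed by zlib compression can fit together. It assumes symmetric per-tensor scaling, a hypothetical header layout, and zlib's default level (the card lists the level as null). None of these details are confirmed by the PR.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (assumed scheme): returns (ints, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


def pack_and_compress(q, scale, level=-1):
    """Serialize as [float32 scale][uint32 count][int8 payload], then zlib-compress.

    The header layout is hypothetical; level=-1 selects zlib's default level.
    """
    raw = struct.pack("<fI", scale, len(q)) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, level)


def decompress_and_unpack(blob):
    """Inverse of pack_and_compress: returns (ints, scale)."""
    raw = zlib.decompress(blob)
    scale, n = struct.unpack_from("<fI", raw, 0)
    q = list(struct.unpack_from(f"<{n}b", raw, 8))
    return q, scale
```

The compressed byte length of such a blob is the kind of quantity an artifact-size figure would measure; the real export format may differ.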
Evaluation
validity-safe eval path
parameters: null
non-overlapping final eval
parameters: null
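A sketch of what a non-overlapping final evaluation and a bits-per-byte score could look like. It assumes val_bpb means validation bits per byte and that each token is scored exactly once; the helper names and window size are hypothetical.

```python
import math


def non_overlapping_windows(n_tokens, window):
    """Yield (start, end) spans that tile [0, n_tokens) with no overlap."""
    for start in range(0, n_tokens, window):
        yield start, min(start + window, n_tokens)


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed token NLL (in nats) into bits per byte of raw text."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the spans never overlap, every token contributes to the summed loss once, which keeps the score comparable across runs.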
Initialization
OvertoneInit
Training core rooted in the official SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit recipe.
LR Schedule
warmdown
parameters: {"warmdown_iters":800}
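One common reading of a warmdown schedule, sketched under assumptions: the learning-rate multiplier stays at 1.0 and then decays linearly to zero over the final `warmdown_iters` steps. The `total_steps` argument and the linear shape are assumptions; only the value 800 comes from the card.

```python
def warmdown_lr_scale(step, total_steps, warmdown_iters=800):
    """LR multiplier: 1.0 until the last `warmdown_iters` steps, then linear decay to 0."""
    if warmdown_iters <= 0:
        return 1.0
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(remaining, 0) / warmdown_iters
```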
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
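A sketch of global-norm gradient clipping with the listed threshold of 0.3. Gradients are represented as plain lists of floats for illustration; the PR's actual training loop is not shown in the card.

```python
import math


def clip_grad_norm(grads, max_norm=0.3, eps=1e-6):
    """Scale all gradients so their global L2 norm is at most max_norm.

    `grads` is a list of flat gradient lists, one per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in tensor] for tensor in grads]
    return clipped, total_norm
```

Gradients whose global norm is already below the threshold pass through unchanged, since the scaling coefficient is capped at 1.0.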
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: null
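A rough pure-Python sketch of a Muon-style update, assuming the commonly published formulation: a momentum buffer whose matrix is approximately orthogonalized by a quintic Newton-Schulz iteration before being applied. The coefficients, step count, plain (non-Nesterov) momentum, and `lr` argument are assumptions; only momentum 0.99 comes from the card.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed coefficients
    flip = len(g) > len(g[0])  # iterate on the wide orientation
    x = transpose(g) if flip else [row[:] for row in g]
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * p + q for p, q in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if flip else x


def muon_step(weight, grad, buf, lr, momentum=0.99):
    """One sketched update; returns (new_weight, new_buf) without mutating inputs."""
    new_buf = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    update = newton_schulz(new_buf)
    new_weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return new_weight, new_buf
```

The iteration pushes the singular values of the update toward 1 without computing an SVD, so the result is only approximately orthogonal.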
Other
other
Temperature-only post-quant search used after export.
parameters: null
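A sketch of a temperature-only search under stated assumptions: the exported model's logits are held fixed and a single scalar temperature is chosen from a candidate grid to minimize mean negative log-likelihood. The grid, the function names, and the choice of data used for the search are hypothetical.

```python
import math


def mean_nll(logits_rows, targets, temperature):
    """Mean negative log-likelihood (nats) of targets under softmax(logits / T)."""
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        scaled = [x / temperature for x in logits]
        m = max(scaled)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[target]
    return total / len(targets)


def search_temperature(logits_rows, targets, candidates):
    """Return the candidate temperature with the lowest mean NLL."""
    return min(candidates, key=lambda t: mean_nll(logits_rows, targets, t))
```

Since only one scalar changes, the search leaves the quantized weights and the artifact untouched apart from the stored temperature.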
Novel Contributions
- Validity-safe merge rooted in the official SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit training core
- Non-overlapping final evaluation
- Stronger int8 export search
- Temperature-only post-quant search
- Saved RunPod local-disk parity run logs
- Saved 1xH100 proxy-screen logs
- Proxy ablations over learning-rate, warmdown, gradient clipping, Muon momentum, and token-embedding int8 settings
- Identified warmdown800 + matrixlr006 as the strongest tested proxy combination