PR #244

open

Non-record: leader-core valid-eval parity run + 1xH100 proxy screens

by simon-marcus
val_bpb: 1.2064
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15294320 bytes

Training Techniques

Architecture
weight tying
Uses tied embeddings / tied token embedding settings in the leader-core merge candidate.
parameters: null
Quantization
int8
bits: 8
scope: model weights / export
int8
bits: 8
scope: token embeddings
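
For readers unfamiliar with int8 export, the sketch below shows one common scheme: symmetric per-row quantization with a single float scale per row. The PR summary does not say which int8 scheme or search the run used, so the function names and the per-row choice here are illustrative assumptions, not the author's export code.

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization (assumed scheme, not from the PR).

    Returns (codes, scale) where each code is an int in [-127, 127]
    and row[i] is approximated by codes[i] * scale.
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    # Guard against an all-zero row so we never divide by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to floats."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(row)
restored = dequantize_int8(codes, scale)
```

With this scheme the reconstruction error per weight is at most half a quantization step (`scale / 2`), which is why rows with a few large outliers lose more precision than rows with a tight value range.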
Compression
zlib
level: null
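
As a rough illustration of how quantized weights turn into an artifact size, the sketch below packs signed int8 codes into bytes and compresses them with Python's standard `zlib` module. The compression level is not recorded in the summary (`level: null`), so the default level is assumed here, and `pack_int8` / `unpack_int8` are hypothetical helper names.

```python
import zlib


def pack_int8(codes):
    """Pack signed ints in [-128, 127] into raw bytes (two's complement)."""
    return bytes(c & 0xFF for c in codes)


def unpack_int8(data):
    """Inverse of pack_int8."""
    return [b - 256 if b > 127 else b for b in data]


def compressed_size(payload, level=-1):
    """Size in bytes after zlib compression; -1 means zlib's default level."""
    return len(zlib.compress(payload, level))


codes = [0] * 1000 + [1, -1] * 10
payload = pack_int8(codes)
raw_size = len(payload)
packed_size = compressed_size(payload)
roundtrip = unpack_int8(zlib.decompress(zlib.compress(payload)))
```

Real weight tensors compress far less than this toy payload, since quantized weights are much closer to random than a run of zeros.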
Evaluation
validity-safe eval path
parameters: null
non-overlapping final eval
parameters: null
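
The two evaluation entries above can be illustrated with a small sketch: non-overlapping windows score every validation token exactly once, and bits per byte converts a summed negative log-likelihood (in nats) into bits normalized by the byte count of the evaluated text. The summary does not spell out how `val_bpb` is computed, so the formula below is the conventional definition and the function names are assumptions.

```python
import math


def non_overlapping_windows(n_tokens, seq_len):
    """Return (start, end) spans that cover each token exactly once."""
    return [(s, min(s + seq_len, n_tokens)) for s in range(0, n_tokens, seq_len)]


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed NLL in nats to bits, then normalize by byte count."""
    return total_nll_nats / (math.log(2) * total_bytes)


windows = non_overlapping_windows(10, 4)
covered = sum(end - start for start, end in windows)
example_bpb = bits_per_byte(math.log(2) * 8, 8)
```

By contrast, an overlapping (sliding-window) evaluation gives each scored token more left context, so numbers from the two protocols are not directly comparable.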
Initialization
OvertoneInit
Training core rooted in the official SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit recipe.
LR Schedule
warmdown
parameters: {"warmdown_iters":800}
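
To make the `warmdown_iters` setting concrete, here is a minimal sketch of a warmdown schedule: the learning-rate multiplier stays at 1.0 and then decays over the final 800 iterations. Only the value 800 comes from the summary. The linear decay shape, the decay to zero, and the `total_iters` argument are assumptions for illustration.

```python
def lr_multiplier(step, total_iters, warmdown_iters=800):
    """Multiplier applied to the base learning rate at a given step.

    Constant at 1.0, then linear decay to 0.0 over the last
    `warmdown_iters` steps (assumed shape).
    """
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)


# Hypothetical 2000-step run: decay begins at step 1200.
schedule = [lr_multiplier(s, 2000) for s in (0, 1200, 1600, 2000)]
```

A shorter warmdown keeps the model at the full learning rate for longer, at the cost of a steeper final decay.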
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
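
The `grad_clip_norm` value can be read as a cap on the global L2 norm of all gradients, in the style of PyTorch's `torch.nn.utils.clip_grad_norm_`. The summary does not state which clipping variant the run used, so the pure-Python sketch below assumes global-norm clipping, and the function name is hypothetical.

```python
import math


def clip_grad_norm(grads, max_norm=0.3):
    """Rescale a list of gradient vectors so their joint L2 norm <= max_norm.

    Returns (clipped_grads, original_norm). Gradients already within the
    limit are returned unchanged.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads], total
    coef = max_norm / total
    return [[g * coef for g in vec] for vec in grads], total


clipped, norm_before = clip_grad_norm([[3.0], [4.0]])
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
small, _ = clip_grad_norm([[0.1], [0.1]])
```

Because every gradient is scaled by the same factor, clipping changes the step size but not the update direction.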
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: null
Other
other
Temperature-only post-quant search used after export.
parameters: null
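
A temperature-only search leaves the exported weights untouched and tunes a single scalar T that divides the logits before the softmax, keeping the T with the lowest held-out loss. The sketch below shows the idea on toy data. The candidate grid, the data used for the search, and all function names are assumptions, since the summary gives none of them.

```python
import math


def nll(logits, target, temperature):
    """Negative log-likelihood of `target` under softmax(logits / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[target]


def search_temperature(examples, candidates):
    """Pick the candidate temperature with the lowest mean NLL."""
    def mean_nll(t):
        return sum(nll(logits, y, t) for logits, y in examples) / len(examples)
    return min(candidates, key=mean_nll)


# Toy case: the model is equally confident on a hit and a miss,
# so softening the distribution (larger T) lowers the average loss.
overconfident = [([4.0, 0.0], 0), ([4.0, 0.0], 1)]
best_t = search_temperature(overconfident, [0.5, 1.0, 2.0])
```

Since only one scalar is fitted, the search is cheap and adds essentially nothing to the artifact size.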

Novel Contributions

  • Validity-safe merge rooted in the official SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit training core
  • Non-overlapping final evaluation
  • Stronger int8 export search
  • Temperature-only post-quant search
  • Saved RunPod local-disk parity run logs
  • Saved 1xH100 proxy-screen logs
  • Proxy ablations over learning-rate, warmdown, gradient clipping, Muon momentum, and token-embedding int8 settings
  • Identified warmdown800 + matrixlr006 as the strongest tested proxy combination