PR #284

open

Add non-record local A100 PR60-stack reproduction

by DanishjeetSingh
val_bpb: 1.4106
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11,124,153 bytes
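
For readers unfamiliar with the metric, the following is a minimal sketch of how a bits-per-byte figure is typically derived, assuming val_bpb denotes validation bits per byte computed from a summed cross-entropy in nats. The function name and arguments are illustrative and are not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) into bits per byte.

    total_nll_nats: sum of per-token negative log-likelihoods over the
                    validation text, in nats (hypothetical input).
    total_bytes:    number of raw bytes in that same validation text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by the size of the raw text rather than the token count.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers with different vocabulary sizes.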

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
Quantization
fp16
bits: 16
scope: tied embeddings
Architecture
tied embeddings
FP16 tied embedding export with tok_emb kept in fp16
parameters: null
weight tying
Tied embedding setup as part of the PR60-style stack
parameters: null
Transformer
10-layer transformer model
parameters: {"layers":10}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"decoupled_weight_decay":true}
Initialization
spectral init
Overtone spectral embedding initialization
resid mix
Phase-transition residual mixing
Regularization
weight decay
parameters: {"type":"decoupled"}
Compression
zlib
level: null
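
The fp16 export and zlib compression entries above both affect the reported artifact size. The sketch below shows, under stated assumptions, how half-precision serialization followed by zlib compression can be measured. The serialization layout, the helper names, and the use of the default zlib level are assumptions (the listing gives the level as null), and the weights are toy values, not data from the PR.

```python
import struct
import zlib


def pack_fp16(values):
    """Serialize floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def compressed_size(payload: bytes, level: int = -1) -> int:
    """Return the zlib-compressed size in bytes (level -1 is zlib's default)."""
    return len(zlib.compress(payload, level))


# Toy stand-in for an embedding table; values are chosen to be exactly
# representable in fp16 so the round trip below is lossless.
weights = [0.0] * 1024 + [0.5, -1.25, 3.0]
raw = pack_fp16(weights)

# Round trip: decompress and unpack to confirm nothing was lost.
restored = struct.unpack(f"<{len(weights)}e", zlib.decompress(zlib.compress(raw)))

print(len(raw), compressed_size(raw))
```

Trained weights compress far less than this mostly-zero toy payload, so the ratio seen here says nothing about the PR's actual artifact.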

Novel Contributions

  • Non-record local reproduction of the PR60-style stack on 1x A100 hardware
  • Sliding-window final evaluation with stride 64
  • FP16 tied embedding export with tok_emb kept in fp16
  • 10-layer transformer configuration
  • Decoupled Muon weight decay
  • Overtone spectral embedding initialization
  • Phase-transition residual mixing
  • Demonstrates a negative compute-scaling result under a strict 10-minute training cap
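
The sliding-window evaluation with stride 64 listed above can be sketched as a window schedule in which each token is scored exactly once while still seeing earlier context. This is a generic illustration of the technique, not the PR's code. The function name and the context length of 128 are hypothetical, and only the stride of 64 comes from the listing.

```python
def sliding_windows(n_tokens: int, context_len: int, stride: int):
    """Yield (begin, end, score_from) triples for sliding-window evaluation.

    The model is run on tokens [begin, end), but only tokens in
    [score_from, end) contribute to the loss, so no token is counted twice.
    """
    if not 0 < stride <= context_len:
        raise ValueError("stride must be in the range 1..context_len")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break


# Hypothetical sizes: 300 tokens, context length 128, stride 64.
windows = list(sliding_windows(300, 128, 64))
scored = sum(end - score_from for _, end, score_from in windows)
```

A smaller stride gives each scored token more left context at the cost of more forward passes, which is the usual trade-off when choosing this parameter.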