PR #284

open

Add non-record local A100 PR60-stack reproduction

by DanishjeetSingh

val_bpb

1.4106
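
Assuming val_bpb stands for validation bits per byte (the page does not expand the abbreviation), the figure could be derived from a summed cross-entropy loss as in the sketch below. The function name and its arguments are hypothetical and are not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # of the evaluated text normalises the result per byte.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalising by bytes rather than by tokens keeps the metric comparable across tokenizers with different vocabulary sizes.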

Architecture

Transformer

Optimizer

Muon

Artifact Size

11,124,153 bytes

Training Techniques

Evaluation

sliding window eval

parameters: {"stride":64}
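
A minimal sketch of how a sliding-window evaluation with stride 64 might lay out its windows. The window length of 128 in the usage line, the function name, and the rule of scoring only tokens not yet scored are assumptions for illustration; the PR's actual evaluation code is not shown on this page.

```python
def sliding_windows(n_tokens, window, stride=64):
    """Return (start, end, score_from) spans over a token sequence.

    The model would see tokens [start, end) as input, but only the tokens
    in [score_from, end) contribute to the loss, so each token is scored
    exactly once.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    spans = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


# Hypothetical sizes: 200 tokens, 128-token window, stride 64.
spans = sliding_windows(200, window=128, stride=64)
```

With these sizes, every token after the first window is scored with at least 64 tokens of preceding context (window minus stride), at the cost of more forward passes than a non-overlapping evaluation.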

Quantization

fp16

bits: 16

scope: tied embeddings

Architecture

tied embeddings

FP16 tied embedding export with tok_emb kept in fp16

parameters: null
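
To illustrate what storing a tensor at 16 bits per weight means for an export, the sketch below packs a flat list of values as IEEE 754 half-precision floats using only the Python standard library. The helper names are hypothetical, and this is not the PR's export routine.

```python
import struct


def to_fp16_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(blob):
    """Unpack a byte string produced by to_fp16_bytes back into floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


# Values that fp16 represents exactly survive the round trip unchanged;
# others, such as 0.1, are rounded to the nearest representable half.
blob = to_fp16_bytes([0.5, -1.25, 3.0])
restored = from_fp16_bytes(blob)
```

Each weight costs exactly two bytes before any later compression.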

weight tying

Tied embedding setup as part of the PR60-style stack

parameters: null
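
Weight tying in general means that the input embedding table and the output projection share one matrix. A minimal pure-Python sketch of that idea follows; the class and method names are hypothetical, and the PR's model code may differ.

```python
class TiedEmbedding:
    """One table used both to embed tokens and to produce output logits."""

    def __init__(self, table):
        # table[i] is the embedding vector for vocabulary entry i.
        self.table = table

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.table[token_id]

    def logits(self, hidden):
        """Output side: dot the hidden state with every row of the same table."""
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


emb = TiedEmbedding([[1.0, 0.0], [0.0, 2.0]])
scores = emb.logits(emb.embed(1))
```

Because both directions read the same rows, the table is stored once, so its precision affects both the input and the output of the model.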

Transformer

10-layer transformer model

parameters: {"layers":10}

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"decoupled_weight_decay":true}

Initialization

spectral init

Overtone spectral embedding initialization

resid mix

Phase-transition residual mixing

Regularization

weight decay

parameters: {"type":"decoupled"}
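
Decoupled weight decay shrinks the parameters directly instead of adding a penalty term to the gradient. The sketch below shows that general pattern on flat lists. The update direction is passed in as-is (in Muon it would be the orthogonalized momentum, which is not implemented here), and the function name and hyperparameter values are assumptions for illustration.

```python
def decoupled_step(params, update, lr, weight_decay):
    """Apply one optimizer step with decoupled weight decay.

    params and update are equal-length lists of floats. The decay
    multiplies the parameters and never passes through the update rule.
    """
    decay = 1.0 - lr * weight_decay
    return [decay * p - lr * u for p, u in zip(params, update)]


# Hypothetical values: lr=0.1, weight_decay=0.5.
new_params = decoupled_step([1.0, -2.0], [0.0, 1.0], lr=0.1, weight_decay=0.5)
```

With a zero update, the first parameter is scaled by 1 - 0.1 * 0.5 = 0.95, which shows that the decay acts independently of the gradient signal.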

Compression

zlib

level: null
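
Since the compression level is not recorded here, the sketch below uses zlib's default level (-1) to show how a serialized artifact could be compressed and its size measured. The helper name and the dummy payload are assumptions; the PR's actual serialization format is not shown on this page.

```python
import zlib


def compress_artifact(payload: bytes, level: int = -1) -> bytes:
    """Compress a serialized artifact with zlib; -1 selects the default level."""
    return zlib.compress(payload, level)


# Dummy payload of zero bytes, standing in for serialized weights.
payload = bytes(4096)
blob = compress_artifact(payload)
artifact_size = len(blob)  # the kind of byte count a size limit would check
```

Real weight data compresses far less than this all-zero payload, so the example only demonstrates the calls involved.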

Non-record local reproduction of the PR60-style stack on 1x A100 hardware
Sliding-window final evaluation with stride 64
FP16 tied embedding export with tok_emb kept in fp16
10-layer transformer configuration
Decoupled Muon weight decay
Overtone spectral embedding initialization
Phase-transition residual mixing
Demonstrates a compute-scaling negative result under a strict 10-minute train cap
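
One plausible reading of the last item is that, under a fixed wall-clock budget, slower hardware completes fewer optimizer steps. A minimal sketch of a time-capped training loop, with a hypothetical `step_fn` callback and an injectable clock so the cap can be tested without waiting:

```python
import time


def train_with_cap(step_fn, max_seconds, clock=time.monotonic):
    """Call step_fn(step_index) repeatedly until the time budget is used up.

    Returns the number of steps that started before the deadline.
    """
    deadline = clock() + max_seconds
    steps = 0
    while clock() < deadline:
        step_fn(steps)
        steps += 1
    return steps


# A 10-minute cap would be max_seconds=600. Here a fake clock that
# advances one unit per call stands in for real time.
ticks = iter(range(100))
completed = train_with_cap(lambda step: None, 3, clock=lambda: next(ticks))
```

The loop checks the clock only between steps, so a step that starts just before the deadline can run past it. A stricter cap would need a check inside the step itself.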