PR #1063

open

Add Compiled LeakyReLU2 + Slide64 Eval non-record submission

by SHN2004
val_bpb: 1.3321
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.86 MB

Training Techniques

Architecture
LeakyReLU
Uses leakyrelu2 MLP activation in a 9-layer Transformer baseline.
parameters: {"layers":9,"width":512,"heads":8,"kv_heads":4,"mlp_mult":2}
weight tying
Tied embeddings are enabled.
parameters: null
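The submission names a leakyrelu2 MLP activation but does not define it. A minimal sketch, assuming "leakyrelu2" means squaring the output of a standard leaky ReLU (analogous to the ReLU² activation used in other speedrun baselines); the function names and the default alpha are illustrative, not from the PR:

```python
def leaky_relu(x, alpha=0.01):
    # standard leaky ReLU: pass positives through, scale negatives by alpha
    return x if x > 0 else alpha * x

def leakyrelu2(x, alpha=0.01):
    # assumed form of "leakyrelu2": square the leaky-ReLU output,
    # by analogy with the squared-ReLU activation; the PR does not
    # spell out the exact definition
    y = leaky_relu(x, alpha)
    return y * y
```

In a real MLP block this would be applied elementwise between the two projection layers.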
Evaluation
sliding window eval
parameters: {"stride":64}
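The stride-64 sliding-window evaluation can be sketched as follows: the model sees a full-length window, but loss is computed only on the tokens not already scored, so each scored token gets up to window-length left context instead of starting cold at a chunk boundary. The helper name and the span-enumeration style are illustrative assumptions, not the PR's code:

```python
def sliding_window_spans(n_tokens, max_length=1024, stride=64):
    # Enumerate (begin, end, n_scored) spans: the model is fed
    # tokens [begin, end), but loss is taken only over the last
    # n_scored tokens of each span, so no token is scored twice.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With stride 64, each span after the first scores only 64 new tokens while conditioning on up to 960 tokens of prior context, which is the "richer left-context scoring" credited for the bpb improvement.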
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: {"adam_weight_decay":0,"embed_lr":0.05,"head_lr":0}
LR Schedule
warmdown
parameters: {"warmdown_iters":450}
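The schedule only specifies warmdown_iters=450. A minimal sketch, assuming the common trapezoidal reading of "warmdown": hold the base learning rate, then decay linearly to zero over the final warmdown_iters steps (the function name and total_iters parameter are illustrative):

```python
def warmdown_lr(step, total_iters, warmdown_iters=450, base_lr=1.0):
    # hold base_lr until the warmdown phase, then decay linearly
    # to zero over the last warmdown_iters steps
    if step < total_iters - warmdown_iters:
        return base_lr
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters
```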
Regularization
weight decay
parameters: {"muon_weight_decay":0,"adam_weight_decay":0}
Compression
zlib
level: null
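The artifact size is reported after zlib compression; the level is unspecified (null). A sketch of how serialized weights might be compressed, assuming the library-default level 6 since the PR leaves it unset:

```python
import zlib

def compress_artifact(raw: bytes, level: int = 6) -> bytes:
    # compress serialized model weights with zlib; level 6 is the
    # zlib default and an assumption here, since the PR lists
    # level: null
    return zlib.compress(raw, level)
```

Decompression with `zlib.decompress` recovers the original bytes exactly, so the reported 14.86 MB reflects the compressed checkpoint, not the in-memory parameter count.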

Novel Contributions

  • Compiled LeakyReLU2 baseline with torch.compile enabled
  • Sliding-window evaluation with stride 64 instead of flat, non-overlapping chunk evaluation
  • Demonstrated a clear validation bpb improvement from richer left-context scoring on the same 600-second training run
  • Documented a non-record single-GPU confirmation run and accompanying sweep context