val_bpb: 1.3321
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.86 MB
Training Techniques
Architecture
LeakyReLU2
Uses the leakyrelu2 MLP activation in a 9-layer Transformer baseline.
parameters: {"layers":9,"width":512,"heads":8,"kv_heads":4,"mlp_mult":2}
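The record names the activation but does not define it; a minimal sketch, assuming leakyrelu2 means a leaky ReLU followed by squaring (analogous to the ReLU² activation used in similar speedrun baselines), applied inside an MLP block with mlp_mult=2 (hidden width twice the model width of 512):

```python
import numpy as np

def leakyrelu2(x, negative_slope=0.01):
    # Assumed form: leaky ReLU, then elementwise square.
    y = np.where(x > 0, x, negative_slope * x)
    return y * y

def mlp_block(x, w_in, w_out):
    # Width 512 with mlp_mult=2 gives a 512 -> 1024 -> 512 block,
    # per the parameters above. Weights here are illustrative only.
    return leakyrelu2(x @ w_in) @ w_out
```

With negative_slope=0.01, an input of 2.0 maps to 4.0 and -1.0 maps to 1e-4, so negative inputs still contribute a small positive signal.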
weight tying
Tied embeddings are enabled.
parameters: null
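Tied embeddings mean the output head reuses the input embedding matrix (transposed) rather than learning a separate projection, which also helps keep the artifact small. A minimal sketch (class and dimensions are illustrative, not taken from the record):

```python
import numpy as np

class TiedLM:
    """Sketch of weight tying: one matrix serves as both the
    token embedding and (transposed) the output head."""

    def __init__(self, vocab_size, width, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(0.0, 0.02, (vocab_size, width))

    def embed_tokens(self, ids):
        # (seq,) token ids -> (seq, width) vectors
        return self.embed[ids]

    def logits(self, hidden):
        # Tied head: reuse self.embed instead of a separate matrix.
        return hidden @ self.embed.T
```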
Evaluation
sliding window eval
parameters: {"stride":64}
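Sliding-window evaluation advances the context window by the stride and scores only the tokens not covered by the previous window, so each scored token gets (up to) a full window of left context instead of the truncated context it would get at the start of a flat chunk. A sketch of the window bookkeeping, assuming the eval window matches the train length of 1024:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, n_scored) spans for sliding-window eval.

    Each window covers tokens [begin, end); only the last n_scored
    tokens (those beyond the previous window's end) are scored.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, but the model runs forward on far more positions than a flat chunking would, which is the cost of the richer left context.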
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: {"adam_weight_decay":0,"embed_lr":0.05,"head_lr":0}
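Muon's defining step is orthogonalizing the momentum-smoothed gradient of each 2D weight matrix with a quintic Newton-Schulz iteration before applying it. A sketch of that core step, with coefficients from the reference Muon implementation (the full optimizer, momentum buffers, and the separate Adam path for embeddings and head are omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the nearest semi-orthogonal matrix to G via a
    quintic Newton-Schulz iteration (the core of Muon). Singular
    values are driven toward ~1 rather than matched exactly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a few iterations all singular values land near 1, so the update has uniform "strength" in every direction regardless of the raw gradient's conditioning.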
LR Schedule
warmdown
parameters: {"warmdown_iters":450}
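A warmdown schedule holds the learning rate constant and then ramps it linearly to zero over the final warmdown_iters steps. A sketch of the multiplier (the total iteration count below is a hypothetical example; only warmdown_iters=450 comes from the record):

```python
def lr_scale(step, total_iters, warmdown_iters=450):
    """LR multiplier: 1.0 until the warmdown begins, then a linear
    ramp from 1.0 down to 0.0 over the last warmdown_iters steps."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return (total_iters - step) / warmdown_iters
```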
Regularization
weight decay
parameters: {"muon_weight_decay":0,"adam_weight_decay":0}
Compression
zlib
level: null
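The artifact is stored zlib-compressed; the record leaves the compression level unspecified (null). A sketch using Python's standard zlib module, with level 9 as an assumed default:

```python
import zlib

def compress_artifact(raw: bytes, level: int = 9) -> bytes:
    # Level is an assumption; the record does not specify it.
    return zlib.compress(raw, level)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```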
Novel Contributions
- Compiled LeakyReLU2 baseline with torch.compile enabled
- Sliding-window evaluation with stride 64 instead of flat chunk evaluation
- Demonstrated a clear validation bpb improvement from richer left-context scoring on the same 600-second training run
- Documented a non-record single-GPU confirmation run and accompanying sweep context