PR #1282

open

Slot Machine — 1.10350531 val_bpb (seed 444)

by newjordan
val_bpb
1.1035
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,536,878 B

Training Techniques

Architecture
XSA
11-layer XSA-all architecture used as the base model
parameters: {"layers":11}
weight tying
Standard embedding/lm_head tying
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int6
bits: 6
scope: naive
Compression
brotli
level: 11
Other
other
byte-shuffle compression with stride=2
parameters: {"stride":2}
other
custom context-only SLOT test-time optimization
parameters: {"steps":8}
Evaluation
sliding window eval
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
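
For readers unfamiliar with the term, a warmdown schedule can be sketched as below. This is only an illustration, not the PR's code. It assumes the learning rate stays constant and then decays linearly to zero over the final `warmdown_steps` steps. The function name `warmdown_lr`, the `total_steps` and `base_lr` arguments, and the linear shape are all assumptions; only `warmdown_steps = 3500` comes from the listing above.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Hypothetical warmdown: constant LR, then linear decay to zero.

    Only warmdown_steps=3500 is taken from the PR summary; the linear
    shape and every other argument are assumptions for illustration.
    """
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        # Before the warmdown window: keep the base learning rate.
        return base_lr
    # Inside the window: scale by the fraction of steps still remaining.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / warmdown_steps
```

With a hypothetical 10,000-step run, the rate would stay at `base_lr` through step 6,500 and reach zero at step 10,000.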

Novel Contributions

  • brotli-11 compression with byte-shuffle stride=2 to reduce model size
  • custom context-only SLOT 8-step test-time optimization
  • 11-layer XSA-all Rascal II training setup with parallel Muon and coprime loader
  • naive int6 quantization
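
The two size-reduction items in the list above (byte-shuffle with stride=2 and naive int6 quantization) could look roughly like the following sketch. It is not taken from the PR. It assumes "byte-shuffle" means grouping bytes by their position modulo the stride before compression, and that "naive" int6 means one symmetric scale per tensor with values clamped to [-31, 31]. All function names are hypothetical, and the brotli level-11 compression step itself is left out because brotli is not in the Python standard library.

```python
def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group bytes by index modulo `stride` (assumed meaning of byte-shuffle).

    With stride=2, all even-indexed bytes come first, then all odd-indexed
    bytes, which tends to place similar bytes next to each other.
    """
    return b"".join(data[i::stride] for i in range(stride))


def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Invert byte_shuffle for any input length."""
    n = len(data)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        # Number of original indices i, i+stride, i+2*stride, ... below n.
        plane_len = len(range(i, n, stride))
        out[i::stride] = data[pos:pos + plane_len]
        pos += plane_len
    return bytes(out)


def quantize_int6(weights):
    """Naive symmetric quantization: one scale for the whole tensor (assumed)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the symmetric 6-bit range.
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int6(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]
```

In a pipeline of this kind, the shuffled bytes would be handed to the compressor, and `byte_unshuffle` would run after decompression at load time. Which tensors the PR actually shuffles, and how it packs 6-bit values into bytes, is not stated in the summary.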