val_bpb: 1.1035
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,536,878 B
Training Techniques
Architecture
- XSA: 11-layer XSA-all architecture used as the base model (parameters: {"layers":11})
- weight tying: standard embedding/lm_head tying (parameters: null)
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Quantization
- int6 (bits: 6, scope: naive)
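Since the scope is listed as "naive", the quantizer is presumably symmetric with a single per-tensor scale. A minimal sketch under that assumption (the function names and the per-tensor granularity are illustrative, not taken from the submission):

```python
def quantize_int6(weights):
    """Naive symmetric per-tensor quantization to 6-bit integers.

    A signed 6-bit value spans [-32, 31]; one scale maps the
    largest-magnitude weight onto 31, and every weight is rounded
    to the nearest representable step.
    """
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid scale == 0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the 6-bit codes."""
    return [v * scale for v in q]
```

The maximum round-trip error of this scheme is half a quantization step, i.e. scale / 2; packing four 6-bit codes into three bytes (not shown) would realize the storage saving.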
Compression
- brotli (level: 11)
Other
- byte-shuffle compression with stride=2 (parameters: {"stride":2})
- custom context-only SLOT test-time optimization (parameters: {"steps":8})
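Byte-shuffling before compression is a standard transform: with stride=2, the even and odd bytes of the serialized weights (e.g., the low and high bytes of 16-bit values) are grouped into two homogeneous runs that a general-purpose codec models better. A minimal sketch, using stdlib zlib as a stand-in for the submission's brotli at quality 11:

```python
import zlib  # stand-in codec; the submission pairs this transform with brotli-11

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together so same-position bytes of
    each stride-byte group form contiguous runs."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Invert byte_shuffle, handling lengths not divisible by stride."""
    n, rem = divmod(len(data), stride)
    lanes, pos = [], 0
    for i in range(stride):
        size = n + (1 if i < rem else 0)  # lane i holds bytes i, i+stride, ...
        lanes.append(data[pos:pos + size])
        pos += size
    out = bytearray()
    for j in range(len(lanes[0])):
        for lane in lanes:
            if j < len(lane):
                out.append(lane[j])
    return bytes(out)

# round-trip: shuffle -> compress -> decompress -> unshuffle
payload = bytes(range(256)) * 8
packed = zlib.compress(byte_shuffle(payload, 2), 9)
assert byte_unshuffle(zlib.decompress(packed), 2) == payload
```

The transform is lossless and order-reversible, so it changes only the compressed size, never the decoded weights.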
Evaluation
- sliding window eval (parameters: null)
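The window and stride are not recorded (parameters: null). The usual sliding-window scheme scores every token exactly once while giving tokens after the first window overlapping context; a sketch of just the index bookkeeping, with window and stride as assumed parameters:

```python
def sliding_windows(seq_len: int, window: int, stride: int):
    """Yield (start, end, score_from) triples: run the model on tokens
    [start, end) but accumulate loss only over [score_from, end), so
    each token is scored once and tokens beyond the first window keep
    at least window - stride tokens of prior context."""
    scored = 0  # index of the first not-yet-scored token
    for start in range(0, seq_len, stride):
        end = min(start + window, seq_len)
        yield start, end, scored
        scored = end
        if end == seq_len:
            break
```

For example, seq_len=10, window=4, stride=2 yields (0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8): the scored spans tile [0, 10) with no gaps or overlaps.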
LR Schedule
- warmdown (parameters: {"warmdown_steps":3500})
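A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps steps. Only warmdown_steps=3500 comes from the submission; total_steps and base_lr below are illustrative placeholders:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to zero over the last
    warmdown_steps steps (the schedule the submission calls warmdown)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 -> 0.0 over the tail
    return base_lr * frac
```

With total_steps=10000 and base_lr=0.02, the LR stays at 0.02 until step 6500, reaches 0.01 at step 8250, and hits 0.0 at step 10000.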
Novel Contributions
- brotli-11 compression with byte-shuffle stride=2 to reduce model size
- custom context-only SLOT 8-step test-time optimization
- 11-layer XSA-all Rascal II training setup with parallel Muon and coprime loader
- naive int6 quantization