val_bpb: 1.1099
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.5 MB
Training Techniques
Architecture
- XSA: XSA-all attention/sequence architecture variant (parameters: null)
- BigramHash: Bigram2048 token hashing/embedding component (parameters: {"dimensions": 2048}); see the sketch after this list
- RoPE: rotary positional embeddings with reduced dimension (parameters: {"dimensions": 16}); see the sketch after this list
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null); see the sketch below
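None of the optimizer hyperparameters are recorded, so this sketch shows the core Muon update from the public reference implementation: an SGD-momentum direction orthogonalized per weight matrix with a quintic Newton-Schulz iteration. The scheduling that makes it "Parallel" Muon (spreading the per-matrix orthogonalizations across devices) is not shown, and the lr/momentum values below are common defaults rather than the submission's settings.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate G -> U V^T (G's nearest semi-orthogonal matrix) with a
    quintic Newton-Schulz iteration; coefficients follow the public Muon
    reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)          # scale so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

def muon_step(param: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```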
Weight Averaging
- SWA (stochastic weight averaging; parameters: null); see the sketch below
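No averaging schedule is recorded, so this just shows the standard PyTorch route to stochastic weight averaging; the model/loader names and the swa_start/swa_every schedule are placeholders, not the submission's values.

```python
import torch
from torch.optim.swa_utils import AveragedModel

# model, optimizer, and train_loader are assumed to exist elsewhere.
swa_model = AveragedModel(model)   # maintains an equal-weight running average
swa_start, swa_every = 8000, 200   # placeholder schedule, not from the summary

for step, batch in enumerate(train_loader):
    loss = model(batch)             # assumed to return a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= swa_start and step % swa_every == 0:
        swa_model.update_parameters(model)   # fold current weights into the average

# swa_model.module holds the averaged weights for evaluation/export.
```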
Quantization
- Late QAT (quantization-aware training enabled late in the run): 6-bit, scoped to the embeddings and 5 layers; see the sketch below
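"Naive int6" reads as plain symmetric fake quantization with a straight-through estimator, applied to the embedding table and 5 layers near the end of training. A minimal sketch under that reading (the per-tensor scale and the STE are assumptions):

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Naive symmetric 6-bit fake quantization with a straight-through
    estimator: the forward pass sees the quantized weights, while the
    backward pass treats the rounding as identity."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed int6
    scale = w.abs().max() / qmax + 1e-12         # single per-tensor scale
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    w_q = q * scale
    return w + (w_q - w).detach()                # straight-through trick
```

During the late-QAT phase the affected weights would be routed through fake_quant_int6 on every forward pass, so the network adapts to the 6-bit grid before export.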
Compression
- zstd (level: null); see the sketch below
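The compression level is not recorded (level: null), so this sketch of compressing a serialized checkpoint with zstd picks an arbitrary high level; the filename and the use of the zstandard Python binding are assumptions.

```python
import io
import torch
import zstandard  # pip install zstandard

buf = io.BytesIO()
torch.save(model.state_dict(), buf)   # model assumed to exist
raw = buf.getvalue()

# level=19 is illustrative; the summary does not record the actual level.
compressed = zstandard.ZstdCompressor(level=19).compress(raw)
with open("submission.pt.zst", "wb") as f:
    f.write(compressed)
```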
Evaluation
- Sliding window eval (parameters: null); see the sketch below
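Window and stride are not recorded, so this is a generic sliding-window bits-per-byte evaluation: each window re-feeds overlapping context but scores only the tokens past the previous window's end, so every scored token sees substantial left context. The HF-style model(input_ids, labels=...) interface and the window/stride defaults are assumptions.

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, ids: torch.Tensor, n_bytes: int,
                       window: int = 1024, stride: int = 512) -> float:
    """Bits-per-byte over a long token sequence `ids` (shape 1 x seq_len).
    Assumes an HF-style model whose forward accepts `labels` and returns
    an object with a mean `.loss` (ignoring positions labeled -100)."""
    seq_len = ids.size(1)
    total_nll, prev_end = 0.0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        target_len = end - prev_end          # only score tokens not yet seen
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-target_len] = -100       # mask the overlap from the loss
        out = model(input_ids, labels=labels)
        total_nll += out.loss.item() * target_len  # de-average (approximate)
        prev_end = end
        if end == seq_len:
            break
    return total_nll / (n_bytes * math.log(2))  # nats per byte -> bits per byte
```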
Other
- Parallel Muon optimizer with coprime loader and GPU mixer prefill for fast startup (parameters: {"coprime_loader": true, "gpu_prefill": true}); see the sketch below
Novel Contributions
- XSA-all architecture variant
- Parallel Muon optimization
- Coprime loader
- Bigram2048 token-hashing embedding component
- RoPE16 positional embeddings (rotary dimension 16)
- Stochastic weight averaging (SWA)
- Late QAT with naive int6 quantization of the embeddings and 5 layers
- zstd-compressed submission