PR #1585

open

Record: Casefold Tokenizer + Parallel Residuals + Systems Optimization — val_bpb 1.0639 (3-seed mean)

by codemath3000
val_bpb
1.0639
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Architecture
weight tying
Tied embeddings in the architecture.
parameters: null
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
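A minimal scalar sketch of this activation, assuming "LeakyReLU squared" means squaring the LeakyReLU output (how the negative branch's sign is handled is an assumption; only the slope of 0.5 comes from the PR):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with slope 0.5 on the negative branch, then squared.
    # Whether the negative branch keeps its sign after squaring is an
    # assumption; this version squares plainly, so outputs are >= 0.
    y = x if x >= 0.0 else slope * x
    return y * y
```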
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"partial_ratio":"16/64"}
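A sketch of partial RoPE under the stated 16/64 ratio: only the first 16 of 64 head dimensions get rotated, the rest pass through. The frequency formula is the standard RoPE one, assumed rather than taken from the PR:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of a 64-dim head vector;
    the remaining dims pass through unchanged (partial_ratio = 16/64)."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Standard RoPE frequency for the (i//2)-th rotation pair.
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```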
depth recurrence
Loops layers 3-5 with recurrence activated partway through training.
parameters: {"layers":[3,5],"activated_at_frac":0.35}
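The depth-recurrence entry can be sketched as a forward pass that repeats the layer-3-to-5 block once recurrence is switched on (at 35% of training). The loop count of 2 is an assumption; the PR only states the span:

```python
def forward_with_recurrence(x, layers, span=(3, 5), loops=2, active=True):
    """Run a layer stack, repeating layers span[0]..span[1] (inclusive)
    `loops` times once recurrence is active."""
    lo, hi = span
    for f in layers[:lo]:
        x = f(x)
    for _ in range(loops if active else 1):
        for f in layers[lo:hi + 1]:
            x = f(x)
    for f in layers[hi + 1:]:
        x = f(x)
    return x
```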
Gated Attention
Dual-lane parallel residual structure with gated branch merging, starting at layer 8.
parameters: {"start_layer":8}
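One plausible reading of the dual-lane block, heavily hedged: the card does not spell out the merge rule, so the scalar gate below stands in for whatever learned gating the PR uses:

```python
def parallel_residual_block(x, attn, mlp, gate=1.0):
    """Parallel (rather than sequential) residual block: the attention and
    MLP lanes read the same input and both add into the residual stream.
    The scalar `gate` is an assumption standing in for learned gating."""
    return x + gate * attn(x) + mlp(x)
```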
BigramHash
Uses a learned hash embedding during TTT.
parameters: {"dimensions":16384}
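A sketch of the hash-embedding lookup: a token bigram is hashed into one of 16384 learned-embedding rows. Only the table size comes from the PR; the mixing hash itself is an assumption:

```python
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 16384) -> int:
    """Map a token bigram to one of n_buckets embedding rows via a cheap
    multiplier-based mixing hash (the hash function is an assumption)."""
    return ((prev_tok * 1000003) ^ tok) % n_buckets
```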
Regularization
logit softcap
Caps logits smoothly to limit their magnitude.
parameters: {"value":30}
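A sketch of a tanh-style softcap with the stated value of 30, assuming the common formulation (the card gives only the cap value):

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Smoothly bounds logits to (-cap, cap) while staying near-identity
    # for small values, instead of hard clipping.
    return cap * math.tanh(logit / cap)
```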
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"sharded_reduce_scatter_all_gather":true,"newton_schulz_steps":5}
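The Newton-Schulz step (5 iterations, per other_params) can be sketched as below. Coefficients are the ones from the public Muon reference implementation, an assumption about this PR's fused kernel, which additionally folds in the momentum update, Nesterov extrapolation, and row normalization:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration driving a matrix's singular values
    toward 1, as used by Muon to orthogonalize the momentum update."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```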
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
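One EMA step with the stated decay of 0.997, sketched with NumPy arrays standing in for tensors. In the PR this is batched with torch._foreach_* ops (one fused call per tensor list rather than a Python-level per-tensor loop of kernels):

```python
import numpy as np

def ema_update(avg_params, params, decay=0.997):
    """In-place EMA step over all parameter tensors:
    avg <- decay * avg + (1 - decay) * param."""
    for avg, p in zip(avg_params, params):
        avg *= decay
        avg += (1.0 - decay) * p
```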
LR Schedule
warmdown
parameters: {"frac":0.72}
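A sketch of the warmdown schedule, reading frac=0.72 as the fraction of training spent on a linear decay to zero (that interpretation is an assumption):

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Constant LR multiplier, then linear decay to zero over the final
    `warmdown_frac` of training."""
    start = (1.0 - warmdown_frac) * total_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - start))
```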
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9,"gradient_clipping":1}
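The score-first loop can be sketched as follows: each chunk (~32k tokens per the parameters) is scored before the model adapts to it, so every chunk is evaluated with weights that have never seen it, and only then does the model take its epochs over that chunk. The score/train_step callables are stand-ins for the real evaluation and optimizer step:

```python
def score_first_ttt(chunks, score, train_step, epochs_per_chunk=3):
    """Score-first chunked test-time training. `score` returns
    (mean_loss, n_tokens) for a chunk; `train_step` adapts the model."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score(chunk)      # evaluate first, with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs_per_chunk):
            train_step(chunk)       # then adapt on the same chunk
    return total_loss / total_tokens
```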

Novel Contributions

  • Systems-level optimizations that increase throughput without changing the ML behavior
  • Fused Muon kernel combining momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization
  • Batched EMA using foreach operations
  • Reusable NumPy buffer preallocation in the data loader
  • Casefold v2 tokenizer with NFKC + lowercased retraining and verified byte counting
  • Parallel residual architecture with dual-lane residuals
  • Score-first chunk-based TTT with hash embedding
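The casefold tokenizer bullet above can be sketched as a preprocessing step applied before (re)training the tokenizer: NFKC normalization followed by lowercasing. Using Python's `casefold` for the lowercasing step is an assumption about the exact pipeline:

```python
import unicodedata

def casefold_normalize(text: str) -> str:
    """Casefold-v2-style preprocessing: NFKC-normalize, then casefold,
    so the tokenizer is trained on a canonical lowercase corpus."""
    return unicodedata.normalize("NFKC", text).casefold()
```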