PR #1585

open

Record: Casefold Tokenizer + Parallel Residuals + Systems Optimization — val_bpb 1.0639 (3-seed mean)

by codemath3000
val_bpb
1.0639
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Architecture
weight tying
Tied embeddings in the architecture.
parameters: null
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
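A minimal scalar sketch of this activation, assuming "LeakyReLU squared" means squaring the LeakyReLU output (how the negative branch's sign is handled is an assumption; only the slope of 0.5 comes from the PR):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with slope 0.5 on the negative branch, then squared.
    # Whether the negative branch keeps its sign after squaring is an
    # assumption; this version squares plainly, so outputs are >= 0.
    y = x if x >= 0.0 else slope * x
    return y * y
```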
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"partial_ratio":"16/64"}
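A sketch of partial RoPE under the stated 16/64 ratio: only the first 16 of 64 head dimensions get rotated, the rest pass through. The frequency formula is the standard RoPE one, assumed rather than taken from the PR:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of a 64-dim head vector;
    the remaining dims pass through unchanged (partial_ratio = 16/64)."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Standard RoPE frequency for the (i//2)-th rotation pair.
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```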
depth recurrence
Loops layers 3-5 with recurrence activated partway through training.
parameters: {"layers":[3,5],"activated_at_frac":0.35}
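The depth-recurrence entry can be sketched as a forward pass that repeats the layer-3-to-5 block once recurrence is switched on (at 35% of training). The loop count of 2 is an assumption; the PR only states the span:

```python
def forward_with_recurrence(x, layers, span=(3, 5), loops=2, active=True):
    """Run a layer stack, repeating layers span[0]..span[1] (inclusive)
    `loops` times once recurrence is active."""
    lo, hi = span
    for f in layers[:lo]:
        x = f(x)
    for _ in range(loops if active else 1):
        for f in layers[lo:hi + 1]:
            x = f(x)
    for f in layers[hi + 1:]:
        x = f(x)
    return x
```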
Gated Attention
Dual-lane parallel residual structure with gated branch merging, starting at layer 8.
parameters: {"start_layer":8}
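One plausible reading of the dual-lane block, heavily hedged: the card does not spell out the merge rule, so the scalar gate below stands in for whatever learned gating the PR uses:

```python
def parallel_residual_block(x, attn, mlp, gate=1.0):
    """Parallel (rather than sequential) residual block: the attention and
    MLP lanes read the same input and both add into the residual stream.
    The scalar `gate` is an assumption standing in for learned gating."""
    return x + gate * attn(x) + mlp(x)
```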
BigramHash
Uses a learned hash embedding during TTT.
parameters: {"dimensions":16384}
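A sketch of the hash-embedding lookup: a token bigram is hashed into one of 16384 learned-embedding rows. Only the table size comes from the PR; the mixing hash itself is an assumption:

```python
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 16384) -> int:
    """Map a token bigram to one of n_buckets embedding rows via a cheap
    multiplier-based mixing hash (the hash function is an assumption)."""
    return ((prev_tok * 1000003) ^ tok) % n_buckets
```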
Regularization
logit softcap
Caps logits smoothly to limit their magnitude.
parameters: {"value":30}
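A sketch of a tanh-style softcap with the stated value of 30, assuming the common formulation (the card gives only the cap value):

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Smoothly bounds logits to (-cap, cap) while staying near-identity
    # for small values, instead of hard clipping.
    return cap * math.tanh(logit / cap)
```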
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"sharded_reduce_scatter_all_gather":true,"newton_schulz_steps":5}
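The Newton-Schulz step (5 iterations, per other_params) can be sketched as below. Coefficients are the ones from the public Muon reference implementation, an assumption about this PR's fused kernel, which additionally folds in the momentum update, Nesterov extrapolation, and row normalization:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration driving a matrix's singular values
    toward 1, as used by Muon to orthogonalize the momentum update."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```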
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
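One EMA step with the stated decay of 0.997, sketched with NumPy arrays standing in for tensors. In the PR this is batched with torch._foreach_* ops (one fused call per tensor list rather than a Python-level per-tensor loop of kernels):

```python
import numpy as np

def ema_update(avg_params, params, decay=0.997):
    """In-place EMA step over all parameter tensors:
    avg <- decay * avg + (1 - decay) * param."""
    for avg, p in zip(avg_params, params):
        avg *= decay
        avg += (1.0 - decay) * p
```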
LR Schedule
warmdown
parameters: {"frac":0.72}
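A sketch of the warmdown schedule, reading frac=0.72 as the fraction of training spent on a linear decay to zero (that interpretation is an assumption):

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """Constant LR multiplier, then linear decay to zero over the final
    `warmdown_frac` of training."""
    start = (1.0 - warmdown_frac) * total_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - start))
```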
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9,"gradient_clipping":1}
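The score-first loop can be sketched as follows: each chunk (~32k tokens per the parameters) is scored before the model adapts to it, so every chunk is evaluated with weights that have never seen it, and only then does the model take its epochs over that chunk. The score/train_step callables are stand-ins for the real evaluation and optimizer step:

```python
def score_first_ttt(chunks, score, train_step, epochs_per_chunk=3):
    """Score-first chunked test-time training. `score` returns
    (mean_loss, n_tokens) for a chunk; `train_step` adapts the model."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score(chunk)      # evaluate first, with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs_per_chunk):
            train_step(chunk)       # then adapt on the same chunk
    return total_loss / total_tokens
```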

Novel Contributions

  • Systems-level optimizations that increase throughput without changing the ML behavior
  • Fused Muon kernel combining momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization
  • Batched EMA using foreach operations
  • Reusable NumPy buffer preallocation in the data loader
  • Casefold v2 tokenizer with NFKC + lowercased retraining and verified byte counting
  • Parallel residual architecture with dual-lane residuals
  • Score-first chunk-based TTT with hash embedding
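The casefold tokenizer bullet above can be sketched as a preprocessing step applied before (re)training the tokenizer: NFKC normalization followed by lowercasing. Using Python's `casefold` for the lowercasing step is an assumption about the exact pipeline:

```python
import unicodedata

def casefold_normalize(text: str) -> str:
    """Casefold-v2-style preprocessing: NFKC-normalize, then casefold,
    so the tokenizer is trained on a canonical lowercase corpus."""
    return unicodedata.normalize("NFKC", text).casefold()
```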