PR #1585
openRecord: Casefold Tokenizer + Parallel Residuals + Systems Optimization — val_bpb 1.0639 (3-seed mean)
by codemath3000
val_bpb
1.0639
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB
Training Techniques
Architecture
weight tying
Input and output embedding matrices are tied.
parameters: null
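A minimal sketch of weight tying (class and names are illustrative, not from the PR): one matrix serves as both the input embedding and the output (unembedding) projection.

```python
import numpy as np

class TiedEmbeddingLM:
    """Toy model showing weight tying: the same matrix is used for
    input embedding lookup and for the output logit projection."""

    def __init__(self, vocab_size: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(0.0, 0.02, (vocab_size, dim))

    def embed(self, token_ids):
        return self.emb[np.asarray(token_ids)]

    def logits(self, hidden):
        # Same matrix, transposed -- no separate output projection to train.
        return hidden @ self.emb.T
```

This halves the embedding parameter count, which matters for a small (~16 MB) artifact.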
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
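A minimal NumPy sketch, assuming the activation is the plain square of the LeakyReLU output (whether the PR squares directly or uses a sign-preserving variant is not stated in the record):

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    # LeakyReLU with the record's slope of 0.5, then squared elementwise.
    # The plain-square convention here is an assumption.
    y = np.where(x >= 0, x, slope * x)
    return y * y
```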
Partial RoPE
Applies rotary positional embeddings to only a fraction (16/64) of each head's dimensions.
parameters: {"partial_ratio":"16/64"}
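A NumPy sketch of partial RoPE with the record's 16/64 ratio: only the first 16 of 64 head dimensions are rotated, the rest pass through. The base frequency of 10000 and the split-half rotation layout are assumptions, not from the PR.

```python
import numpy as np

def partial_rope(x: np.ndarray, rotary_dims: int = 16, base: float = 10000.0):
    # x: (seq_len, head_dim). Rotate only the first `rotary_dims` dims.
    seq, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = 1.0 / base ** (np.arange(half) / half)
    t = np.arange(seq)[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(t), np.sin(t)
    x_rot, x_pass = x[:, :rotary_dims], x[:, rotary_dims:]
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)        # (seq, head_dim)
```

Leaving most dimensions unrotated gives the model position-free channels while still encoding relative position in the rotated slice.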
depth recurrence
Loops layers 3-5, with the recurrence activated 35% of the way through training.
parameters: {"layers":[3,5],"activated_at_frac":0.35}
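A control-flow sketch of the depth recurrence: before activation every layer runs once; afterwards, layers 3-5 each run multiple times per forward pass. The record gives only the layer span and activation point, so `n_loops=2` is an assumed value.

```python
def forward_with_recurrence(x, layers, loop_span=(3, 5), n_loops=2,
                            recurrence_on=True):
    # Run the layer stack; once recurrence is activated (at 35% of
    # training in this record), layers in loop_span (inclusive,
    # 0-indexed) run n_loops times each instead of once.
    lo, hi = loop_span
    for i, layer in enumerate(layers):
        reps = n_loops if recurrence_on and lo <= i <= hi else 1
        for _ in range(reps):
            x = layer(x)
    return x
```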
Gated Attention
Dual-lane parallel residual structure with a gated attention path, applied from layer 8 onward.
parameters: {"start_layer":8}
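The record does not specify the exact gate form or placement, so this sketch assumes a parallel (attention ∥ MLP) block with a per-token sigmoid gate on the attention lane; `gate_w` is an illustrative learned parameter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_parallel_block(x, attn, mlp, gate_w):
    # Dual-lane parallel residual: attention and MLP both read the block
    # input; a learned sigmoid gate scales the attention lane before the
    # two lanes are summed into the residual stream.
    gate = sigmoid(x @ gate_w)
    return x + gate * attn(x) + mlp(x)
```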
BigramHash
A learned bigram hash embedding table used during test-time training (TTT).
parameters: {"dimensions":16384}
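A sketch of the bucket-indexing side of a bigram hash embedding: each consecutive token pair is hashed into one of 16,384 buckets, and the bucket indexes a learned embedding table. The mixing constant is illustrative, not from the PR.

```python
import numpy as np

def bigram_hash_ids(tokens, table_size: int = 16384) -> np.ndarray:
    # Hash each consecutive token pair (bigram) into a bucket id in
    # [0, table_size). The prime multiplier is an assumed mixing choice.
    t = np.asarray(tokens, dtype=np.int64)
    pairs = t[:-1] * 1000003 + t[1:]
    return pairs % table_size
```

Identical bigrams always map to the same bucket, so the table can accumulate bigram-level statistics as TTT proceeds.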
Regularization
logit softcap
parameters: {"value":30}
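Logit softcapping with value 30 smoothly bounds logits to (-30, 30) via tanh; a one-line NumPy sketch:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    # Near-identity for small logits, saturating smoothly at +/- cap.
    return cap * np.tanh(logits / cap)
```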
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"sharded_reduce_scatter_all_gather":true,"newton_schulz_steps":5}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
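One EMA step with decay 0.997 over all parameter tensors; the PR fuses this across tensors with batched foreach-style ops, while the unbatched equivalent is:

```python
import numpy as np

def ema_update(ema_params, params, decay: float = 0.997):
    # In-place, per tensor: ema <- decay * ema + (1 - decay) * param.
    # The PR batches this loop across tensors (torch._foreach_* style).
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p
```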
LR Schedule
warmdown
parameters: {"frac":0.72}
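A sketch of the warmdown schedule, assuming `frac: 0.72` means the linear decay-to-zero spans the final 72% of training (the record does not spell out which endpoint the fraction refers to):

```python
def lr_at(step: int, total_steps: int, base_lr: float = 1.0,
          warmdown_frac: float = 0.72) -> float:
    # Constant LR, then linear decay to zero over the last
    # warmdown_frac of training. Interpretation of "frac" is assumed.
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```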
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9,"gradient_clipping":1}
Novel Contributions
- Systems-level optimizations that increase throughput without changing the ML
- Fused Muon kernel combining momentum update, Nesterov extrapolation, row normalization, and Newton-Schulz orthogonalization
- Batched EMA using foreach operations
- Reusable NumPy buffer preallocation in the data loader
- Casefold v2 tokenizer retrained on NFKC-normalized, lowercased text, with verified byte counting
- Parallel residual architecture with dual-lane residuals
- Score-first chunk-based TTT with hash embedding
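Of the fused Muon kernel's stages (momentum, Nesterov extrapolation, row normalization, Newton-Schulz), the orthogonalization step is the distinctive one; a standalone NumPy sketch using the quintic coefficients from the public Muon reference implementation, with the record's 5 steps:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes
    # the (Frobenius-normalized) update matrix, as used by Muon.
    # Coefficients are from the Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After 5 steps the singular values of the output cluster near 1 (approximate orthogonality) without ever computing an SVD, which is what makes the step cheap enough to fuse.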