PR #1670
Record: Casefold V4 Tokenizer + Multi-Phase Global SGD TTT — val_bpb 1.05970 (3-seed mean)
by dexhunter
val_bpb
1.05970
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.20 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
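A minimal sketch of weight tying: one shared table serves as both the input embedding and, transposed, the output head. All names here are hypothetical; the record only states that input and output embeddings are tied.

```python
class TiedLM:
    def __init__(self, vocab, dim):
        # one shared table: row v is both the embedding of token v
        # and the output-head weight vector for token v
        self.emb = [[0.01 * (v + d) for d in range(dim)] for v in range(vocab)]

    def embed(self, token):
        return self.emb[token]

    def logits(self, hidden):
        # output projection = dot product with each embedding row (tied weights)
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.emb]

model = TiedLM(vocab=4, dim=3)
logits = model.logits(model.embed(2))
```

Tying removes the separate output matrix, which also shrinks the compressed artifact.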
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
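The activation itself, with the record's slope of 0.5 (much higher than the conventional 0.01, so negative inputs pass half-through):

```python
def leaky_relu(x, slope=0.5):
    # slope 0.5 per this record; standard LeakyReLU typically uses ~0.01
    return x if x > 0.0 else slope * x
```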
Partial RoPE
Applies rotary position embeddings to only a subset of each head's dimensions.
parameters: {"dimensions":"16/64"}
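A sketch of partial RoPE, assuming "16/64" means the first 16 of 64 head dimensions are rotated and the rest are left untouched (the exact dimension selection and frequency base are assumptions):

```python
import math

def partial_rope(x, pos, rot_dims=16):
    # rotate consecutive pairs within the first rot_dims dimensions;
    # dimensions beyond rot_dims carry no positional signal
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (10000 ** (i / rot_dims))  # assumed frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```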
depth recurrence
Reuses layers in encoder/decoder recurrence patterns.
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
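A sketch of depth recurrence using the schedules from the record: a stack of unique layers is traversed in an order that revisits some indices, so effective depth exceeds parameter count. The toy layers are illustrative only.

```python
# layer-index schedules copied from this record's parameters
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def run_stack(x, layers, schedule):
    # apply physical layers in schedule order; repeated indices reuse weights
    for idx in schedule:
        x = layers[idx](x)
    return x

# toy stand-in layers: layer k adds k, so reuse is visible in the sum
layers = [lambda x, k=k: x + k for k in range(11)]
enc_out = run_stack(0, layers, ENCODER_SCHEDULE)
dec_out = run_stack(0, layers, DECODER_SCHEDULE)
```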
U-Net skip connections
Adds skip connections between layers in a U-Net-like pattern.
parameters: null
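A minimal sketch of the U-Net wiring, assuming the common scheme where each decoder layer receives the matching encoder activation in reverse (LIFO) order; the record does not specify the pairing.

```python
def unet_forward(x, enc_layers, dec_layers):
    skips = []
    for layer in enc_layers:
        x = layer(x)
        skips.append(x)             # save each encoder activation
    for layer in dec_layers:
        x = layer(x + skips.pop())  # add the matching activation, last-in first-out
    return x

# toy layers so the skip arithmetic is easy to trace
enc = [lambda x: x + 1.0, lambda x: x + 1.0]
dec = [lambda x: x + 1.0, lambda x: x + 1.0]
result = unet_forward(0.0, enc, dec)
```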
Gated Attention
Applies sigmoid gating to skip connections in the residual path.
parameters: null
EMA
Exponential moving average of model weights during training.
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5,"row_normalized":true}
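A sketch of the 5-step Newton-Schulz orthogonalization at Muon's core, using the quintic coefficients from the public Muon implementation; the "MuonEq-R" row-normalized variant in this record may differ in details.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=5):
    # drive the singular values of G toward 1 without an explicit SVD:
    # X <- a*X + (b*A + c*A^2) @ X, where A = X @ X^T
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from public Muon code
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        B = [[b * x + c * y for x, y in zip(ra, rb)] for ra, rb in zip(A, A2)]
        X = [[a * x + y for x, y in zip(rx, rb)] for rx, rb in zip(X, matmul(B, X))]
    return X
```

After 5 steps the singular values land near 1 rather than exactly on it; Muon applies the result as an (approximately) orthogonalized momentum update.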
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
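A sketch of the usual Muon/AdamW split implied by "used_for: embeddings/scalars": Muon handles 2-D weight matrices, AdamW handles embeddings and non-matrix parameters. The name-based rule below is an assumption, not the PR's actual grouping code.

```python
def split_param_groups(named_shapes):
    # Muon orthogonalizes matrices, so only 2-D non-embedding weights go to it;
    # embeddings, scales, and biases fall through to AdamW
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "emb" not in name:  # assumed naming convention
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

muon, adamw = split_param_groups(
    {"emb.weight": (8192, 768), "attn.w": (768, 768), "ln.scale": (768,)}
)
```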
Weight Averaging
EMA
parameters: {"decay":0.9965}
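The EMA update with the record's decay of 0.9965, shown as a minimal per-parameter sketch; the shadow weights, not the raw training weights, are what get evaluated and shipped.

```python
def ema_update(avg, weights, decay=0.9965):
    # shadow = decay * shadow + (1 - decay) * current
    # decay 0.9965 gives an effective averaging window of ~1/(1-decay) ≈ 286 steps
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```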
Quantization
GPTQ
bits: 6
scope: attention/MLP
GPTQ
bits: 7
scope: embeddings
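For intuition on the 6-bit vs. 7-bit split, here is the uniform symmetric grid that GPTQ rounds onto. This is not GPTQ itself (which chooses roundings to minimize layer output error using calibration data), just the quantization target at each bit width.

```python
def quantize(values, bits):
    # round-to-nearest onto a symmetric uniform grid with 2^(bits-1)-1 levels
    # per sign; GPTQ would pick roundings more cleverly than nearest-neighbor
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]
```

Each extra bit halves the grid step, which is why the embeddings (7-bit) retain more precision than the attention/MLP weights (6-bit).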
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: null
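A sketch of the span bookkeeping behind sliding-window evaluation, assuming the common scheme: each step scores only the next `stride` tokens, conditioned on up to `window` tokens of context, so every token is scored exactly once with near-full context. Window and stride values are not given in the record.

```python
def sliding_window_spans(n_tokens, window, stride):
    # returns (context_start, score_start, score_end) triples covering
    # all tokens; only [score_start, score_end) contributes to the loss
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```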
Test-Time Training
score-first TTT
parameters: {"method_variant":"Multi-Phase Global SGD","phases":3,"prefix_documents":2000,"learning_rate":0.001,"momentum":0.9,"gradient_clipping":1}
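Two pieces of the TTT loop can be sketched from the listed hyperparameters: splitting the 2000 prefix documents into 3 sequential phases, and a global SGD step with momentum 0.9 and gradient clipping at 1. The clip-before-momentum ordering and the phase partitioning are assumptions; the "score-first" document ordering is not reproduced here.

```python
def phase_batches(doc_ids, phases=3):
    # split the prefix documents into `phases` sequential chunks
    k = -(-len(doc_ids) // phases)  # ceiling division
    return [doc_ids[i:i + k] for i in range(0, len(doc_ids), k)]

def clipped_sgd_step(w, grads, vel, lr=0.001, momentum=0.9, clip=1.0):
    # one global SGD-with-momentum step; gradient norm clipped to `clip`
    norm = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    new_vel = [momentum * v + scale * g for v, g in zip(vel, grads)]
    new_w = [wi - lr * v for wi, v in zip(w, new_vel)]
    return new_w, new_vel
```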
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.75}
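A sketch of the warmdown schedule, assuming `warmdown_fraction` is the fraction of total steps spent linearly decaying to zero (so the LR is constant for the first 25% of training, then decays over the final 75%):

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.75):
    # constant phase, then linear decay to 0 over the final warmdown fraction
    decay_start = int(total_steps * (1.0 - warmdown_fraction))
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)
```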
Regularization
logit softcap
parameters: {"value":30}
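Logit softcapping with the record's cap of 30 is the standard tanh form: logits are smoothly bounded to (-cap, cap), behaving like the identity for small values while preventing extreme logits.

```python
import math

def softcap(logit, cap=30.0):
    # cap * tanh(x / cap): ~identity for |x| << cap, saturates at +/-cap
    return cap * math.tanh(logit / cap)
```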
layerwise LN scale
parameters: null
Novel Contributions
- Casefold tokenizer normalization with retrained SP8192 BPE on lowercased data
- Multi-phase global SGD test-time training with score-first ordering
- TTT over 2000 validation prefix documents, split into 3 sequential phases
- Trimmed GPTQ calibration/reserve setup for smaller artifacts
- Adaptive GPTQ clipping inherited from the PR #1530 base