PR #1901
Record: 0.8335 BPB — DualHash + AdaMuon + MoE + SDClip (3-seed mean)
by Karen042009
val_bpb: 0.8335
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.13 MiB
Training Techniques
Architecture
BigramHash
Dual-token hash skip connection using two hash tables for bigram-style skip features.
parameters: {"tables":2,"table_shape":"2048x16","multipliers":[8191,104729]}
depth recurrence
Recurrent layer structure with a repeated loop over layers and learnable LayerScale coefficients.
parameters: {"pattern":[0,1,2,3,4,5,3,4,5]}
MoE
Hybrid mixture-of-experts combining one always-active shared expert with specialized experts; each token is routed top-1 among the specialized experts, and the shared expert's output is added unconditionally.
parameters: {"shared_experts":1,"specialized_experts":3,"top_k":1}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"AdaMuon","rms_preconditioning":true,"riemannian_newton_schulz_orthogonalization":true}
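For orientation, here is the quintic Newton-Schulz orthogonalization used by standard Muon (coefficients from the Muon reference implementation), plus a hypothetical update step showing one way the RMS pre-conditioning could compose with it. The composition order is an assumption, and the Riemannian variant named above is not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration that pushes the singular values of
    a gradient matrix toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def adamuon_step(p, momentum_buf, v_buf, lr=0.02,
                 beta=0.95, beta2=0.999, eps=1e-8):
    """Hypothetical AdaMuon update: momentum, then orthogonalize, then
    divide elementwise by a running RMS of the orthogonalized update."""
    momentum_buf.mul_(beta).add_(p.grad)
    o = newton_schulz_orthogonalize(momentum_buf)
    v_buf.mul_(beta2).addcmul_(o, o, value=1 - beta2)
    p.data.add_(o / (v_buf.sqrt() + eps), alpha=-lr)
```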
Quantization
int6
bits: 6
scope: artifact export
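A sketch of symmetric INT6 export, assuming per-tensor scaling; the clip threshold is where the SDClip search from the contributions list plugs in:

```python
import torch

def quantize_int6(w, clip=None):
    """Symmetric 6-bit quantization: scale into [-31, 31], round, clamp.
    Dequantize with q.float() * scale."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    max_abs = w.abs().max() if clip is None else clip
    scale = max_abs / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale
```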
Test-Time Training
score-first TTT
parameters: {"passes":2}
Compression
lzma
level: null
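The final artifact is LZMA-compressed; since no level is recorded, a sketch using Python's stdlib with its default preset:

```python
import lzma
import os

def export_artifact(payload: bytes, path: str) -> int:
    """Write the serialized (quantized) weights through LZMA and return
    the on-disk size, the number that counts toward the ~15.13 MiB artifact."""
    with lzma.open(path, "wb") as f:             # default preset; none recorded
        f.write(payload)
    return os.path.getsize(path)
```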
Regularization
layerwise LN scale
parameters: {"learnable_layerscale":true,"main_branch_init":1,"recurrent_branch_init":0.1}
Novel Contributions
- DualTokenHashSkip with dual hash tables for bigram skip connections
- LayerScale recurrence with a repeated layer loop
- SharedMoE with one shared expert and three specialized experts
- AdaMuon optimizer with RMS pre-conditioning and Newton-Schulz orthogonalization
- Dynamic MSE SDClip search for optimal INT6 export (see the sketch after this list)
- Score-first two-pass test-time training
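Reading "SDClip" as clipping at k standard deviations of the weight tensor, the dynamic MSE search presumably sweeps candidate k values, quantizes with each clip, and keeps the one minimizing round-trip error. That reading, and the search grid, are assumptions; a sketch pairing with quantize_int6 from the Quantization entry:

```python
import torch

def sdclip_search(w, bits=6, k_grid=None):
    """Sweep clip thresholds k * std(w) and return the k (and its MSE)
    minimizing quantization round-trip error at the given bit width."""
    if k_grid is None:
        k_grid = [2.0 + 0.25 * i for i in range(16)]  # 2.0 .. 5.75, assumed grid
    qmax = 2 ** (bits - 1) - 1
    std = w.std().item()
    best_k, best_mse = None, float("inf")
    for k in k_grid:
        scale = k * std / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        mse = ((q * scale - w) ** 2).mean().item()
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse
```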