val_bpb: 1.1527
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.95 MB
Training Techniques
Architecture
MLP3x
Expanded MLP width to 3x.
parameters: null
XSA
Efficient XSA used on the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Applied RoPE to only part of each head's dimensions.
parameters: null
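The card does not state which fraction of dimensions is rotated, so `rot_frac` below is an assumption. A minimal numpy sketch of partial RoPE, rotating only the first `rot` dimensions of each head and passing the rest through unchanged:

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary position embeddings to only a fraction of head dims.

    x: (seq_len, d_head). Dims beyond `rot` pass through unchanged.
    rot_frac is an assumed hyperparameter; the card does not specify it.
    """
    seq, d = x.shape
    rot = int(d * rot_frac) // 2 * 2               # even number of rotated dims
    half = rot // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot:]], axis=1)
```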
VE128
Value residual with dimension 128 on the last 3 layers.
parameters: {"layers":3,"dimensions":128}
BigramHash
Bigram hash embedding with smear gate.
parameters: {"vocab_size":2048}
SmearGate
Used with BigramHash.
parameters: null
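The exact form of the smear gate is not given; the sketch below assumes a learned per-feature sigmoid gate that mixes a hashed bigram embedding (table size 2048, per the card) into the ordinary token embedding. The hash multiplier, vocab size, and embedding width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, D = 256, 2048, 32   # V and D are illustrative; H = 2048 comes from the card

tok_emb = rng.standard_normal((V, D)) * 0.02      # ordinary token embeddings
bigram_emb = rng.standard_normal((H, D)) * 0.02   # hashed bigram table
gate_logits = np.zeros(D)                         # per-feature smear gate (learned)

def bigram_hash(prev_ids, ids, mult=1000003):
    # multiplicative hash of each (previous, current) token pair into H buckets
    return (prev_ids * mult + ids) % H

def embed(ids):
    prev = np.concatenate([[0], ids[:-1]])        # previous token; 0 pads the start
    gate = 1.0 / (1.0 + np.exp(-gate_logits))     # sigmoid "smear" gate
    return tok_emb[ids] + gate * bigram_emb[bigram_hash(prev, ids)]
```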
mini-MoE
Multiple random up-projections with token-dependent routing over experts.
parameters: {"experts":1}
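A sketch of the routing idea, assuming hard top-1 routing with a learned linear router (the card lists experts: 1; four experts are shown here only to make the routing visible, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_experts = 16, 48, 4

# frozen random up-projections, one per expert
U = rng.standard_normal((n_experts, d_in, d_hid)) / np.sqrt(d_in)
router = rng.standard_normal((d_in, n_experts)) * 0.02   # learned in practice

def moe_up(x):
    """Token-dependent top-1 routing over the random up-projections."""
    choice = (x @ router).argmax(axis=1)           # one expert per token
    return np.stack([t @ U[e] for t, e in zip(x, choice)]), choice
```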
Regularization
LN scale
parameters: null
Quantization
STE QAT
bits: 6
scope: all
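A minimal sketch of 6-bit straight-through-estimator QAT, applied to all weights per the card; the symmetric per-tensor scheme below is an assumption:

```python
import numpy as np

def fake_quant_ste(w, bits=6):
    """Symmetric per-tensor fake quantization to `bits` bits.

    During QAT the forward pass uses the quantized weights while the
    backward pass copies gradients straight through; in an autograd
    framework this is written as w + stop_gradient(q - w).
    """
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return q
```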
Weight Averaging
EMA + SWA
parameters: {"swa":"late"}
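A sketch of combining a running EMA with a plain average taken over late-stage steps, matching the card's `{"swa": "late"}` setting; the decay and the "late" start step are assumptions:

```python
import numpy as np

class AveragedWeights:
    """Track an EMA throughout training plus an SWA-style plain average
    over late checkpoints. decay and swa_start are illustrative values."""

    def __init__(self, w, decay=0.999, swa_start=800):
        self.ema = w.copy()
        self.decay = decay
        self.swa_start = swa_start
        self.swa_sum = np.zeros_like(w)
        self.swa_n = 0

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step >= self.swa_start:          # "late" SWA window
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```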
Compression
zlib
level: null
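The artifact is deflated with zlib; the level is unspecified in the card, so `level=9` below is a guess. A round-trip sketch over a packed weight buffer:

```python
import zlib
import numpy as np

# pack the (already quantized) weights into bytes and deflate them;
# compression level is not stated in the card, so 9 is an assumption
w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
blob = zlib.compress(w.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
```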
Test-Time Training
full TTT
parameters: null
Initialization
OrthoInit
Random MLP up-projections initialized with QR-based orthogonal matrices scaled by sqrt(d_in).
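A numpy sketch of the initializer as described: draw a Gaussian matrix, orthogonalize it with a reduced QR, and scale by sqrt(d_in) (taking the card's wording literally; the sign fix is a standard detail for a unique decomposition):

```python
import numpy as np

def ortho_up_proj(d_in, d_out, rng):
    """Frozen up-projection weight (d_in -> d_out, with d_out >= d_in):
    QR-orthogonalized Gaussian, scaled by sqrt(d_in) per the card."""
    a = rng.standard_normal((d_out, d_in))
    q, r = np.linalg.qr(a)                  # q: (d_out, d_in), orthonormal columns
    q *= np.sign(np.diagonal(r))            # sign-fix for a unique decomposition
    return np.sqrt(d_in) * q.T              # weight of shape (d_in, d_out)
```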
Novel Contributions
- Partially random MLP layers with frozen QR-initialized up-projections
- Learnable per-feature gain vectors on top of fixed random bases
- Reinvesting saved parameter budget into additional model depth
- Mini-MoE routing over multiple random up-projections
- Empirical comparison of QR, scaled normal, and Rademacher initialization for random projections