| Metric | Value |
| --- | --- |
| val_bpb | 1.2554 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact Size | ~8-10 MB |
Training Techniques

Architecture

- **ReLU²**: Uses a ReLU-squared-activated random-basis MLP as a nonlinear feature map. Parameters: none.
- **LoRA**: Adds low-rank adaptation layers on top of the random-basis MLP features. Parameters: `{"rank": 16}`.
- **SmearGate**: Per-hidden diagonal gate applied to the hidden features to amplify or suppress them. Parameters: `{"dimensions": 8192}`.
- **Random-basis MLP**: Replaces learned MLP weights with deterministic random features generated from a seed. Parameters: `{"mlp_mult": 16, "layers": 11, "dim": 512}`.
Optimizer

- **Muon**: weight decay, momentum, and other hyperparameters not specified.
Weight Averaging

- **EMA** (exponential moving average of weights): parameters not specified.
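EMA weight averaging can be sketched as follows; the decay value is an assumption, since the source leaves the parameters unspecified:

```python
import numpy as np

def ema_update(avg, params, decay=0.999):
    """Move the running average a small step toward the current weights.
    decay=0.999 is an illustrative default, not the submission's setting."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in params}

# Usage: update the average once per training step.
params = {"w": np.ones(4)}
avg = {"w": np.zeros(4)}
for _ in range(10):
    avg = ema_update(avg, params)
```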
Quantization

- **GPTQ**: bit width not specified. Scope: MLP B and gates.
Compression

- **brotli**: compression level not specified.
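The quantize-then-compress step might look like the sketch below. Simple round-to-nearest quantization stands in for GPTQ (which additionally applies Hessian-aware error compensation), the 4-bit width is an assumption since the bit count is unspecified, and the stdlib `zlib` stands in for brotli so the example is self-contained:

```python
import zlib   # stand-in for brotli; the artifact uses brotli.compress(...)
import numpy as np

def quantize_rtn(w, bits=4):
    """Toy per-column round-to-nearest quantizer. GPTQ proper also
    compensates rounding error column-by-column; omitted here."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax   # per-column scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8192, 16)).astype(np.float32)    # e.g. a LoRA factor or gate block
q, scale = quantize_rtn(w)
blob = zlib.compress(q.tobytes(), level=9)                # entropy-code the low-bit integers
```

Because the quantized values use only 16 distinct byte patterns, the entropy coder recovers most of the 4-vs-8-bit slack even before any further packing.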
Novel Contributions
- Stores random-basis MLP weights implicitly via a seed instead of explicit parameters.
- Combines random feature maps with LoRA to recover task-specific capacity.
- Uses a per-hidden diagonal gate to modulate random features.
- Expands hidden width to 4x the baseline, which is affordable because the random basis is seed-generated rather than stored in the artifact.
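The seed-storage trick in the first contribution can be demonstrated directly: the artifact needs only the integer seed, because the generator reproduces bit-identical weights at load time. NumPy is shown here as an illustrative stand-in for whatever generator the submission actually uses:

```python
import numpy as np

def basis(seed, shape=(512, 8192)):
    """Regenerate the frozen random basis from a seed; nothing else is stored."""
    return np.random.default_rng(seed).standard_normal(shape)

w_train = basis(1234)   # materialized during training
w_load = basis(1234)    # re-materialized at load time from the stored seed
```

A 512x8192 fp32 matrix is 16 MB if stored explicitly; the seed is a few bytes, which is why the width expansion does not grow the ~8-10 MB artifact.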