| Metric | Value |
| --- | --- |
| val_bpb | 1.2554 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact Size | ~8-10 MB |
Training Techniques

Architecture

- **ReLU²**: Uses a ReLU-squared-activated random-basis MLP as a nonlinear feature map. Parameters: none.
- **LoRA**: Adds low-rank adaptation layers on top of the random-basis MLP features. Parameters: `{"rank": 16}`.
- **SmearGate**: Per-hidden diagonal gate applied to the hidden features to amplify or suppress them. Parameters: `{"dimensions": 8192}`.
- **Random-basis MLP**: Replaces learned MLP weights with deterministic random features generated from a seed. Parameters: `{"mlp_mult": 16, "layers": 11, "dim": 512}`.
Optimizer

- **Muon**: weight decay, momentum, and other hyperparameters not specified.
Weight Averaging

- **EMA** (exponential moving average of weights): parameters not specified.
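EMA weight averaging can be sketched as follows; the decay value is an assumption, since the source leaves the parameters unspecified:

```python
import numpy as np

def ema_update(avg, params, decay=0.999):
    """Move the running average a small step toward the current weights.
    decay=0.999 is an illustrative default, not the submission's setting."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in params}

# Usage: update the average once per training step.
params = {"w": np.ones(4)}
avg = {"w": np.zeros(4)}
for _ in range(10):
    avg = ema_update(avg, params)
```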
Quantization

- **GPTQ**: bit width not specified. Scope: MLP B and gates.
Compression

- **brotli**: compression level not specified.
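The quantize-then-compress step might look like the sketch below. Simple round-to-nearest quantization stands in for GPTQ (which additionally applies Hessian-aware error compensation), the 4-bit width is an assumption since the bit count is unspecified, and the stdlib `zlib` stands in for brotli so the example is self-contained:

```python
import zlib   # stand-in for brotli; the artifact uses brotli.compress(...)
import numpy as np

def quantize_rtn(w, bits=4):
    """Toy per-column round-to-nearest quantizer. GPTQ proper also
    compensates rounding error column-by-column; omitted here."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax   # per-column scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8192, 16)).astype(np.float32)    # e.g. a LoRA factor or gate block
q, scale = quantize_rtn(w)
blob = zlib.compress(q.tobytes(), level=9)                # entropy-code the low-bit integers
```

Because the quantized values use only 16 distinct byte patterns, the entropy coder recovers most of the 4-vs-8-bit slack even before any further packing.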
Novel Contributions
- Stores random-basis MLP weights implicitly via a seed instead of explicit parameters.
- Combines random feature maps with LoRA to recover task-specific capacity.
- Uses a per-hidden diagonal gate to modulate random features.
- Expands hidden width to 4x the baseline, which is affordable because the random basis is seed-generated rather than stored in the artifact.
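The seed-storage trick in the first contribution can be demonstrated directly: the artifact needs only the integer seed, because the generator reproduces bit-identical weights at load time. NumPy is shown here as an illustrative stand-in for whatever generator the submission actually uses:

```python
import numpy as np

def basis(seed, shape=(512, 8192)):
    """Regenerate the frozen random basis from a seed; nothing else is stored."""
    return np.random.default_rng(seed).standard_normal(shape)

w_train = basis(1234)   # materialized during training
w_load = basis(1234)    # re-materialized at load time from the stored seed
```

A 512x8192 fp32 matrix is 16 MB if stored explicitly; the seed is a few bytes, which is why the width expansion does not grow the ~8-10 MB artifact.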