val_bpb: 1.1527
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.95 MB
Training Techniques
Architecture
MLP3x
Expanded MLP width to 3x.
parameters: null
XSA
Efficient XSA used on the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Applied RoPE to only part of each head's dimensions.
parameters: null
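The card does not state which fraction of dimensions is rotated, so `rot_frac` below is an assumption. A minimal numpy sketch of partial RoPE, rotating only the first `rot` dimensions of each head and passing the rest through unchanged:

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary position embeddings to only a fraction of head dims.

    x: (seq_len, d_head). Dims beyond `rot` pass through unchanged.
    rot_frac is an assumed hyperparameter; the card does not specify it.
    """
    seq, d = x.shape
    rot = int(d * rot_frac) // 2 * 2               # even number of rotated dims
    half = rot // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot:]], axis=1)
```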
VE128
Value residual with dimension 128 on the last 3 layers.
parameters: {"layers":3,"dimensions":128}
BigramHash
Bigram hash embedding with smear gate.
parameters: {"vocab_size":2048}
SmearGate
Used with BigramHash.
parameters: null
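The exact form of the smear gate is not given; the sketch below assumes a learned per-feature sigmoid gate that mixes a hashed bigram embedding (table size 2048, per the card) into the ordinary token embedding. The hash multiplier, vocab size, and embedding width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, D = 256, 2048, 32   # V and D are illustrative; H = 2048 comes from the card

tok_emb = rng.standard_normal((V, D)) * 0.02      # ordinary token embeddings
bigram_emb = rng.standard_normal((H, D)) * 0.02   # hashed bigram table
gate_logits = np.zeros(D)                         # per-feature smear gate (learned)

def bigram_hash(prev_ids, ids, mult=1000003):
    # multiplicative hash of each (previous, current) token pair into H buckets
    return (prev_ids * mult + ids) % H

def embed(ids):
    prev = np.concatenate([[0], ids[:-1]])        # previous token; 0 pads the start
    gate = 1.0 / (1.0 + np.exp(-gate_logits))     # sigmoid "smear" gate
    return tok_emb[ids] + gate * bigram_emb[bigram_hash(prev, ids)]
```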
mini-MoE
Multiple random up-projections with token-dependent routing over experts.
parameters: {"experts":1}
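A sketch of the routing idea, assuming hard top-1 routing with a learned linear router (the card lists experts: 1; four experts are shown here only to make the routing visible, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_experts = 16, 48, 4

# frozen random up-projections, one per expert
U = rng.standard_normal((n_experts, d_in, d_hid)) / np.sqrt(d_in)
router = rng.standard_normal((d_in, n_experts)) * 0.02   # learned in practice

def moe_up(x):
    """Token-dependent top-1 routing over the random up-projections."""
    choice = (x @ router).argmax(axis=1)           # one expert per token
    return np.stack([t @ U[e] for t, e in zip(x, choice)]), choice
```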
Regularization
LN scale
parameters: null
Quantization
STE QAT
bits: 6
scope: all
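A minimal sketch of 6-bit straight-through-estimator QAT, applied to all weights per the card; the symmetric per-tensor scheme below is an assumption:

```python
import numpy as np

def fake_quant_ste(w, bits=6):
    """Symmetric per-tensor fake quantization to `bits` bits.

    During QAT the forward pass uses the quantized weights while the
    backward pass copies gradients straight through; in an autograd
    framework this is written as w + stop_gradient(q - w).
    """
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return q
```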
Weight Averaging
EMA + SWA
parameters: {"swa":"late"}
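A sketch of combining a running EMA with a plain average taken over late-stage steps, matching the card's `{"swa": "late"}` setting; the decay and the "late" start step are assumptions:

```python
import numpy as np

class AveragedWeights:
    """Track an EMA throughout training plus an SWA-style plain average
    over late checkpoints. decay and swa_start are illustrative values."""

    def __init__(self, w, decay=0.999, swa_start=800):
        self.ema = w.copy()
        self.decay = decay
        self.swa_start = swa_start
        self.swa_sum = np.zeros_like(w)
        self.swa_n = 0

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step >= self.swa_start:          # "late" SWA window
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```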
Compression
zlib
level: null
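The artifact is deflated with zlib; the level is unspecified in the card, so `level=9` below is a guess. A round-trip sketch over a packed weight buffer:

```python
import zlib
import numpy as np

# pack the (already quantized) weights into bytes and deflate them;
# compression level is not stated in the card, so 9 is an assumption
w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
blob = zlib.compress(w.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
```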
Test-Time Training
full TTT
parameters: null
Initialization
OrthoInit
Random MLP up-projections initialized with QR-based orthogonal matrices scaled by sqrt(d_in).
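A numpy sketch of the initializer as described: draw a Gaussian matrix, orthogonalize it with a reduced QR, and scale by sqrt(d_in) (taking the card's wording literally; the sign fix is a standard detail for a unique decomposition):

```python
import numpy as np

def ortho_up_proj(d_in, d_out, rng):
    """Frozen up-projection weight (d_in -> d_out, with d_out >= d_in):
    QR-orthogonalized Gaussian, scaled by sqrt(d_in) per the card."""
    a = rng.standard_normal((d_out, d_in))
    q, r = np.linalg.qr(a)                  # q: (d_out, d_in), orthonormal columns
    q *= np.sign(np.diagonal(r))            # sign-fix for a unique decomposition
    return np.sqrt(d_in) * q.T              # weight of shape (d_in, d_out)
```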
Novel Contributions
- Partially random MLP layers with frozen QR-initialized up-projections
- Learnable per-feature gain vectors on top of fixed random bases
- Reinvesting saved parameter budget into additional model depth
- Mini-MoE routing over multiple random up-projections
- Empirical comparison of QR, scaled normal, and Rademacher initialization for random projections