PR #1113

Status: open

Notable Non-Record: Learning Adapters on Random Linear Maps — 1.3705 BPB

by gowtham0992
val_bpb: 1.3705
Architecture: Transformer
Optimizer:
Artifact Size: 5.19 MB

Training Techniques

Architecture
FrozenRandomLinearWithLoRA
Replaces attention and MLP projection weights with frozen random orthogonal linear maps plus LoRA adapters.
parameters: {"rank":32}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":"16/64"}
XSA
Uses XSA in all layers.
parameters: {"layers":11}
BigramHash
Adds BigramHash features for token interactions.
parameters: {"size":2048}
SmearGate
Adds a SmearGate to the architecture.
parameters: null
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
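The central architectural idea above, a frozen base projection with a trainable low-rank LoRA correction, can be sketched as follows. This is a minimal illustration, not the PR's actual module; the class name, initialization scales, and helper layout are assumptions.

```python
import numpy as np

class FrozenLinearWithLoRA:
    """Sketch: a frozen linear map plus a trainable rank-r LoRA adapter.

    The real submission's module is named FrozenRandomLinearWithLoRA and
    uses rank 32; everything else here is illustrative.
    """

    def __init__(self, weight, rank=32, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.weight = weight                                    # frozen, never trained
        self.lora_a = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
        self.lora_b = np.zeros((d_out, rank))                   # trainable, zero-initialized

    def __call__(self, x):
        # With lora_b at zero, the layer is exactly the frozen base map;
        # only the low-rank correction ever receives gradient updates.
        return x @ self.weight.T + (x @ self.lora_a.T) @ self.lora_b.T
```

Because `lora_b` starts at zero, training begins from the pure frozen map, and only the 2·r·d adapter parameters per projection are learned.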
Initialization
OrthoInit
Frozen weights are random orthogonal matrices generated via QR decomposition from deterministic seeds.
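A QR-based deterministic construction like the one described might look like this. It is a sketch under the stated assumptions (seeded Gaussian matrix, reduced QR, sign fix from R's diagonal), not the submission's code.

```python
import numpy as np

def orthogonal_from_seed(seed, d):
    """Generate a d x d random orthogonal matrix deterministically from a seed,
    via QR decomposition of a seeded Gaussian matrix (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Multiply each column by the sign of R's diagonal so the
    # factorization (and hence the matrix) is unique for a given seed.
    return q * np.sign(np.diag(r))
```

Because the same seed always reproduces the same matrix, the frozen weights never need to be stored in the artifact.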
Quantization
GPTQ
bits: 6
scope: model weights
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
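The EMA entry (decay 0.997) is the standard exponential moving average of weights. A minimal sketch, with a hypothetical dict-of-arrays parameter layout:

```python
import numpy as np

def ema_update(shadow, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.

    `shadow` and `params` map parameter names to arrays; this layout is
    an assumption for illustration, not the submission's actual API.
    """
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow
```

The shadow copy trails the live weights and is typically what gets evaluated and saved.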
Regularization
LN scale
parameters: null

Novel Contributions

  • Frozen random orthogonal weights stored as non-persistent buffers, regenerated from seed at load time, costing 0 bytes in the artifact
  • LoRA adapters trained on top of random linear maps instead of learned base projections
  • Save/load roundtrip that reconstructs frozen weights deterministically with identical logits
  • Compact 5.19 MB submission with about 30M effective parameters
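The first and third bullets, frozen weights regenerated from seed with an identical-logits roundtrip, can be illustrated with a small self-contained sketch. All names here are hypothetical; the point is that the frozen map is rebuilt from its seed at load time rather than serialized, costing zero bytes in the artifact.

```python
import numpy as np

def frozen_weight(seed, d):
    # Regenerate the frozen orthogonal map deterministically from a seed
    # (QR of a seeded Gaussian, with sign fix for uniqueness).
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

class TinyFrozenLayer:
    """Hypothetical mini-layer mirroring the PR's storage scheme."""

    def __init__(self, d, rank, seed, rng=None):
        rng = rng or np.random.default_rng(1)
        self.d, self.rank, self.seed = d, rank, seed
        self.weight = frozen_weight(seed, d)                 # buffer: 0 bytes saved
        self.lora_a = rng.standard_normal((rank, d)) * 0.01  # trained, saved
        self.lora_b = rng.standard_normal((d, rank)) * 0.01  # trained, saved

    def state_dict(self):
        # The artifact stores only the adapters and the seed, never self.weight.
        return {"seed": self.seed,
                "lora_a": self.lora_a.copy(),
                "lora_b": self.lora_b.copy()}

    @classmethod
    def from_state_dict(cls, sd, d, rank):
        layer = cls(d, rank, sd["seed"])  # frozen weight rebuilt from the seed
        layer.lora_a, layer.lora_b = sd["lora_a"], sd["lora_b"]
        return layer

    def __call__(self, x):
        return x @ self.weight.T + (x @ self.lora_a.T) @ self.lora_b.T
```

A roundtrip through `state_dict` and `from_state_dict` reproduces the frozen weight bit-for-bit from the seed, so the restored layer emits identical logits.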