PR #1113

Status: open

Notable Non-Record: Learning Adapters on Random Linear Maps — 1.3705 BPB

by gowtham0992
val_bpb: 1.3705
Architecture: Transformer
Optimizer:
Artifact Size: 5.19 MB

Training Techniques

Architecture
FrozenRandomLinearWithLoRA
Replaces attention and MLP projection weights with frozen random orthogonal linear maps plus LoRA adapters.
parameters: {"rank":32}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":"16/64"}
XSA
Uses XSA in all layers.
parameters: {"layers":11}
BigramHash
Adds BigramHash features for token interactions.
parameters: {"size":2048}
SmearGate
Adds a SmearGate to the architecture.
parameters: null
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
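The central architectural idea above, a frozen base projection with a trainable low-rank LoRA correction, can be sketched as follows. This is a minimal illustration, not the PR's actual module; the class name, initialization scales, and helper layout are assumptions.

```python
import numpy as np

class FrozenLinearWithLoRA:
    """Sketch: a frozen linear map plus a trainable rank-r LoRA adapter.

    The real submission's module is named FrozenRandomLinearWithLoRA and
    uses rank 32; everything else here is illustrative.
    """

    def __init__(self, weight, rank=32, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.weight = weight                                    # frozen, never trained
        self.lora_a = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
        self.lora_b = np.zeros((d_out, rank))                   # trainable, zero-initialized

    def __call__(self, x):
        # With lora_b at zero, the layer is exactly the frozen base map;
        # only the low-rank correction ever receives gradient updates.
        return x @ self.weight.T + (x @ self.lora_a.T) @ self.lora_b.T
```

Because `lora_b` starts at zero, training begins from the pure frozen map, and only the 2·r·d adapter parameters per projection are learned.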
Initialization
OrthoInit
Frozen weights are random orthogonal matrices generated via QR decomposition from deterministic seeds.
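A QR-based deterministic construction like the one described might look like this. It is a sketch under the stated assumptions (seeded Gaussian matrix, reduced QR, sign fix from R's diagonal), not the submission's code.

```python
import numpy as np

def orthogonal_from_seed(seed, d):
    """Generate a d x d random orthogonal matrix deterministically from a seed,
    via QR decomposition of a seeded Gaussian matrix (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Multiply each column by the sign of R's diagonal so the
    # factorization (and hence the matrix) is unique for a given seed.
    return q * np.sign(np.diag(r))
```

Because the same seed always reproduces the same matrix, the frozen weights never need to be stored in the artifact.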
Quantization
GPTQ
bits: 6
scope: model weights
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
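The EMA entry (decay 0.997) is the standard exponential moving average of weights. A minimal sketch, with a hypothetical dict-of-arrays parameter layout:

```python
import numpy as np

def ema_update(shadow, params, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.

    `shadow` and `params` map parameter names to arrays; this layout is
    an assumption for illustration, not the submission's actual API.
    """
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow
```

The shadow copy trails the live weights and is typically what gets evaluated and saved.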
Regularization
LN scale
parameters: null

Novel Contributions

  • Frozen random orthogonal weights stored as non-persistent buffers, regenerated from seed at load time, costing 0 bytes in the artifact
  • LoRA adapters trained on top of random linear maps instead of learned base projections
  • Save/load roundtrip that reconstructs frozen weights deterministically with identical logits
  • Compact 5.19 MB submission with about 30M effective parameters
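The first and third bullets, frozen weights regenerated from seed with an identical-logits roundtrip, can be illustrated with a small self-contained sketch. All names here are hypothetical; the point is that the frozen map is rebuilt from its seed at load time rather than serialized, costing zero bytes in the artifact.

```python
import numpy as np

def frozen_weight(seed, d):
    # Regenerate the frozen orthogonal map deterministically from a seed
    # (QR of a seeded Gaussian, with sign fix for uniqueness).
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

class TinyFrozenLayer:
    """Hypothetical mini-layer mirroring the PR's storage scheme."""

    def __init__(self, d, rank, seed, rng=None):
        rng = rng or np.random.default_rng(1)
        self.d, self.rank, self.seed = d, rank, seed
        self.weight = frozen_weight(seed, d)                 # buffer: 0 bytes saved
        self.lora_a = rng.standard_normal((rank, d)) * 0.01  # trained, saved
        self.lora_b = rng.standard_normal((d, rank)) * 0.01  # trained, saved

    def state_dict(self):
        # The artifact stores only the adapters and the seed, never self.weight.
        return {"seed": self.seed,
                "lora_a": self.lora_a.copy(),
                "lora_b": self.lora_b.copy()}

    @classmethod
    def from_state_dict(cls, sd, d, rank):
        layer = cls(d, rank, sd["seed"])  # frozen weight rebuilt from the seed
        layer.lora_a, layer.lora_b = sd["lora_a"], sd["lora_b"]
        return layer

    def __call__(self, x):
        return x @ self.weight.T + (x @ self.lora_a.T) @ self.lora_b.T
```

A roundtrip through `state_dict` and `from_state_dict` reproduces the frozen weight bit-for-bit from the seed, so the restored layer emits identical logits.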