val_bpb: 1.3220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are looped twice, reusing the same weights across both passes.
parameters: {"layers":[3,4,5],"loops":2}
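A minimal sketch of the depth-recurrence forward pass, with blocks as plain callables; the function signature and block representation are illustrative, not the submission's actual code:

```python
# Depth recurrence: blocks at the listed indices are applied `loops` times
# in sequence, reusing the same block (and hence the same weights) per pass.
def forward(blocks, x, looped=(3, 4, 5), loops=2):
    for i, block in enumerate(blocks):
        passes = loops if i in looped else 1
        for _ in range(passes):
            x = block(x)
    return x
```

With six increment blocks, indices 3-5 run twice, so the input is incremented nine times in total.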
XSA
Cross-segment attention applied across layers.
parameters: null
Partial RoPE
Rotary position embeddings applied to only part of the head dimension with NTK scaling.
parameters: {"dimensions":16}
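A sketch of partial RoPE: only the first `rot_dims` of each head dimension are rotated and the rest pass through unchanged. The 16-dim figure is from the report; the NTK base-scaling formula is a common choice and is assumed, not confirmed.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0, ntk_factor=1.0):
    seq, dim = x.shape                       # x: (seq_len, head_dim)
    half = rot_dims // 2
    # NTK-aware scaling of the rotary base (assumed formula).
    scaled_base = base * ntk_factor ** (rot_dims / max(rot_dims - 2, 1))
    inv_freq = scaled_base ** (-2.0 * np.arange(half) / rot_dims)
    angles = np.outer(np.arange(seq), inv_freq)       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond rot_dims are left unrotated.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```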
U-Net skip connections
Learned skip connections between encoder and decoder halves.
parameters: null
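One way to realize learned skips across the stack, saving outputs of the first (encoder) half and mixing them into the mirrored second (decoder) half through learned scalar gates; the scalar-gate parameterization is an assumption, not the submission's confirmed design:

```python
# U-Net-style skips: block i in the second half receives the saved output
# of its mirror block (n - 1 - i) scaled by a learned gate.
def forward_with_skips(blocks, x, gates):
    n, half = len(blocks), len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i >= half:
            x = x + gates[n - 1 - i] * saved[n - 1 - i]
        x = block(x)
        if i < half:
            saved.append(x)
    return x
```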
Squared LeakyReLU
Uses LeakyReLU with negative slope 0.5, squared, as the MLP activation.
parameters: null
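The activation as described, in scalar form for illustration:

```python
# Squared LeakyReLU: apply LeakyReLU with negative slope 0.5, then square.
def act(x, negative_slope=0.5):
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Note that squaring makes the output non-negative on both sides, unlike plain LeakyReLU.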
GQA
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
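A minimal GQA sketch with the report's 8 query heads sharing 4 KV heads: each KV head is repeated across its group before standard scaled-dot-product attention.

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (heads, seq, d); k, v: (kv_heads, seq, d); heads % kv_heads == 0
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With 8 heads and 4 KV heads the KV cache (and KV projection parameters) shrink by 2x relative to full multi-head attention.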
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalization":true,"momentum_warmup":true}
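Muon's core step orthogonalizes the momentum-averaged gradient with a quintic Newton-Schulz iteration; a sketch for a square matrix, with coefficients from the commonly published Muon implementation. The report's momentum warmup, row normalization, and weight decay (0.095) wrap around this step and are omitted here.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic iteration that drives singular values of X toward 1,
    # approximately orthogonalizing the update direction.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # pre-normalize before iterating
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

For non-square weights, implementations typically transpose so the short side comes first; this sketch assumes a square matrix.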
Weight Averaging
EMA
parameters: {"decay":0.9965}
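The EMA update with the report's decay of 0.9965, with weights as plain lists of floats for illustration:

```python
# Exponential moving average of weights: each step nudges the running
# average toward the current weights by (1 - decay).
def ema_update(avg, new, decay=0.9965):
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```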
Quantization
GPTQ
bits: 6
scope: adapter weights
int8
bits: 8
scope: embeddings
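A plausible scheme for the int8 embedding pass is symmetric per-row quantization, sketched below; the 6-bit GPTQ pass used for adapter weights is calibration-based and more involved, so it is not sketched here.

```python
import numpy as np

def quantize_int8(w):
    # One scale per row, chosen so the row's max magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```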
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":256}
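A sketch of sliding-window evaluation with the report's stride of 256: each window advances by `stride` tokens and only the not-yet-scored tokens at its end are counted, so every token is scored exactly once with long left context. The window size (2048, matching the train length) is an assumption, since eval_length is unspecified.

```python
def sliding_windows(n_tokens, window=2048, stride=256):
    # Yields (window_start, window_end, n_newly_scored_tokens).
    scored, start = 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end, end - scored
        scored = end
        start += stride
```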
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
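With warmdown_frac = 0.72 from the report, the learning rate is held constant and then decays linearly to zero over the final 72% of steps; the constant-then-linear shape is the usual "warmdown" schedule and is assumed here.

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.72):
    # Flat phase, then linear decay to zero over the warmdown phase.
    warmdown_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```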
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Other
other
Frozen random backbone reconstructed deterministically from seed=42; only LoRA adapters, embeddings, and control tensors are serialized.
parameters: {"seed":42,"lora_rank":304}
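The load-time half of this trick can be sketched as follows: backbone weights are regenerated deterministically from seed 42, so they cost zero bytes in the serialized artifact. The shapes and init distribution below are illustrative, not the submission's.

```python
import random

def materialize_backbone(seed=42, shapes=((64, 64), (64, 256))):
    # Same seed -> bit-identical backbone on every load; only the LoRA
    # adapters, embeddings, and control tensors need to be stored.
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
            for rows, cols in shapes]
```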
Novel Contributions
- Frozen random backbone reconstructed from a deterministic seed so backbone weights take 0 bytes in the artifact
- Rank-304 LoRA adapters on every linear layer with only adapter weights serialized
- Depth recurrence combined with adapters, reusing adapter weights across repeated layers
- Disabling EMA for random adapters to avoid regression from averaging adapter_B toward zero initialization
- GPTQ int6 plus brotli serialization pipeline that fits within the 16MB limit