PR #1549
Non-record: Frozen Random Backbone + Rank-304 LoRA Adapters (val_bpb 1.3220)
by dljr-github
val_bpb: 1.3220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB
Training Techniques
Architecture
depth recurrence
Looped layers 3-5 reuse the same adapter weights across 3 passes to increase gradient signal.
parameters: {"layers":[3,4,5],"passes":3}
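A minimal sketch of the looping, assuming layers are plain callables (the function and argument names are hypothetical, not the PR's actual code):

```python
def forward_with_recurrence(x, layers, looped=(3, 4, 5), passes=3):
    """Run a stack of layers, re-applying the looped block `passes` times.

    `layers` is a list of callables; layers 3-5 share weights across the
    passes, so gradients through that block accumulate `passes` times.
    """
    i = 0
    while i < len(layers):
        if i == looped[0]:
            block = layers[looped[0]: looped[-1] + 1]
            for _ in range(passes):          # same weights, multiple passes
                for layer in block:
                    x = layer(x)
            i = looped[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 8 identity-plus-one layers, the looped block of 3 layers runs 3 times, so the input is incremented 5 + 9 = 14 times instead of 8.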
XSA
XSA applied across all layers.
parameters: null
Partial RoPE
RoPE is applied to a 16-dimensional slice of each head rather than across the full head dimension.
parameters: {"dimensions":16}
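A sketch of partial RoPE, assuming (as is common, though not stated in the entry) that the rotated slice is the first 16 dims of each head:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of each head vector;
    the remaining dims pass through unchanged.

    x: (..., head_dim) array; pos: integer token position."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]   # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and dims beyond 16 are never touched at any position.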
U-Net skip connections
U-Net style skip connections added to the model.
parameters: null
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
MLP uses LeakyReLU(0.5)^2 activation.
parameters: {"negative_slope":0.5}
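A sketch of the activation; the entry does not say whether the square preserves sign, so this version squares plainly:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)**2 with slope 0.5 on the negative side, analogous to
    the squared-ReLU MLP activation but leaky."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```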
Quantization
GPTQ
bits: 6
scope: adapter matrices
Compression
brotli
level: 11
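A sketch of the 6-bit round-trip for the adapter matrices. GPTQ proper also applies error-compensating column updates during rounding; this shows only the symmetric quantize/dequantize step (all names are illustrative):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization to the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The quantized bytes would then be compressed with brotli at quality 11 (e.g. `brotli.compress(q.tobytes(), quality=11)`) before serialization; round-trip error is at most half a quantization step.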
Evaluation
sliding window eval
parameters: {"stride":256}
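One way to realize stride-256 sliding-window evaluation (a sketch, not the entry's code): advance the window by the stride and score each token exactly once, so every scored token after the first window sees near-full context.

```python
def sliding_window_spans(n_tokens, window=2048, stride=256):
    """Yield (lo, hi, scored) triples: run the model on tokens[lo:hi],
    but count loss only for positions in [scored, hi)."""
    lo, scored = 0, 0
    while scored < n_tokens:
        hi = min(lo + window, n_tokens)
        yield lo, hi, scored
        scored = hi
        lo += stride
```

For 12 tokens with window 8 and stride 2, the scored segments are [0, 8), [8, 10), [10, 12): full coverage with no double counting.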
Weight Averaging
EMA
parameters: {"decay":0.9965}
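The EMA update with decay 0.9965 is the standard one (note the entry states below that it is disabled for the adapters themselves):

```python
def ema_update(ema, current, decay=0.9965):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]
```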
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"embeddings_and_scalars":true}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
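A sketch of the schedule implied by warmdown_frac 0.72, assuming the common shape (constant LR, then linear decay to zero over the final 72% of steps):

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """LR multiplier: 1.0 until warmdown starts, then linear decay to 0."""
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

For 1000 total steps, warmdown begins at step 280 and the multiplier reaches 0.5 at step 640.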
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
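The stated scale is a direct formula, damping deeper layers' LayerNorm outputs:

```python
import math

def ln_scale(layer_index):
    """Per-layer LayerNorm output scale: 1 / sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer_index + 1)
```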
Novel Contributions
- Frozen random backbone reconstructed from a deterministic seed at load time, requiring no serialized backbone weights
- Rank-304 LoRA adapters applied to all linear layers
- Depth recurrence on layers 3-5 with shared adapter weights across multiple passes
- GPTQ int6 quantization with brotli compression for adapter-only artifact serialization
- EMA disabled for adapters because it regresses performance by averaging adapter_B toward zero initialization
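The frozen-random-backbone idea can be sketched as follows (a minimal numpy illustration with hypothetical names; the actual model presumably rebuilds its weights this way in PyTorch): each backbone matrix is regenerated from a deterministic seed at load time, so only the low-rank adapters are serialized.

```python
import numpy as np

def load_backbone_weight(shape, seed, layer_id):
    """Rebuild a frozen backbone matrix deterministically from a seed;
    nothing is serialized for the backbone itself."""
    rng = np.random.default_rng((seed, layer_id))
    return rng.standard_normal(shape).astype(np.float32) / np.sqrt(shape[0])

def lora_forward(x, w_frozen, a, b, scale=1.0):
    """y = x @ (W + scale * A @ B): frozen random W plus a trainable
    rank-r update (A: in x r, B: r x out, with r = 304 in this entry);
    only A and B are stored in the artifact."""
    return x @ w_frozen + scale * (x @ a) @ b
```

Regenerating with the same (seed, layer_id) pair yields an identical matrix, and with B at its zero initialization the LoRA path contributes nothing.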