val_bpb: 1.2639
Architecture: Transformer
Optimizer: —
Artifact Size: 14,724,469 B
Training Techniques
Other
- Learned adapters / LoRA exploration on random linear maps, including baseline learned-adapter training and several LoRA variants (parameters: none).
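As a point of reference for the baseline entry, a learned low-rank adapter on a frozen random linear map can be sketched as below. The dimensions, rank, and initialization scales are illustrative assumptions, not values recorded in this archive; the zero-initialized up-projection makes the adapted map start exactly equal to the frozen map, which is the standard LoRA setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random linear map W: the "base model" in this exploration.
# Dimensions and rank are illustrative, not from the archived runs.
d_in, d_out, rank = 64, 64, 8
W = rng.normal(0, 1 / np.sqrt(d_in), (d_in, d_out))

# Learned low-rank adapter: only A and B train; W stays frozen.
A = rng.normal(0, 0.01, (d_in, rank))  # down-projection, small random init
B = np.zeros((rank, d_out))            # up-projection, zero init => delta starts at 0

def adapted_forward(x, scale=1.0):
    """y = x @ (W + scale * A @ B), computed without materializing the sum."""
    return x @ W + scale * (x @ A) @ B

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted map equals the frozen map exactly.
assert np.allclose(adapted_forward(x), x @ W)
```

Only `A` and `B` (here 2 × 64 × 8 = 1,024 values) are trainable, versus 4,096 for the dense map, which is where the artifact-size savings reported below come from.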
- rsLoRA variant with non-uniform ranks across attention and MLP components (parameters: {"rank_qk_early": 48, "rank_qk_late": 64, "rank_vo_early": 64, "rank_vo_late": 96, "rank_mlp_early": 96, "rank_mlp_late": 128}).
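The defining change in rsLoRA is the rank-stabilized scaling factor, alpha / sqrt(r) instead of the standard alpha / r, which keeps the update magnitude from shrinking as rank grows; with the non-uniform ranks above (48 to 128), that difference matters. A sketch, with an assumed alpha (the archive does not record one):

```python
import math

# Non-uniform rank table from the run's parameters (early vs. late layers).
ranks = {
    ("qk", "early"): 48, ("qk", "late"): 64,
    ("vo", "early"): 64, ("vo", "late"): 96,
    ("mlp", "early"): 96, ("mlp", "late"): 128,
}

ALPHA = 32  # hypothetical value; not recorded in this archive

def lora_scale(r, alpha=ALPHA):
    """Standard LoRA scaling: alpha / r."""
    return alpha / r

def rslora_scale(r, alpha=ALPHA):
    """rsLoRA scaling: alpha / sqrt(r), stabilizing updates across ranks."""
    return alpha / math.sqrt(r)

# Going from the smallest to the largest rank in the table, standard LoRA
# shrinks the update far more aggressively than rsLoRA does.
assert rslora_scale(128) / rslora_scale(48) > lora_scale(128) / lora_scale(48)
```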
- LoRA-GA (gradient-approximation initialization) variant (parameters: none).
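The core idea of LoRA-GA is to initialize the adapter factors from the singular directions of the initial full gradient, so the first low-rank step approximates a full fine-tuning step. The sketch below shows only that SVD-based initialization with a stand-in gradient; the published recipe also includes details (e.g. offsetting the frozen weight so the initial output is unchanged) that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4  # illustrative sizes, not from the run

# Stand-in for the full gradient dL/dW at initialization (in practice
# estimated from a small batch; here a random matrix for shape only).
grad_W = rng.normal(size=(d_in, d_out))

# LoRA-GA-style init sketch: top singular directions of the gradient,
# so A @ B is the best rank-r approximation of grad_W (Eckart–Young).
U, S, Vt = np.linalg.svd(grad_W, full_matrices=False)
A = U[:, :rank]                    # d_in x r
B = np.diag(S[:rank]) @ Vt[:rank]  # r x d_out

delta = A @ B
full_err = np.linalg.norm(grad_W - delta)
rand_err = np.linalg.norm(
    grad_W - rng.normal(size=(d_in, rank)) @ rng.normal(size=(rank, d_out))
)
# The SVD init tracks the gradient far better than a random rank-r product.
assert full_err < rand_err
```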
- AdaLoRA variant with adaptive rank allocation (parameters: {"init_rank": 128, "target_avg_rank": 80}).
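AdaLoRA's adaptive allocation starts every module at a high rank and prunes low-importance singular components until a parameter budget is met; with the parameters above, that means pruning from rank 128 down to an average of 80. A minimal sketch of the budgeted pruning step, using random stand-in importance scores (AdaLoRA derives these from sensitivity estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, init_rank, target_avg = 6, 128, 80  # n_modules is illustrative

# Stand-in importance score per singular component per module.
scores = rng.random((n_modules, init_rank))

# Keep the globally highest-scoring components so the *average* retained
# rank across modules equals target_avg; ranks become non-uniform.
budget = n_modules * target_avg
flat = scores.ravel()
keep = np.zeros_like(flat, dtype=bool)
keep[np.argsort(flat)[-budget:]] = True
keep = keep.reshape(scores.shape)

ranks = keep.sum(axis=1)      # resulting per-module rank allocation
assert ranks.sum() == budget  # mean rank == target_avg_rank
```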
- RandLoRA variant using frozen random bases with learned combinations (parameters: {"rand_basis_rank": 16, "num_bases_qo": 32, "num_bases_kv": 16, "num_bases_mlp": 32, "late_layer_mult": 2}).
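In the RandLoRA scheme, the low-rank factors themselves are frozen random matrices, and only small per-basis mixing coefficients are learned. A sketch with the run's rand_basis_rank of 16 and the 32-basis setting (the map dimension and the diagonal-mixer parameterization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_bases = 64, 16, 32  # rand_basis_rank=16, num_bases=32; d is assumed

# Frozen random bases: never trained, only combined.
A = rng.normal(0, 1 / np.sqrt(d), (n_bases, d, r))  # frozen down-projections
B = rng.normal(0, 1 / np.sqrt(r), (n_bases, r, d))  # frozen up-projections
lam = np.zeros((n_bases, r))  # learned per-basis diagonal mixers, zero init

def delta_w():
    """delta W = sum_i A_i @ diag(lam_i) @ B_i; starts at exactly zero."""
    return np.einsum("nir,nr,nrj->ij", A, lam, B)

assert np.allclose(delta_w(), 0.0)
# Trainable parameters (lam only) are tiny relative to a dense d x d update.
assert lam.size < d * d
```

Because only `lam` is stored per adapted module, this variant trades expressiveness for a very small trainable and serialized footprint.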
- Targeted/selective LoRA placement by layer depth and module type (attention or MLP only) (parameters: {"layers": "4-6", "targets": ["attn", "mlp"]}).
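The selective-placement config can be read as a simple predicate over (layer, module) pairs; the helper below is a hypothetical illustration of how the {"layers": "4-6", "targets": [...]} parameters would gate adapter insertion.

```python
def wants_adapter(layer_idx, module, layers="4-6", targets=("attn", "mlp")):
    """True if this (layer, module) gets a LoRA adapter, mirroring the
    run's {"layers": "4-6", "targets": ["attn", "mlp"]} parameters."""
    lo, hi = (int(s) for s in layers.split("-"))
    return lo <= layer_idx <= hi and module in targets

assert wants_adapter(5, "attn")
assert not wants_adapter(2, "attn")                    # outside depth range
assert not wants_adapter(5, "attn", targets=("mlp",))  # attention excluded
```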
LR Schedule
- warmdown (parameters: {"warmdown_steps": 720})
- warmdown (parameters: {"warmdown_steps": 4000})
- warmdown (parameters: {"warmdown_steps": 2500})
- warmdown (parameters: {"warmdown_steps": 3000})
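A "warmdown" schedule is commonly a constant learning rate followed by a linear decay to zero over the final warmdown_steps; the sketch below uses that reading (the exact shape and total step counts for these runs are not recorded in the archive).

```python
def warmdown_lr(step, total_steps, warmdown_steps, base_lr=1.0):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps.
    One common reading of a 'warmdown' schedule; shape assumed, not recorded."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 5000  # illustrative total step count
assert warmdown_lr(0, total, 720) == 1.0       # flat phase
assert warmdown_lr(total, total, 720) == 0.0   # fully decayed
```

The four values above (720 to 4000 steps) then control only how long that final linear ramp lasts.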
Compression
- zlib (level: null)
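With no level recorded, the artifact was presumably compressed at zlib's default level; a minimal round-trip sketch (the payload is a placeholder, not the actual adapter weights):

```python
import zlib

payload = b"adapter weights placeholder " * 512  # stand-in artifact bytes

# zlib.compress falls back to its default level when none is given,
# matching the record's unspecified (null) level.
blob = zlib.compress(payload)
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)  # repetitive data compresses well
```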
Novel Contributions
- Non-record archive exploring learned adapters on random linear maps.
- Comparison of baseline learned adapters against rsLoRA, LoRA-GA, AdaLoRA, and RandLoRA variants.
- Study of selective LoRA placement by layer depth and module type.
- Reported best selective attention-only LoRA result on layers 4-6 with val_bpb 1.2639.
- Artifact size reduction from selective LoRA placement while maintaining competitive validation performance.