val_bpb: 1.2639
Architecture: Transformer
Optimizer: —
Artifact Size: 14,724,469 B
Training Techniques
Other
- Learned adapters / LoRA exploration on random linear maps, including baseline learned-adapter training and several LoRA variants (parameters: none).
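As a point of reference for the baseline entry, a learned low-rank adapter on a frozen random linear map can be sketched as below. The dimensions, rank, and initialization scales are illustrative assumptions, not values recorded in this archive; the zero-initialized up-projection makes the adapted map start exactly equal to the frozen map, which is the standard LoRA setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random linear map W: the "base model" in this exploration.
# Dimensions and rank are illustrative, not from the archived runs.
d_in, d_out, rank = 64, 64, 8
W = rng.normal(0, 1 / np.sqrt(d_in), (d_in, d_out))

# Learned low-rank adapter: only A and B train; W stays frozen.
A = rng.normal(0, 0.01, (d_in, rank))  # down-projection, small random init
B = np.zeros((rank, d_out))            # up-projection, zero init => delta starts at 0

def adapted_forward(x, scale=1.0):
    """y = x @ (W + scale * A @ B), computed without materializing the sum."""
    return x @ W + scale * (x @ A) @ B

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted map equals the frozen map exactly.
assert np.allclose(adapted_forward(x), x @ W)
```

Only `A` and `B` (here 2 × 64 × 8 = 1,024 values) are trainable, versus 4,096 for the dense map, which is where the artifact-size savings reported below come from.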
- rsLoRA variant with non-uniform ranks across attention and MLP components (parameters: {"rank_qk_early": 48, "rank_qk_late": 64, "rank_vo_early": 64, "rank_vo_late": 96, "rank_mlp_early": 96, "rank_mlp_late": 128}).
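The defining change in rsLoRA is the rank-stabilized scaling factor, alpha / sqrt(r) instead of the standard alpha / r, which keeps the update magnitude from shrinking as rank grows; with the non-uniform ranks above (48 to 128), that difference matters. A sketch, with an assumed alpha (the archive does not record one):

```python
import math

# Non-uniform rank table from the run's parameters (early vs. late layers).
ranks = {
    ("qk", "early"): 48, ("qk", "late"): 64,
    ("vo", "early"): 64, ("vo", "late"): 96,
    ("mlp", "early"): 96, ("mlp", "late"): 128,
}

ALPHA = 32  # hypothetical value; not recorded in this archive

def lora_scale(r, alpha=ALPHA):
    """Standard LoRA scaling: alpha / r."""
    return alpha / r

def rslora_scale(r, alpha=ALPHA):
    """rsLoRA scaling: alpha / sqrt(r), stabilizing updates across ranks."""
    return alpha / math.sqrt(r)

# Going from the smallest to the largest rank in the table, standard LoRA
# shrinks the update far more aggressively than rsLoRA does.
assert rslora_scale(128) / rslora_scale(48) > lora_scale(128) / lora_scale(48)
```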
- LoRA-GA (gradient-approximation initialization) variant (parameters: none).
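The core idea of LoRA-GA is to initialize the adapter factors from the singular directions of the initial full gradient, so the first low-rank step approximates a full fine-tuning step. The sketch below shows only that SVD-based initialization with a stand-in gradient; the published recipe also includes details (e.g. offsetting the frozen weight so the initial output is unchanged) that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4  # illustrative sizes, not from the run

# Stand-in for the full gradient dL/dW at initialization (in practice
# estimated from a small batch; here a random matrix for shape only).
grad_W = rng.normal(size=(d_in, d_out))

# LoRA-GA-style init sketch: top singular directions of the gradient,
# so A @ B is the best rank-r approximation of grad_W (Eckart–Young).
U, S, Vt = np.linalg.svd(grad_W, full_matrices=False)
A = U[:, :rank]                    # d_in x r
B = np.diag(S[:rank]) @ Vt[:rank]  # r x d_out

delta = A @ B
full_err = np.linalg.norm(grad_W - delta)
rand_err = np.linalg.norm(
    grad_W - rng.normal(size=(d_in, rank)) @ rng.normal(size=(rank, d_out))
)
# The SVD init tracks the gradient far better than a random rank-r product.
assert full_err < rand_err
```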
- AdaLoRA variant with adaptive rank allocation (parameters: {"init_rank": 128, "target_avg_rank": 80}).
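AdaLoRA's adaptive allocation starts every module at a high rank and prunes low-importance singular components until a parameter budget is met; with the parameters above, that means pruning from rank 128 down to an average of 80. A minimal sketch of the budgeted pruning step, using random stand-in importance scores (AdaLoRA derives these from sensitivity estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, init_rank, target_avg = 6, 128, 80  # n_modules is illustrative

# Stand-in importance score per singular component per module.
scores = rng.random((n_modules, init_rank))

# Keep the globally highest-scoring components so the *average* retained
# rank across modules equals target_avg; ranks become non-uniform.
budget = n_modules * target_avg
flat = scores.ravel()
keep = np.zeros_like(flat, dtype=bool)
keep[np.argsort(flat)[-budget:]] = True
keep = keep.reshape(scores.shape)

ranks = keep.sum(axis=1)      # resulting per-module rank allocation
assert ranks.sum() == budget  # mean rank == target_avg_rank
```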
- RandLoRA variant using frozen random bases with learned combinations (parameters: {"rand_basis_rank": 16, "num_bases_qo": 32, "num_bases_kv": 16, "num_bases_mlp": 32, "late_layer_mult": 2}).
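In the RandLoRA scheme, the low-rank factors themselves are frozen random matrices, and only small per-basis mixing coefficients are learned. A sketch with the run's rand_basis_rank of 16 and the 32-basis setting (the map dimension and the diagonal-mixer parameterization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_bases = 64, 16, 32  # rand_basis_rank=16, num_bases=32; d is assumed

# Frozen random bases: never trained, only combined.
A = rng.normal(0, 1 / np.sqrt(d), (n_bases, d, r))  # frozen down-projections
B = rng.normal(0, 1 / np.sqrt(r), (n_bases, r, d))  # frozen up-projections
lam = np.zeros((n_bases, r))  # learned per-basis diagonal mixers, zero init

def delta_w():
    """delta W = sum_i A_i @ diag(lam_i) @ B_i; starts at exactly zero."""
    return np.einsum("nir,nr,nrj->ij", A, lam, B)

assert np.allclose(delta_w(), 0.0)
# Trainable parameters (lam only) are tiny relative to a dense d x d update.
assert lam.size < d * d
```

Because only `lam` is stored per adapted module, this variant trades expressiveness for a very small trainable and serialized footprint.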
- Targeted/selective LoRA placement by layer depth and module type (attention or MLP only) (parameters: {"layers": "4-6", "targets": ["attn", "mlp"]}).
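The selective-placement config can be read as a simple predicate over (layer, module) pairs; the helper below is a hypothetical illustration of how the {"layers": "4-6", "targets": [...]} parameters would gate adapter insertion.

```python
def wants_adapter(layer_idx, module, layers="4-6", targets=("attn", "mlp")):
    """True if this (layer, module) gets a LoRA adapter, mirroring the
    run's {"layers": "4-6", "targets": ["attn", "mlp"]} parameters."""
    lo, hi = (int(s) for s in layers.split("-"))
    return lo <= layer_idx <= hi and module in targets

assert wants_adapter(5, "attn")
assert not wants_adapter(2, "attn")                    # outside depth range
assert not wants_adapter(5, "attn", targets=("mlp",))  # attention excluded
```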
LR Schedule
- warmdown (parameters: {"warmdown_steps": 720})
- warmdown (parameters: {"warmdown_steps": 4000})
- warmdown (parameters: {"warmdown_steps": 2500})
- warmdown (parameters: {"warmdown_steps": 3000})
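A "warmdown" schedule is commonly a constant learning rate followed by a linear decay to zero over the final warmdown_steps; the sketch below uses that reading (the exact shape and total step counts for these runs are not recorded in the archive).

```python
def warmdown_lr(step, total_steps, warmdown_steps, base_lr=1.0):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps.
    One common reading of a 'warmdown' schedule; shape assumed, not recorded."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 5000  # illustrative total step count
assert warmdown_lr(0, total, 720) == 1.0       # flat phase
assert warmdown_lr(total, total, 720) == 0.0   # fully decayed
```

The four values above (720 to 4000 steps) then control only how long that final linear ramp lasts.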
Compression
- zlib (level: null)
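With no level recorded, the artifact was presumably compressed at zlib's default level; a minimal round-trip sketch (the payload is a placeholder, not the actual adapter weights):

```python
import zlib

payload = b"adapter weights placeholder " * 512  # stand-in artifact bytes

# zlib.compress falls back to its default level when none is given,
# matching the record's unspecified (null) level.
blob = zlib.compress(payload)
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)  # repetitive data compresses well
```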
Novel Contributions
- Non-record archive exploring learned adapters on random linear maps.
- Comparison of baseline learned adapters against rsLoRA, LoRA-GA, AdaLoRA, and RandLoRA variants.
- Study of selective LoRA placement by layer depth and module type.
- Reported best selective attention-only LoRA result on layers 4-6 with val_bpb 1.2639.
- Artifact size reduction from selective LoRA placement while maintaining competitive validation performance.