| Field | Value |
| --- | --- |
| val_bpb | 1.8010 |
| Architecture | Transformer |
| Optimizer | AdamW |
| Artifact Size | 19,631,341 bytes |
Training Techniques

Optimizer: AdamW
- weight_decay: null
- momentum: null
- other_params: `{"RSOAdamW": true, "random_subspace_optimization": true}` (see the sketch below)
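The card only carries the `RSOAdamW` and `random_subspace_optimization` flags; the optimizer's code is not part of the submission text. Below is a minimal PyTorch sketch of the standard random-subspace formulation this presumably follows: keep the full parameters at theta = theta0 + P z for a fixed random matrix P, and run AdamW on the low-dimensional coordinates z. The class name, `subspace_dim`, scaling, and seeding here are illustrative assumptions, not the submission's implementation.

```python
import torch

class RSOAdamW:
    """Hypothetical sketch: AdamW in a random subspace.

    The trained tensor stays at theta = theta0 + P @ z, where P is a
    fixed random projection and only z is updated by AdamW.
    Assumes `param` is a leaf tensor with requires_grad=True whose
    .grad has been populated by a backward pass.
    """

    def __init__(self, param, subspace_dim=256, lr=1e-3, seed=0):
        self.param = param
        self.theta0 = param.detach().clone()   # frozen starting point
        d = param.numel()
        gen = torch.Generator().manual_seed(seed)
        # Fixed random map from the subspace into parameter space,
        # scaled so projected updates stay roughly unit-magnitude.
        self.P = torch.randn(d, subspace_dim, generator=gen) / d ** 0.5
        self.z = torch.zeros(subspace_dim, requires_grad=True)
        # weight_decay=0.0 mirrors the card's "weight_decay: null".
        self.inner = torch.optim.AdamW([self.z], lr=lr, weight_decay=0.0)

    def step(self):
        # Chain rule: dL/dz = P^T dL/dtheta.
        self.z.grad = self.P.t() @ self.param.grad.reshape(-1)
        self.inner.step()
        self.inner.zero_grad()
        with torch.no_grad():
            self.param.copy_(self.theta0 + (self.P @ self.z).view_as(self.param))
            self.param.grad = None
```

For a real model, P is far too large to materialize densely; practical variants regenerate it from the seed on the fly or apply it per layer. The dense matrix is kept here only for clarity.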
Sequence Length
- train_length: 1024
- eval_length: null
Other
- Random subspace optimization for learning adapters on Random Linear Maps (see the sketch below)
- parameters: `{"iterations": 2000, "world_size": 4, "grad_accum_steps": 2, "train_batch_tokens": 524288}`
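The card does not define the adapter modules, so the following is one plausible reading, offered as a hedged sketch: a frozen, seeded random linear map whose output is corrected by a small trainable low-rank adapter, so only the adapter's weights receive updates. The module name, `rank`, and zero-initialization are assumptions. The reported parameters also fix the data layout: 524288 train_batch_tokens / (world_size 4 * grad_accum_steps 2 * train_length 1024) = 64 sequences per device per micro-step.

```python
import torch
import torch.nn as nn

class RandomLinearMapWithAdapter(nn.Module):
    """Hypothetical module: frozen random linear map + trainable adapter."""

    def __init__(self, d_in, d_out, rank=8, seed=0):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        # Frozen random linear map, stored as a buffer so it never trains.
        W = torch.randn(d_out, d_in, generator=gen) / d_in ** 0.5
        self.register_buffer("W", W)
        # Trainable low-rank adapter: delta(x) = B(A(x)).
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        # Zero-init B so the module starts as the pure random map.
        nn.init.zeros_(self.B.weight)

    def forward(self, x):
        return x @ self.W.t() + self.B(self.A(x))

# Micro-batch arithmetic implied by the reported parameters.
tokens, world_size, grad_accum, seq_len = 524288, 4, 2, 1024
per_device_sequences = tokens // (world_size * grad_accum * seq_len)
assert per_device_sequences == 64
```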
Novel Contributions
- Introduces a random subspace optimizer (the `RSOAdamW` variant flagged above) built on a simple baseline
- Applies random subspace optimization to learning adapters on Random Linear Maps
- Reports a negative result: the run set no record and did not beat the baseline