PR #2058 (open)

Non-record submission: Learned Adapters on Random Linear Maps

by pranavxiyer
val_bpb: 1.1971
Architecture: Transformer
Optimizer:
Artifact Size: 15418110 bytes

Training Techniques

Architecture: MLP3x
Expanded MLP width to 3x and used learned adapters on random linear maps instead of storing full MLP matrices.
parameters: {"mlp_mult":3,"rank":160}
Weight tying
Tied the input and output embedding weights.
parameters: null
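
In PyTorch this is typically a one-line aliasing of the two parameters; a minimal sketch with illustrative sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50257, 768  # illustrative sizes, not from the PR
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # one matrix, stored and trained once
```
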
Quantization: mixed int6/int8
bits: 6
scope: block weights
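
One common way to realize this is symmetric per-tensor quantization parameterized by bit width; the per-block int6 vs. int8 assignment isn't specified in the PR, so this sketch is an assumption about the scheme rather than the submission's code:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to `bits` (6 or 8 here)."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                                   # ship q plus one fp scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Note that int6 values still occupy int8 containers here; realizing the full size win requires bit packing, or letting the zstd stage below squeeze out the unused bits.
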
Compression: zstd
level: 22
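
Level 22 is zstd's maximum compression level. A sketch of the round trip using the `zstandard` package (the package choice is an assumption; the PR only names zstd):

```python
import io
import torch
import zstandard

def save_compressed(state_dict, path):
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(zstandard.ZstdCompressor(level=22).compress(buf.getvalue()))

def load_compressed(path):
    with open(path, "rb") as f:
        raw = zstandard.ZstdDecompressor().decompress(f.read())
    return torch.load(io.BytesIO(raw))
```
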
Evaluation: sliding window eval
parameters: {"stride":512}
Sequence Length
train_length: 2048
eval_length: null
Other
Used learned adapters with random linear maps generated from a seed, storing only low-rank adapter matrices to reduce artifact size.
parameters: {"layers":12}

Novel Contributions

  • Learned adapters on random linear maps to avoid storing full MLP matrices
  • Low-rank adapter matrices acting like LoRA for both MLP projections
  • Expanded to 12 transformer layers while staying within the artifact budget
  • Mixed int6/int8 compression on selected blocks
  • Sliding window evaluation with stride 512
  • Wider MLP (mlp_mult 3) paired with rank-160 adapters