PR #1753

open

Non-record: Lottery Ticket Hypothesis with a few float parameters

by Abhishek-Dalvi410 · View on GitHub
val_bpb
1.2917
Architecture
Transformer
Optimizer
Adam
Artifact Size
14,891,849 bytes

Training Techniques

Architecture
weight tying
Tied embeddings are enabled for the transformer.
parameters: {"tied_embeddings":1}
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
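A minimal sketch of the grouped-query attention described above, using the listed head counts (8 query heads, 4 KV heads). The function name and tensor shapes are illustrative, not taken from the PR; with half as many KV heads as query heads, the KV cache is halved while each KV head is shared by a group of two query heads.

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch.

    q:    (batch, num_heads,    seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim)
    """
    group = num_heads // num_kv_heads      # query heads per KV head (2 here)
    # Expand each KV head so it is shared by its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```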
Initialization
custom random init
Deterministic seed-derived float32 random initialization with Xavier uniform for embeddings/attention/projections, He uniform for MLP pre-ReLU layers, and zeros for 1D parameters.
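The init scheme above can be sketched as follows. This is an assumption-laden illustration (the PR's actual fan-in/fan-out conventions and seed derivation may differ): a CPU `torch.Generator` seeded deterministically makes the float32 init reproducible bit-for-bit, Xavier uniform covers embeddings/attention/projections, He uniform covers the pre-ReLU MLP weights, and 1D parameters are zeroed.

```python
import torch

def init_param(shape, kind, seed):
    """Deterministic seed-derived float32 init (illustrative sketch)."""
    if len(shape) == 1:
        return torch.zeros(shape)          # 1D params (norms, biases) -> zeros
    g = torch.Generator().manual_seed(seed)  # CPU generator: bit-exact replay
    fan_out, fan_in = shape[0], shape[1]
    if kind == "mlp":
        bound = (6.0 / fan_in) ** 0.5      # He uniform (pre-ReLU layers)
    else:
        bound = (6.0 / (fan_in + fan_out)) ** 0.5  # Xavier uniform
    u = torch.rand(shape, generator=g, dtype=torch.float32)
    return (u * 2.0 - 1.0) * bound
```

Because the generator is fully determined by the seed, the frozen network never needs to be stored: regenerating it from the uint32 seed yields identical tensors.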
Regularization
magnitude pruning
parameters: {"binary_masks":true,"learned_mask_scores":true}
Other
other
Lottery Ticket Hypothesis / supermask training: freeze all weights at random init and learn per-element binary masks plus a few continuous scale parameters.
parameters: {"mask_temperature_start":1,"mask_temperature_end":0.5,"mask_lr":0.1}
other
Train a small set of continuous scale parameters alongside masks, including attn_scale, mlp_scale, head_scale, and pre_logit_scale.
parameters: {"scale_params":["attn_scale","mlp_scale","head_scale","pre_logit_scale"]}
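A tiny sketch of how such scale parameters could enter the forward pass. The shapes and the fixed mask here are placeholders; only the scalar scale (named after the PR's `attn_scale`) and, in the full scheme, the mask scores would be trainable, while the seed-derived weight stays frozen.

```python
import torch

w_frozen = torch.randn(16, 16)            # frozen at seed-derived init
w_frozen.requires_grad_(False)
mask = (torch.randn(16, 16) > 0).float()  # stand-in for the learned binary mask
attn_scale = torch.ones((), requires_grad=True)  # one learned fp16-storable scalar

# Effective weight: frozen init, gated by the mask, rescaled by the scalar.
w_eff = attn_scale * mask * w_frozen
```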
Compression
zlib
level: null
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":200}
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.1}

Novel Contributions

  • Freezes all transformer weights at a seed-derived random initialization and learns only binary masks over them.
  • Ships a compact artifact consisting of a uint32 seed, bit-packed masks, and a few fp16 scale parameters.
  • Demonstrates a lottery-ticket/supermask regime at large scale with no stored weight values.
  • Uses deterministic float32 CPU initialization so the frozen network can be regenerated bit-exactly from the seed.
  • Applies temperature-annealed straight-through binary masking with a small set of learned continuous scale knobs.
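The artifact layout described in the bullets (uint32 seed, bit-packed masks, fp16 scales) can be sketched like this. The exact field order and compression settings are assumptions; the point is that no weight values are stored, only one seed, one bit per masked parameter, and a handful of fp16 scalars, zlib-compressed.

```python
import struct
import zlib
import numpy as np

def pack_artifact(seed: int, masks: np.ndarray, scales: np.ndarray) -> bytes:
    """Serialize: uint32 seed | uint32 bit count | packed masks | fp16 scales."""
    flat = masks.astype(np.uint8).ravel()
    payload = struct.pack("<II", seed, flat.size)
    payload += np.packbits(flat).tobytes()          # 1 bit per parameter
    payload += scales.astype(np.float16).tobytes()  # a few continuous scalars
    return zlib.compress(payload, 9)

def unpack_masks(blob: bytes):
    """Recover the seed and flat 0/1 mask array (scales omitted for brevity)."""
    raw = zlib.decompress(blob)
    seed, n_bits = struct.unpack_from("<II", raw)
    packed = np.frombuffer(raw, dtype=np.uint8, offset=8, count=(n_bits + 7) // 8)
    return seed, np.unpackbits(packed)[:n_bits]
```

At one bit per masked parameter plus zlib, an artifact on the order of the listed ~14.9 MB is consistent with masking roughly 100M-scale parameter counts.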