PR #1481

open

Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)

by Cayton-TechView on GitHub
val_bpb
1.3440
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.09 MB

Training Techniques

Architecture
weight tying
Uses tied input/output embeddings with a low-rank embedding bottleneck and projection layers to factorise the token embedding matrix.
parameters: {"vocab_size":1024,"model_dim":512,"bottleneck_ranks":[64,128]}
other
ALBERT-style low-rank embedding factorisation, replacing the full embedding table with an embedding bottleneck plus a linear up-projection.
parameters: {"ranks_tested":[64,128]}
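The parameter saving from the factorisation follows directly from the listed dimensions (vocab_size 1024, model_dim 512, ranks 64 and 128). A minimal sketch of the arithmetic; the helper name is hypothetical, not from the PR:

```python
def embedding_params(vocab_size, model_dim, rank=None):
    """Token-embedding parameter count: full V x H table vs
    ALBERT-style factorisation into a V x r bottleneck plus
    an r x H projection (sketch using this PR's listed dims)."""
    if rank is None:
        return vocab_size * model_dim            # full table
    return vocab_size * rank + rank * model_dim  # bottleneck + projection

V, H = 1024, 512
full = embedding_params(V, H)        # 524288 parameters
for r in (64, 128):
    fact = embedding_params(V, H, r)
    print(f"rank {r}: {fact} params, {1 - fact / full:.1%} fewer than full")
```

Note the factorisation only saves parameters while r < V*H/(V+H) ≈ 341 at these dimensions, so both tested ranks sit well inside the saving regime.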
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Ablation study of ALBERT-style low-rank embedding factorisation at a 1024-token vocabulary scale
  • Comparison of rank-64 and rank-128 embedding bottlenecks against an unmodified baseline
  • Evidence that low-rank embedding factorisation does not improve validation BPB at this small (1024-token) vocabulary
  • Implementation and explanation of the tied-embedding dimension mismatch fix for factorized embeddings
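The dimension mismatch mentioned in the last bullet arises because naive weight tying reuses the embedding table, whose shape changes under factorisation. A shape-level sketch of the problem and the fix (assumed shapes only; this is not the PR's actual code):

```python
import numpy as np

# Dimensions from the PR: vocab 1024, model_dim 512, rank 64.
V, H, r = 1024, 512, 64

embed = np.random.randn(V, r)   # bottleneck table: (V, r) instead of (V, H)
proj = np.random.randn(r, H)    # up-projection to model_dim

hidden = np.random.randn(1, H)  # final hidden state for one position

# Naive tied output head computes hidden @ embed.T, which expects a
# (V, H) table; with the factorised (V, r) table the inner dims (H vs r)
# no longer match. The fix reuses the projection to map the hidden state
# back into the rank-r space before tying against the bottleneck table:
logits = (hidden @ proj.T) @ embed.T  # (1, H) -> (1, r) -> (1, V)
```

Reusing `proj` for the output path keeps the embedding and head fully tied, so the factorisation adds no untied parameters.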