PR #1481

open

Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)

by Cayton-TechView on GitHub
val_bpb
1.3440
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.09 MB

Training Techniques

Architecture
weight tying
Uses tied input/output embeddings with a low-rank embedding bottleneck and projection layers to factorise the token embedding matrix.
parameters: {"vocab_size":1024,"model_dim":512,"bottleneck_ranks":[64,128]}
other
ALBERT-style low-rank embedding factorisation, replacing the full embedding table with an embedding bottleneck plus a linear up-projection.
parameters: {"ranks_tested":[64,128]}
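The parameter saving from the factorisation follows directly from the listed dimensions (vocab_size 1024, model_dim 512, ranks 64 and 128). A minimal sketch of the arithmetic; the helper name is hypothetical, not from the PR:

```python
def embedding_params(vocab_size, model_dim, rank=None):
    """Token-embedding parameter count: full V x H table vs
    ALBERT-style factorisation into a V x r bottleneck plus
    an r x H projection (sketch using this PR's listed dims)."""
    if rank is None:
        return vocab_size * model_dim            # full table
    return vocab_size * rank + rank * model_dim  # bottleneck + projection

V, H = 1024, 512
full = embedding_params(V, H)        # 524288 parameters
for r in (64, 128):
    fact = embedding_params(V, H, r)
    print(f"rank {r}: {fact} params, {1 - fact / full:.1%} fewer than full")
```

Note the factorisation only saves parameters while r < V*H/(V+H) ≈ 341 at these dimensions, so both tested ranks sit well inside the saving regime.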
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Ablation study of ALBERT-style low-rank embedding factorisation at a 1024-token vocabulary scale
  • Comparison of rank-64 and rank-128 embedding bottlenecks against an unmodified baseline
  • Evidence that low-rank embedding factorisation does not improve validation BPB at this small (1024-token) vocabulary
  • Implementation and explanation of the tied-embedding dimension mismatch fix for factorized embeddings
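The dimension mismatch mentioned in the last bullet arises because naive weight tying reuses the embedding table, whose shape changes under factorisation. A shape-level sketch of the problem and the fix (assumed shapes only; this is not the PR's actual code):

```python
import numpy as np

# Dimensions from the PR: vocab 1024, model_dim 512, rank 64.
V, H, r = 1024, 512, 64

embed = np.random.randn(V, r)   # bottleneck table: (V, r) instead of (V, H)
proj = np.random.randn(r, H)    # up-projection to model_dim

hidden = np.random.randn(1, H)  # final hidden state for one position

# Naive tied output head computes hidden @ embed.T, which expects a
# (V, H) table; with the factorised (V, r) table the inner dims (H vs r)
# no longer match. The fix reuses the projection to map the hidden state
# back into the rank-r space before tying against the bottleneck table:
logits = (hidden @ proj.T) @ embed.T  # (1, H) -> (1, r) -> (1, V)
```

Reusing `proj` for the output path keeps the embedding and head fully tied, so the factorisation adds no untied parameters.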