PR #1481
Status: open
Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)
by Cayton-TechView on GitHub
val_bpb
1.3440
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.09 MB
Training Techniques
Architecture
weight tying
Uses tied input/output embeddings with a low-rank embedding bottleneck and projection layers to factorise the token embedding matrix.
parameters: {"vocab_size":1024,"model_dim":512,"bottleneck_ranks":[64,128]}
other
ALBERT-style low-rank embedding factorisation that replaces the full embedding table with an embedding bottleneck plus a linear up-projection.
parameters: {"ranks_tested":[64,128]}
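The factorisation described above can be sketched as follows. This is a minimal numpy illustration (not the PR's actual code) using the reported config values (vocab 1024, model dim 512, ranks 64/128): the full `vocab × H` table is replaced by a `vocab × R` lookup plus an `R × H` projection.

```python
import numpy as np

# Sizes taken from the PR's reported parameters; R is the bottleneck rank.
VOCAB, H, R = 1024, 512, 64   # rank 128 was the other ablation point

rng = np.random.default_rng(0)

# Unfactorised baseline: one big table, VOCAB * H parameters.
full_params = VOCAB * H                    # 524288

# ALBERT-style factorisation: low-rank table plus up-projection.
E_low = rng.standard_normal((VOCAB, R))    # (VOCAB, R) token lookup
P = rng.standard_normal((R, H))            # (R, H) projection to model dim
fact_params = VOCAB * R + R * H            # 98304 at rank 64

tokens = np.array([1, 5, 7])
hidden = E_low[tokens] @ P                 # embed, then project: shape (3, H)
assert hidden.shape == (3, H)

print(f"full: {full_params}, factorised: {fact_params}")
```

At rank 64 this cuts embedding parameters by roughly 5×, which is the trade-off the ablation measures against val_bpb.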
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Ablation study of ALBERT-style low-rank embedding factorisation at a 1024-token vocabulary scale
- Comparison of rank-64 and rank-128 embedding bottlenecks against an unmodified baseline
- Identification that low-rank embedding factorisation does not improve BPB for this small vocabulary
- Implementation and explanation of the tied-embedding dimension mismatch fix for factorized embeddings
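The dimension mismatch mentioned in the last point arises because, once the embedding table is factorised, the tied table has width R rather than the model dimension H, so the naive tied output head no longer type-checks. A hedged numpy sketch of the problem and the likely shape of the fix (routing the logits through the projection transpose first):

```python
import numpy as np

# Hypothetical sizes matching the PR config; not the PR's actual code.
VOCAB, H, R = 1024, 512, 64
rng = np.random.default_rng(0)

E_low = rng.standard_normal((VOCAB, R))  # tied low-rank embedding table
P = rng.standard_normal((R, H))          # shared up-projection

hidden = rng.standard_normal((3, H))     # final hidden states, 3 positions

# Naive tying breaks: hidden is (3, H) but E_low.T is (R, VOCAB),
# so `hidden @ E_low.T` raises a shape-mismatch error.

# Fix: apply the projection transpose first, then the low-rank table,
# so the tied head is effectively (E_low @ P).T applied in two steps.
logits = (hidden @ P.T) @ E_low.T        # (3, R) -> (3, VOCAB)
assert logits.shape == (3, VOCAB)

# Equivalent (by associativity) to tying against the reconstructed table:
assert np.allclose(logits, hidden @ (E_low @ P).T)
```

Computing `hidden @ P.T` first keeps the cost at `O(H·R + R·VOCAB)` per position instead of materialising the full `(VOCAB, H)` table.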