val_bpb: 1.2314
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
- Initialization (spectral init): pre-computed PCA-based token embedding positions from corpus co-occurrence statistics, selectively overriding a subset of token embeddings while leaving the rest at the default Xavier initialization.
- Architecture (weight tying): tied input and output embeddings.
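The weight-tying entry above amounts to using a single matrix for both the input embedding lookup and the output logit projection. A minimal NumPy sketch (the sizes are illustrative, not taken from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1024, 64

# One shared matrix serves as both the input embedding table
# and (transposed) the output projection.
W = rng.normal(0.0, 0.02, size=(vocab, d_model))

token_ids = np.array([3, 17, 42])
h = W[token_ids]        # input embedding lookup, shape (3, d_model)
logits = h @ W.T        # output projection reuses the same weights

print(logits.shape)     # (3, 1024)
```

Because the two roles share storage, gradients from both the embedding lookup and the output projection flow into the same parameters.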
Novel Contributions
- Pre-computed PCA token positions from corpus co-occurrence data
- Selective override of 665/1024 token embeddings with structured positions
- Xavier-standardized overridden embeddings to preserve gradient flow
- Early training convergence improvement that did not persist through full training
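The initialization described above can be sketched as follows. This is a toy NumPy reconstruction under stated assumptions: the co-occurrence counts are random stand-ins for real corpus statistics, the choice of which 665 tokens to override is not specified in the source (the first 665 rows are used here for illustration), and the Xavier scale is taken as the usual std = sqrt(2 / (fan_in + fan_out)).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, d_model = 1024, 64   # 1024-token vocab per the source; d_model is illustrative
n_override = 665            # number of embeddings overridden, per the source

# Stand-in for corpus co-occurrence counts (real runs would count these from text).
cooc = rng.poisson(1.0, size=(vocab, vocab)).astype(np.float64)
cooc = cooc + cooc.T        # symmetrize

# PCA of the centered co-occurrence rows: project onto the top d_model components.
X = cooc - cooc.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
positions = X @ Vt[:d_model].T          # (vocab, d_model) PCA coordinates

# Standardize the PCA positions to Xavier scale so the overridden rows match
# the statistics of the untouched rows and gradient flow is preserved.
xavier_std = np.sqrt(2.0 / (vocab + d_model))
positions = (positions - positions.mean()) / positions.std() * xavier_std

# Default Xavier init for the full embedding table.
emb = rng.normal(0.0, xavier_std, size=(vocab, d_model))

# Selectively override a subset of token rows with the structured positions.
emb[:n_override] = positions[:n_override]

print(emb.shape)
```

The standardization step is what the "Xavier-standardized overridden embeddings" bullet refers to: without it, the raw PCA coordinates would have a much larger scale than the randomly initialized rows.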