PR #908
Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline
by albertorkive
val_bpb
1.1734
Architecture
Transformer
Optimizer
—
Artifact Size
16.8MB
Training Techniques
Weight Averaging
EMA
parameters: {"decay":0.997}
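The card lists EMA weight averaging with decay 0.997. A minimal sketch of how such a shadow-weight scheme typically works (the PR does not include code; the class and update rule here are illustrative):

```python
class EMA:
    """Exponential moving average of model parameters (decay from the card: 0.997).

    After each optimizer step, the shadow copy moves a small fraction toward
    the live weights; evaluation then uses the shadow copy instead of the
    raw weights.
    """

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)  # shadow copy used at eval time

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            # shadow <- d * shadow + (1 - d) * live
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```

With decay 0.997, the shadow weights average roughly the last few hundred steps, smoothing out late-training noise.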
Architecture
XSA
Applied XSA to the last 4 layers of the baseline model.
parameters: {"layers":4}
SmearGate
Enabled SmearGate in the baseline.
parameters: null
BigramHash
Enabled BigramHash with bucketed hashed embeddings.
parameters: {"buckets":2048,"dimensions":128}
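The BigramHash entry specifies 2048 buckets and 128-dimensional embeddings. A hedged sketch of a bucketed hashed-bigram embedding under those parameters (the PR's exact hash function and how the result is combined with token embeddings are not stated; this stand-in assumes the common additive formulation):

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # parameters from the card

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(BUCKETS, DIM))  # learned in practice

def bigram_bucket(prev_tok, cur_tok):
    # Multiplicative hash of the (prev, cur) token pair into a fixed bucket.
    # The PR's actual hash is not specified; this is an illustrative stand-in.
    return ((prev_tok * 1000003) ^ cur_tok) % BUCKETS

def bigram_embeddings(tokens):
    # One hashed-bigram embedding per position; position 0 has no predecessor,
    # so its row stays zero. The result is typically added to the ordinary
    # token embeddings before the first transformer block.
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out
```

Hashing into a fixed bucket count caps the table at 2048 × 128 weights regardless of vocabulary size, at the cost of occasional bigram collisions.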
Partial RoPE
Used partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":16}
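Partial RoPE rotates only a prefix of each head's dimensions (16 here) and passes the rest through unrotated. A sketch, assuming the standard rotary formulation and the usual NTK-aware base rescaling; the PR's exact scaling factor is not given, so `ntk_factor` defaults to 1.0:

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0, ntk_factor=1.0):
    """Rotate only the first `rope_dims` dims of each position; leave the rest.

    NTK-aware scaling stretches the frequency base by
    ntk_factor ** (d / (d - 2)), the commonly used form (an assumption here).
    x: (seq_len, head_dim) array.
    """
    seq, _ = x.shape
    half = rope_dims // 2
    scaled_base = base * ntk_factor ** (rope_dims / (rope_dims - 2))
    inv_freq = 1.0 / scaled_base ** (np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

Restricting rotation to 16 dimensions leaves the remaining channels position-independent, which is the usual motivation for partial RoPE.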
VE128
Enabled VE128 on layers 9 and 10.
parameters: {"layers":[9,10]}
weight tying
Used a standard tied output head as the control baseline.
parameters: null
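The control baseline ties the output head to the input embedding matrix. A minimal sketch of what weight tying means structurally (sizes and initialization are illustrative, not from the PR):

```python
import numpy as np

class TiedHead:
    """Standard tied output head: the logit projection reuses the input
    embedding matrix, so no separate unembedding weights are stored."""

    def __init__(self, vocab, dim, rng):
        self.emb = rng.normal(0, 0.02, size=(vocab, dim))  # single shared matrix

    def embed(self, tokens):
        return self.emb[tokens]            # (n, dim) input embeddings

    def logits(self, hidden):
        return hidden @ self.emb.T         # (n, vocab) via the same matrix
```

Because the same matrix serves both roles, the tied head adds zero parameters beyond the embedding, which is part of why the higher-rank variants below had to justify their extra weights.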
Regularization
LN Scale
parameters: null
Quantization
late QAT
bits: null
scope: model
Evaluation
sliding window eval
parameters: {"stride":64}
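Sliding-window evaluation with stride 64 scores each token once while giving it near-maximal left context. A sketch of the span bookkeeping, assuming the usual strided formulation (window length taken from the 2048 eval length below):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Context/score spans for strided eval: each step advances the window
    by `stride` and scores only the newly revealed tokens, so every token
    is scored exactly once.

    Returns (context_start, score_start, score_end) triples.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A small stride such as 64 makes this expensive (one forward pass per 64 scored tokens) but gives each token close to the full 2048-token context.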
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Non-record study of higher-rank output heads on a fixed frontier-aligned 11L baseline
- Comparison of factorized heads, mixture-softmax heads, and a simplex head against a standard tied head
- Finding that the standard tied head outperformed all tested higher-rank variants
- Observation that mixture-softmax variants increased artifact size without improving score
- Observation that the simplex head substantially reduced artifact size but collapsed validation performance
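The card does not spell out the compared head parameterizations. As a hedged sketch, the tied control and one higher-rank alternative (a mixture-of-softmaxes head) might look like the following; all sizes, the component count `K`, and the exact gating form are assumptions for illustration:

```python
import numpy as np

vocab, dim, K = 1000, 64, 3  # illustrative sizes, not from the PR

rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, (vocab, dim))       # tied input/output embedding

def tied_logits(h):
    """Control baseline: the output head reuses the embedding matrix."""
    return h @ E.T

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Mixture-of-softmaxes head: K per-component hidden projections plus a gate.
# The extra K * dim * dim projection weights are one plausible reason the
# mixture variants grew the artifact without improving val_bpb.
proj = rng.normal(0, 0.02, (K, dim, dim))
gate = rng.normal(0, 0.02, (dim, K))

def mos_probs(h):
    w = softmax(h @ gate)                                              # (batch, K)
    comps = np.stack([softmax((h @ proj[k]) @ E.T) for k in range(K)],
                     axis=1)                                           # (batch, K, vocab)
    return (w[:, :, None] * comps).sum(axis=1)                         # (batch, vocab)
```

Mixing K softmax components lifts the rank ceiling of a single log-softmax layer, which is the motivation for such heads; per the findings above, that extra capacity did not pay off against the tied control on this baseline.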