PR #908

open

Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline

by albertorkive
val_bpb: 1.1734
Architecture: Transformer
Optimizer:
Artifact Size: 16.8 MB

Training Techniques

Weight Averaging
EMA
parameters: {"decay":0.997}
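As a minimal sketch of the EMA weight averaging above, assuming plain parameter tensors (represented here as lists of floats for brevity); the class name `EMA` is illustrative, and `decay=0.997` matches the run's reported parameter:

```python
class EMA:
    """Maintain an exponential moving average of model parameters."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        # Shadow copy that tracks the averaged weights.
        self.shadow = [p[:] for p in params]

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for s, p in zip(self.shadow, params):
            for i in range(len(s)):
                s[i] = d * s[i] + (1.0 - d) * p[i]
```

At evaluation time the shadow weights are swapped in place of the live ones.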
Architecture
XSA
Applied XSA to the last layers of the baseline model.
parameters: {"layers":4}
SmearGate
Enabled SmearGate in the baseline.
parameters: null
BigramHash
Enabled BigramHash with bucketed hashed embeddings.
parameters: {"buckets":2048,"dimensions":128}
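A hypothetical sketch of the bucketed hashing behind BigramHash: each (previous, current) token pair is hashed into one of `BUCKETS` rows of an extra embedding table, whose row is added to the current token's embedding. The bucket count matches the run's reported parameter; the particular hash mix below is an assumption:

```python
BUCKETS = 2048  # matches the run's reported "buckets" parameter

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Map a token bigram to a bucket index in [0, BUCKETS)."""
    # Cheap multiplicative mix; any deterministic pair hash works here.
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return (h >> 7) % BUCKETS
```

Each bucket would index into a learned table of shape (2048, 128), per the reported dimensions.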
Partial RoPE
Used partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":16}
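A sketch of partial RoPE: only the first `rot_dims` channels of each head are rotated and the rest pass through unchanged; `rot_dims=16` matches the run. The NTK-aware variant rescales the frequency base, here by `ntk_scale ** (d / (d - 2))`; that exact scaling rule is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0, ntk_scale=1.0):
    """Rotate the first rot_dims channels of x by position-dependent angles."""
    d = rot_dims
    # NTK-aware base rescaling (assumed form); ntk_scale=1.0 leaves base as-is.
    b = base * ntk_scale ** (d / (d - 2))
    out = list(x)
    for i in range(0, d, 2):
        theta = pos / (b ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out  # channels beyond rot_dims are untouched
```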
VE128
Enabled VE128 on later layers.
parameters: {"layers":[9,10]}
weight tying
Used a standard tied output head as the control baseline.
parameters: null
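The tied control head is standard weight tying: the token embedding matrix doubles as the output projection, so logits are inner products between the hidden state and each embedding row. A minimal dependency-free sketch:

```python
def tied_logits(h, embed):
    """logits[v] = <h, embed[v]>; the embedding table is reused as the head."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in embed]
```

This is what the higher-rank variants in this study replace with an explicit, separately parameterized projection.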
Regularization
LN Scale
parameters: null
Quantization
late QAT
bits: null
scope: model
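The core operation in QAT applied late in training is a straight-through fake-quantize on the weights. The bit width is unreported here (`bits: null`), so it is left as a parameter in this sketch:

```python
def fake_quantize(x, bits, scale):
    """Round x to a signed bits-wide grid with the given scale, then dequantize."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(lo, min(hi, round(x / scale)))
    return q * scale
```

During the late QAT phase, the forward pass uses the quantized values while gradients flow through as if the rounding were the identity (the straight-through estimator).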
Evaluation
sliding window eval
parameters: {"stride":64}
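A sketch of sliding-window evaluation with the reported stride of 64: the model scores overlapping context windows, but after the first window only the final `stride` tokens of each window (those with full left context) contribute to the loss:

```python
def eval_windows(n_tokens, ctx=2048, stride=64):
    """Yield (start, end, score_from) index triples over a token stream."""
    start = 0
    while start + ctx <= n_tokens:
        end = start + ctx
        # First window scores everything; later windows score only the
        # trailing `stride` tokens, which have a full ctx of left context.
        score_from = start if start == 0 else end - stride
        yield (start, end, score_from)
        start += stride
```

With `ctx=2048` this matches the run's eval_length, and every token after the first window is scored exactly once.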
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • Non-record study of higher-rank output heads on a fixed frontier-aligned 11L baseline
  • Comparison of factorized heads, mixture-softmax heads, and a simplex head against a standard tied head
  • Finding that the standard tied head outperformed all tested higher-rank variants
  • Observation that mixture-softmax variants increased artifact size without improving score
  • Observation that the simplex head substantially reduced artifact size but collapsed validation performance
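As a hypothetical sketch of the higher-rank variants compared above (names and shapes are assumptions; only the general structure is implied by the report): a factorized head replaces the tied projection `h @ E^T` with an explicit two-matrix product `(h @ A) @ B`, which is the structure whose added parameters failed to beat the tied control here:

```python
def factorized_logits(h, A, B):
    """Two-stage output head: h [dim] -> z [rank] -> logits [vocab].

    A has shape [dim][rank], B has shape [rank][vocab]; both are extra
    parameters relative to the tied head, which is what grows artifact size.
    """
    rank = len(A[0])
    z = [sum(hi * A[i][r] for i, hi in enumerate(h)) for r in range(rank)]
    vocab = len(B[0])
    return [sum(zr * B[r][v] for r, zr in enumerate(z)) for v in range(vocab)]
```

A mixture-softmax head goes further, mixing several such projections through component softmaxes, which is consistent with the observation that those variants grew the artifact without improving val_bpb.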