PR #908

open

Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline

by albertorkive
val_bpb: 1.1734
Architecture: Transformer
Optimizer:
Artifact Size: 16.8 MB

Training Techniques

Weight Averaging
EMA
parameters: {"decay":0.997}
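As a minimal sketch of the EMA weight averaging above, assuming plain parameter tensors (represented here as lists of floats for brevity); the class name `EMA` is illustrative, and `decay=0.997` matches the run's reported parameter:

```python
class EMA:
    """Maintain an exponential moving average of model parameters."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        # Shadow copy that tracks the averaged weights.
        self.shadow = [p[:] for p in params]

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for s, p in zip(self.shadow, params):
            for i in range(len(s)):
                s[i] = d * s[i] + (1.0 - d) * p[i]
```

At evaluation time the shadow weights are swapped in place of the live ones.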
Architecture
XSA
Applied XSA to the last layers of the baseline model.
parameters: {"layers":4}
SmearGate
Enabled SmearGate in the baseline.
parameters: null
BigramHash
Enabled BigramHash with bucketed hashed embeddings.
parameters: {"buckets":2048,"dimensions":128}
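A hypothetical sketch of the bucketed hashing behind BigramHash: each (previous, current) token pair is hashed into one of `BUCKETS` rows of an extra embedding table, whose row is added to the current token's embedding. The bucket count matches the run's reported parameter; the particular hash mix below is an assumption:

```python
BUCKETS = 2048  # matches the run's reported "buckets" parameter

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Map a token bigram to a bucket index in [0, BUCKETS)."""
    # Cheap multiplicative mix; any deterministic pair hash works here.
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return (h >> 7) % BUCKETS
```

Each bucket would index into a learned table of shape (2048, 128), per the reported dimensions.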
Partial RoPE
Used partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":16}
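A sketch of partial RoPE: only the first `rot_dims` channels of each head are rotated and the rest pass through unchanged; `rot_dims=16` matches the run. The NTK-aware variant rescales the frequency base, here by `ntk_scale ** (d / (d - 2))`; that exact scaling rule is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0, ntk_scale=1.0):
    """Rotate the first rot_dims channels of x by position-dependent angles."""
    d = rot_dims
    # NTK-aware base rescaling (assumed form); ntk_scale=1.0 leaves base as-is.
    b = base * ntk_scale ** (d / (d - 2))
    out = list(x)
    for i in range(0, d, 2):
        theta = pos / (b ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out  # channels beyond rot_dims are untouched
```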
VE128
Enabled VE128 on later layers.
parameters: {"layers":[9,10]}
weight tying
Used a standard tied output head as the control baseline.
parameters: null
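The tied control head is standard weight tying: the token embedding matrix doubles as the output projection, so logits are inner products between the hidden state and each embedding row. A minimal dependency-free sketch:

```python
def tied_logits(h, embed):
    """logits[v] = <h, embed[v]>; the embedding table is reused as the head."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in embed]
```

This is what the higher-rank variants in this study replace with an explicit, separately parameterized projection.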
Regularization
LN Scale
parameters: null
Quantization
late QAT
bits: null
scope: model
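The core operation in QAT applied late in training is a straight-through fake-quantize on the weights. The bit width is unreported here (`bits: null`), so it is left as a parameter in this sketch:

```python
def fake_quantize(x, bits, scale):
    """Round x to a signed bits-wide grid with the given scale, then dequantize."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(lo, min(hi, round(x / scale)))
    return q * scale
```

During the late QAT phase, the forward pass uses the quantized values while gradients flow through as if the rounding were the identity (the straight-through estimator).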
Evaluation
sliding window eval
parameters: {"stride":64}
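A sketch of sliding-window evaluation with the reported stride of 64: the model scores overlapping context windows, but after the first window only the final `stride` tokens of each window (those with full left context) contribute to the loss:

```python
def eval_windows(n_tokens, ctx=2048, stride=64):
    """Yield (start, end, score_from) index triples over a token stream."""
    start = 0
    while start + ctx <= n_tokens:
        end = start + ctx
        # First window scores everything; later windows score only the
        # trailing `stride` tokens, which have a full ctx of left context.
        score_from = start if start == 0 else end - stride
        yield (start, end, score_from)
        start += stride
```

With `ctx=2048` this matches the run's eval_length, and every token after the first window is scored exactly once.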
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • Non-record study of higher-rank output heads on a fixed frontier-aligned 11L baseline
  • Comparison of factorized heads, mixture-softmax heads, and a simplex head against a standard tied head
  • Finding that the standard tied head outperformed all tested higher-rank variants
  • Observation that mixture-softmax variants increased artifact size without improving score
  • Observation that the simplex head substantially reduced artifact size but collapsed validation performance
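As a hypothetical sketch of the higher-rank variants compared above (names and shapes are assumptions; only the general structure is implied by the report): a factorized head replaces the tied projection `h @ E^T` with an explicit two-matrix product `(h @ A) @ B`, which is the structure whose added parameters failed to beat the tied control here:

```python
def factorized_logits(h, A, B):
    """Two-stage output head: h [dim] -> z [rank] -> logits [vocab].

    A has shape [dim][rank], B has shape [rank][vocab]; both are extra
    parameters relative to the tied head, which is what grows artifact size.
    """
    rank = len(A[0])
    z = [sum(hi * A[i][r] for i, hi in enumerate(h)) for r in range(rank)]
    vocab = len(B[0])
    return [sum(zr * B[r][v] for r, zr in enumerate(z)) for v in range(vocab)]
```

A mixture-softmax head goes further, mixing several such projections through component softmaxes, which is consistent with the observation that those variants grew the artifact without improving val_bpb.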