PR #1108
nGPT on the Hypersphere: Making Normalized Transformers Work at 16 MB (Research)
by DbBestedView on GitHub
val_bpb: 1.1502
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash
Bigram-hash input representation with an 8192-entry vocabulary.
parameters: {"vocab":8192}
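A minimal sketch of the bigram-hash idea: each position is represented by a hash of its (previous, current) token pair, reduced into a fixed 8192-entry table. The mixing constants and the choice of 0 as the start-of-sequence predecessor are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical bigram-hash input representation; constants are illustrative.
BIGRAM_VOCAB = 8192  # matches the listed vocab parameter

def bigram_hash(prev_id: int, cur_id: int) -> int:
    # Mix the token pair with a multiplicative hash, then reduce mod table size.
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return h % BIGRAM_VOCAB

ids = [5, 17, 42]
# Pair each token with its predecessor (assumed 0 at the sequence start).
buckets = [bigram_hash(p, c) for p, c in zip([0] + ids, ids)]
```

Each bucket indexes a learned embedding, so the model sees bigram context at the input layer for the cost of a single 8192-row table.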
GQA
Grouped query attention with reduced KV heads.
parameters: {"heads":8,"kv_heads":4}
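The head counts above imply each KV head serves a group of two query heads. A sketch of that sharing pattern, with illustrative tensor sizes (head dimension and sequence length are assumptions):

```python
import numpy as np

# Grouped-query attention: 8 query heads share 4 KV heads (group size 2).
H_Q, H_KV, T, D = 8, 4, 5, 16
group = H_Q // H_KV

rng = np.random.default_rng(0)
q = rng.standard_normal((H_Q, T, D))
k = rng.standard_normal((H_KV, T, D))
v = rng.standard_normal((H_KV, T, D))

# Repeat each KV head across its query-head group, then run standard attention.
k_rep = np.repeat(k, group, axis=0)  # (H_Q, T, D)
v_rep = np.repeat(v, group, axis=0)
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(D)
w = np.exp(scores - scores.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)  # softmax over keys
out = w @ v_rep  # (H_Q, T, D)
```

Halving the KV heads halves the KV cache and the K/V projection weights, which matters at a 15.9 MB artifact budget.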
MLP3x
3x expansion MLP with LeakyReLU² activation.
parameters: {"expansion":3}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
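A combined sketch of the two entries above: a 3x-expansion MLP using a squared LeakyReLU with negative slope 0.5. The listing does not say whether the square preserves sign; this sketch squares the leaky output directly, which is an assumption.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring (sign handling is an assumption).
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w_in, w_out):
    # w_in: (d, 3d) expansion, w_out: (3d, d) projection back down.
    return leaky_relu_sq(x @ w_in) @ w_out

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
out = mlp3x(x, rng.standard_normal((d, 3 * d)), rng.standard_normal((3 * d, d)))
```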
U-Net skip connections
U-Net style skip connections in the block stack.
parameters: null
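One common shape for U-Net-style skips in a transformer stack, sketched below: activations from the first half of the layers are saved and added to the inputs of the mirrored layers in the second half. Whether the PR adds raw activations or learned mixes is not specified; this is the plain additive version.

```python
# U-Net-style skips over a block stack (additive form is an assumption).
def run_stack(x, layers):
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n // 2:
            x = x + saved.pop()  # add the mirrored early activation (LIFO)
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x
```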
XSA
Extra attention applied in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to a subset of head dimensions.
parameters: {"dimensions":16}
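A sketch of partial RoPE matching the listed parameter: only the first 16 dimensions of each head are rotated, the rest pass through unchanged. The pairing convention (dim i with dim i + 8) and the base of 10000 are assumptions.

```python
import numpy as np

ROT_DIMS = 16  # matches the listed "dimensions" parameter

def partial_rope(x, pos, base=10000.0):
    # Rotate only the first ROT_DIMS dims of a single head vector at `pos`.
    x = x.copy()
    half = ROT_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half].copy(), x[half:ROT_DIMS].copy()
    x[:half] = x1 * cos - x2 * sin
    x[half:ROT_DIMS] = x1 * sin + x2 * cos
    return x  # dims >= ROT_DIMS are untouched

head = np.arange(32, dtype=float)
rotated = partial_rope(head, pos=3)
```

Because rotation is norm-preserving, partial RoPE composes cleanly with nGPT's unit-norm constraints on the non-rotated dimensions as well.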
Quantization
int6
bits: 6
scope: all
QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
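A minimal sketch of the int6 number format underlying all three entries above: symmetric round-to-nearest with per-tensor scale and levels in [-31, 31]. The actual pipeline layers QAT and GPTQ on top of this, and the scaling granularity is an assumption.

```python
import numpy as np

QMAX = 31  # 6-bit signed, symmetric range

def quantize_int6(w):
    # Per-tensor symmetric scale (granularity is an assumption).
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(64).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the per-weight error by half a quantization step; QAT and GPTQ then reduce the loss impact of those errors rather than the errors themselves.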
Regularization
magnitude pruning
parameters: {"sparsity":0.078}
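Unstructured magnitude pruning at the listed 7.8% sparsity can be sketched as zeroing the smallest-magnitude weights; whether the PR prunes globally or per-tensor is not specified, and this sketch does it per-tensor.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.078):
    # Zero out the smallest-magnitude fraction of weights.
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.random.default_rng(0).standard_normal(1000)
pruned = magnitude_prune(w)
```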
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"embed_lr":0.035,"scalar_lr":0.025}
Weight Averaging
SWA
parameters: {"start":"last ~10% of warmdown"}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
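A sketch of sliding-window evaluation with stride 64: after the first window, each window advances by the stride and scores only its newest tokens, so every token after the first window is scored with near-full left context. The context length is an assumption here.

```python
def eval_spans(n_tokens, ctx=2048, stride=64):
    # Return (context_start, score_start, score_end) triples tiling the text.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens) if spans else min(ctx, n_tokens)
        win_start = max(0, end - ctx)  # window never exceeds ctx tokens
        spans.append((win_start, pos, end))
        pos = end
    return spans

spans = eval_spans(10, ctx=4, stride=2)
```

The scored ranges partition the token stream exactly once, so the reported bpb is not inflated by double-counting.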
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":3500}
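The listed schedule reads as linear warmup for 20 steps, a constant plateau, then a 3500-step linear warmdown. A sketch of the multiplier, with the total step count assumed for illustration:

```python
def lr_mult(step, total_steps=5000, warmup=20, warmdown=3500):
    # total_steps is an assumed value; warmup/warmdown match the listing.
    if step < warmup:
        return (step + 1) / warmup          # linear warmup
    if step < total_steps - warmdown:
        return 1.0                          # constant plateau
    return max(0.0, (total_steps - step) / warmdown)  # linear warmdown
```

The SWA entry above starts averaging in the last ~10% of this warmdown segment, where the multiplier is already small.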
Initialization
resid mix
Modified residual mixing / interpolation behavior in nGPT; paper-faithful signed alpha was tested and found worse.
Other
other
Opaque custom autograd normalize function wrapped with allow_in_graph to prevent torch.compile precision compounding and graph breaks.
parameters: null
other
Post-dequantization renormalization to project int6-dequantized weights back onto the hypersphere.
parameters: null
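Since nGPT keeps weight rows on the unit hypersphere, int6 dequantization leaves rows slightly off unit norm; the renormalization above projects them back. A sketch, with row-wise (last-axis) normalization as the assumed convention:

```python
import numpy as np

def renormalize(w_dequant, eps=1e-8):
    # Project each row of the dequantized matrix back onto the unit sphere.
    norms = np.linalg.norm(w_dequant, axis=-1, keepdims=True)
    return w_dequant / np.maximum(norms, eps)

w = np.random.default_rng(0).standard_normal((4, 8))
w_unit = renormalize(w)
```

This costs nothing at inference-load time and removes the norm component of quantization error entirely, leaving only directional error.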
other
Stochastic RYS / layer repetition during training to encourage refinable representations.
parameters: {"method":"SRYS"}
Test-Time Training
full TTT
parameters: {"learning_rate_range":[0.00005,0.002]}
Novel Contributions
- Made full nGPT trainable by fixing three interacting bugs that previously caused catastrophic underperformance.
- Identified and fixed a torch.compile precision compounding bug using an opaque custom autograd function with allow_in_graph.
- Introduced post-dequantization renormalization, dramatically reducing the int6 quantization gap for unit-norm weights.
- Mapped the nGPT design space with a broad ablation study across architecture, quantization, and training choices.
- Showed that structured-weight compression advantages can disappear at full training length.
- Demonstrated stronger stochastic RYS effects on the hypersphere due to geometric constraints preventing identity collapse.