PR #1108
nGPT on the Hypersphere: Making Normalized Transformers Work at 16 MB (Research)
by DbBestedView on GitHub
val_bpb: 1.1502
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash
Bigram-hash input representation with an 8192-entry vocabulary.
parameters: {"vocab":8192}
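A minimal sketch of the bigram-hash idea: each position is represented by a hash of its (previous, current) token pair, reduced into a fixed 8192-entry table. The mixing constants and the choice of 0 as the start-of-sequence predecessor are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical bigram-hash input representation; constants are illustrative.
BIGRAM_VOCAB = 8192  # matches the listed vocab parameter

def bigram_hash(prev_id: int, cur_id: int) -> int:
    # Mix the token pair with a multiplicative hash, then reduce mod table size.
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return h % BIGRAM_VOCAB

ids = [5, 17, 42]
# Pair each token with its predecessor (assumed 0 at the sequence start).
buckets = [bigram_hash(p, c) for p, c in zip([0] + ids, ids)]
```

Each bucket indexes a learned embedding, so the model sees bigram context at the input layer for the cost of a single 8192-row table.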
GQA
Grouped query attention with reduced KV heads.
parameters: {"heads":8,"kv_heads":4}
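The head counts above imply each KV head serves a group of two query heads. A sketch of that sharing pattern, with illustrative tensor sizes (head dimension and sequence length are assumptions):

```python
import numpy as np

# Grouped-query attention: 8 query heads share 4 KV heads (group size 2).
H_Q, H_KV, T, D = 8, 4, 5, 16
group = H_Q // H_KV

rng = np.random.default_rng(0)
q = rng.standard_normal((H_Q, T, D))
k = rng.standard_normal((H_KV, T, D))
v = rng.standard_normal((H_KV, T, D))

# Repeat each KV head across its query-head group, then run standard attention.
k_rep = np.repeat(k, group, axis=0)  # (H_Q, T, D)
v_rep = np.repeat(v, group, axis=0)
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(D)
w = np.exp(scores - scores.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)  # softmax over keys
out = w @ v_rep  # (H_Q, T, D)
```

Halving the KV heads halves the KV cache and the K/V projection weights, which matters at a 15.9 MB artifact budget.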
MLP3x
3x expansion MLP with LeakyReLU² activation.
parameters: {"expansion":3}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
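A combined sketch of the two entries above: a 3x-expansion MLP using a squared LeakyReLU with negative slope 0.5. The listing does not say whether the square preserves sign; this sketch squares the leaky output directly, which is an assumption.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring (sign handling is an assumption).
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w_in, w_out):
    # w_in: (d, 3d) expansion, w_out: (3d, d) projection back down.
    return leaky_relu_sq(x @ w_in) @ w_out

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
out = mlp3x(x, rng.standard_normal((d, 3 * d)), rng.standard_normal((3 * d, d)))
```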
U-Net skip connections
U-Net style skip connections in the block stack.
parameters: null
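One common shape for U-Net-style skips in a transformer stack, sketched below: activations from the first half of the layers are saved and added to the inputs of the mirrored layers in the second half. Whether the PR adds raw activations or learned mixes is not specified; this is the plain additive version.

```python
# U-Net-style skips over a block stack (additive form is an assumption).
def run_stack(x, layers):
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n // 2:
            x = x + saved.pop()  # add the mirrored early activation (LIFO)
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x
```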
XSA
Extra attention applied in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to a subset of head dimensions.
parameters: {"dimensions":16}
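A sketch of partial RoPE matching the listed parameter: only the first 16 dimensions of each head are rotated, the rest pass through unchanged. The pairing convention (dim i with dim i + 8) and the base of 10000 are assumptions.

```python
import numpy as np

ROT_DIMS = 16  # matches the listed "dimensions" parameter

def partial_rope(x, pos, base=10000.0):
    # Rotate only the first ROT_DIMS dims of a single head vector at `pos`.
    x = x.copy()
    half = ROT_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half].copy(), x[half:ROT_DIMS].copy()
    x[:half] = x1 * cos - x2 * sin
    x[half:ROT_DIMS] = x1 * sin + x2 * cos
    return x  # dims >= ROT_DIMS are untouched

head = np.arange(32, dtype=float)
rotated = partial_rope(head, pos=3)
```

Because rotation is norm-preserving, partial RoPE composes cleanly with nGPT's unit-norm constraints on the non-rotated dimensions as well.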
Quantization
int6
bits: 6
scope: all
QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
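A minimal sketch of the int6 number format underlying all three entries above: symmetric round-to-nearest with per-tensor scale and levels in [-31, 31]. The actual pipeline layers QAT and GPTQ on top of this, and the scaling granularity is an assumption.

```python
import numpy as np

QMAX = 31  # 6-bit signed, symmetric range

def quantize_int6(w):
    # Per-tensor symmetric scale (granularity is an assumption).
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(64).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the per-weight error by half a quantization step; QAT and GPTQ then reduce the loss impact of those errors rather than the errors themselves.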
Regularization
magnitude pruning
parameters: {"sparsity":0.078}
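Unstructured magnitude pruning at the listed 7.8% sparsity can be sketched as zeroing the smallest-magnitude weights; whether the PR prunes globally or per-tensor is not specified, and this sketch does it per-tensor.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.078):
    # Zero out the smallest-magnitude fraction of weights.
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.random.default_rng(0).standard_normal(1000)
pruned = magnitude_prune(w)
```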
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"embed_lr":0.035,"scalar_lr":0.025}
Weight Averaging
SWA
parameters: {"start":"last ~10% of warmdown"}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
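A sketch of sliding-window evaluation with stride 64: after the first window, each window advances by the stride and scores only its newest tokens, so every token after the first window is scored with near-full left context. The context length is an assumption here.

```python
def eval_spans(n_tokens, ctx=2048, stride=64):
    # Return (context_start, score_start, score_end) triples tiling the text.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens) if spans else min(ctx, n_tokens)
        win_start = max(0, end - ctx)  # window never exceeds ctx tokens
        spans.append((win_start, pos, end))
        pos = end
    return spans

spans = eval_spans(10, ctx=4, stride=2)
```

The scored ranges partition the token stream exactly once, so the reported bpb is not inflated by double-counting.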
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":3500}
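The listed schedule reads as linear warmup for 20 steps, a constant plateau, then a 3500-step linear warmdown. A sketch of the multiplier, with the total step count assumed for illustration:

```python
def lr_mult(step, total_steps=5000, warmup=20, warmdown=3500):
    # total_steps is an assumed value; warmup/warmdown match the listing.
    if step < warmup:
        return (step + 1) / warmup          # linear warmup
    if step < total_steps - warmdown:
        return 1.0                          # constant plateau
    return max(0.0, (total_steps - step) / warmdown)  # linear warmdown
```

The SWA entry above starts averaging in the last ~10% of this warmdown segment, where the multiplier is already small.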
Initialization
resid mix
Modified residual mixing / interpolation behavior in nGPT; paper-faithful signed alpha was tested and found worse.
Other
other
Opaque custom autograd normalize function wrapped with allow_in_graph to prevent torch.compile precision compounding and graph breaks.
parameters: null
other
Post-dequantization renormalization to project int6-dequantized weights back onto the hypersphere.
parameters: null
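Since nGPT keeps weight rows on the unit hypersphere, int6 dequantization leaves rows slightly off unit norm; the renormalization above projects them back. A sketch, with row-wise (last-axis) normalization as the assumed convention:

```python
import numpy as np

def renormalize(w_dequant, eps=1e-8):
    # Project each row of the dequantized matrix back onto the unit sphere.
    norms = np.linalg.norm(w_dequant, axis=-1, keepdims=True)
    return w_dequant / np.maximum(norms, eps)

w = np.random.default_rng(0).standard_normal((4, 8))
w_unit = renormalize(w)
```

This costs nothing at inference-load time and removes the norm component of quantization error entirely, leaving only directional error.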
other
Stochastic RYS / layer repetition during training to encourage refinable representations.
parameters: {"method":"SRYS"}
Test-Time Training
full TTT
parameters: {"learning_rate_range":[0.00005,0.002]}
Novel Contributions
- Made full nGPT trainable by fixing three interacting bugs that previously caused catastrophic underperformance.
- Identified and fixed a torch.compile precision compounding bug using an opaque custom autograd function with allow_in_graph.
- Introduced post-dequantization renormalization, dramatically reducing the int6 quantization gap for unit-norm weights.
- Mapped the nGPT design space with a broad ablation study across architecture, quantization, and training choices.
- Showed that structured-weight compression advantages can disappear at full training length.
- Demonstrated stronger stochastic RYS effects on the hypersphere due to geometric constraints preventing identity collapse.