PR #1957

open

record submission: nGPT

by mhlov000111
val_bpb
1.2313
Architecture
Transformer
Optimizer
Artifact Size
15987141 bytes

Training Techniques

Architecture
weight tying
Token embeddings and output embeddings are tied; embeddings are unit-norm and logits use a learned per-vocab scalar.
parameters: null
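A minimal numpy sketch of the tied unit-norm embedding scheme described above: one table serves as both input embeddings and output projection, rows are kept unit-norm, and logits get a learned per-vocab scalar. The shapes and the unit initialization of the scale are illustrative assumptions, not values from the submission.

```python
import numpy as np

def unit_norm_rows(w, eps=1e-8):
    # Project each row of a matrix onto the unit hypersphere.
    return w / (np.linalg.norm(w, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
vocab, dim = 16, 8

emb = unit_norm_rows(rng.standard_normal((vocab, dim)))  # tied input/output table
s_z = np.ones(vocab)  # learned per-vocab logit scale (unit init is an assumption)

h = unit_norm_rows(rng.standard_normal((1, dim)))  # unit-norm final hidden state
logits = (h @ emb.T) * s_z  # cosine similarities, rescaled per vocab entry
```

With both sides unit-norm, raw logits are cosine similarities in [-1, 1]; the per-vocab scale restores the dynamic range a softmax needs.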
RoPE
Attention applies rotary position embeddings (RoPE) to queries and keys before per-head normalization.
parameters: null
Gated Attention
Per-head Q/K vectors are L2-normalized after RoPE and rescaled by learned scalars before attention.
parameters: null
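The per-head Q/K treatment above can be sketched as: L2-normalize each head's query and key vectors (post-RoPE), then multiply by a learned scalar before the dot-product attention. The sqrt(d_head) initialization of the rescale and the head/sequence sizes are assumptions for illustration.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # Normalize the last axis to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
heads, seq, d_head = 4, 6, 8
q = rng.standard_normal((heads, seq, d_head))  # queries, assumed post-RoPE
k = rng.standard_normal((heads, seq, d_head))  # keys, assumed post-RoPE
s_qk = np.full((heads, 1, 1), np.sqrt(d_head))  # learned rescale; init is an assumption

q, k = l2norm(q) * s_qk, l2norm(k) * s_qk
scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_head)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)  # row-stochastic attention weights
```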
SwiGLU
MLP uses SwiGLU with learned per-channel reparameterization scalars on hidden activations.
parameters: null
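A hedged sketch of the SwiGLU MLP with per-channel reparameterization scalars: the gate and value projections each get a learned per-hidden-channel scale before the SiLU gating. Dimensions, weight shapes, and the unit scale initialization are illustrative assumptions.

```python
import numpy as np

def swiglu_block(x, w_gate, w_up, w_down, s_u, s_v):
    # SwiGLU: silu(gate) * value, with learned per-channel rescaling on both paths.
    u = (x @ w_gate) * s_u           # gate path, per-channel scale
    v = (x @ w_up) * s_v             # value path, per-channel scale
    silu = u / (1.0 + np.exp(-u))    # SiLU activation
    return (silu * v) @ w_down       # down-projection back to model dim

rng = np.random.default_rng(0)
dim, hidden = 8, 32
x = rng.standard_normal((2, dim))
out = swiglu_block(x,
                   rng.standard_normal((dim, hidden)),
                   rng.standard_normal((dim, hidden)),
                   rng.standard_normal((hidden, dim)),
                   np.ones(hidden), np.ones(hidden))
```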
other
Normalized GPT: hidden states and weight matrix rows are constrained to the unit hypersphere, with spherical interpolation-style residual updates.
parameters: null
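The spherical-interpolation-style residual update above can be sketched as: linearly interpolate the hidden state toward the normalized block output by a learned factor, then re-project onto the unit hypersphere. The scalar alpha here is an illustrative stand-in for the learned per-dimension gating.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # Normalize the last axis to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def ngpt_residual(h, block_out, alpha):
    # LERP toward the normalized block output, then re-project onto the
    # unit hypersphere -- a SLERP-like residual update.
    return l2norm(h + alpha * (l2norm(block_out) - h))

rng = np.random.default_rng(0)
h = l2norm(rng.standard_normal((2, 8)))          # unit-norm hidden states
out = ngpt_residual(h, rng.standard_normal((2, 8)), alpha=0.05)
```

Because every update ends with a re-projection, hidden states stay on the hypersphere through arbitrarily many layers.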
Regularization
logit softcap
parameters: {"activation":"tanh"}
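The tanh softcap listed above smoothly bounds logits to (-cap, cap) while staying near-identity for small values. The cap value 30.0 below is an assumption for illustration, not taken from the submission.

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap); the cap value is an assumption.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = softcap(x)
```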
Quantization
int8
bits: 8
scope: serialized model
Compression
zlib
level: null
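A minimal sketch of how such an artifact could be produced: symmetric per-tensor int8 quantization of each weight matrix, then zlib on the concatenated bytes. The scale choice and packing layout are assumptions; `level: null` above suggests zlib's default compression level.

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization; the scale choice is an assumption.
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(weights):
    # Concatenate int8 buffers and zlib-compress at the default level.
    blobs = b"".join(quantize_int8(w)[0].tobytes() for w in weights)
    return zlib.compress(blobs)

rng = np.random.default_rng(0)
art = pack_artifact([rng.standard_normal((4, 4)) for _ in range(3)])
```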

Novel Contributions

  • Normalized GPT with all hidden states constrained to the unit hypersphere
  • Unit-norm rows for all weight matrices with re-projection after each optimizer step
  • Spherical interpolation residual updates using learned per-dimension gating scalars
  • Per-head Q/K normalization with learned rescaling in attention
  • SwiGLU MLP with learned per-channel activation reparameterization
  • Tied unit-norm embeddings with learned per-vocab logit scaling and tanh softcap
  • Int8 quantization plus zlib compression of the serialized submission artifact
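The re-projection step in the second bullet can be sketched as: after each optimizer update, renormalize every weight-matrix row back onto the unit hypersphere in place. The parameter shapes below are illustrative assumptions.

```python
import numpy as np

def reproject_rows_(params, eps=1e-8):
    # After each optimizer step, snap every weight-matrix row
    # back onto the unit hypersphere (in place).
    for w in params:
        w /= np.linalg.norm(w, axis=-1, keepdims=True) + eps

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 8)), rng.standard_normal((3, 8))]
reproject_rows_(params)  # e.g. called right after optimizer.step()
```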