val_bpb: 1.2313
Architecture: Transformer
Optimizer: —
Artifact Size: 15987141 bytes
Training Techniques
Architecture
weight tying
Token embeddings and output embeddings are tied; embeddings are unit-norm and logits use a learned per-vocab scalar.
parameters: null
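A minimal sketch of how tied unit-norm embeddings with a learned per-vocab logit scalar could look in PyTorch; the module and attribute names (`TiedEmbedding`, `logit_scale`) are illustrative assumptions, not the submission's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedEmbedding(nn.Module):
    """Input embedding and output projection share one unit-norm weight matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, d_model))
        # One learned scalar per vocabulary entry, applied to the logits.
        self.logit_scale = nn.Parameter(torch.ones(vocab_size))

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Rows are kept unit-norm, so lookups return unit vectors.
        w = F.normalize(self.weight, dim=-1)
        return w[token_ids]

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # The output projection reuses the normalized embedding matrix,
        # then each vocab entry gets its own learned rescaling.
        w = F.normalize(self.weight, dim=-1)
        return (hidden @ w.t()) * self.logit_scale
```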
RoPE
Attention uses RoPE with per-head Q/K normalization and learned rescaling.
parameters: null
Gated Attention
Per-head Q/K vectors are L2-normalized after RoPE and rescaled by learned scalars before attention.
parameters: null
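A hedged sketch of the attention path described by the two entries above: RoPE on Q/K, per-head L2 normalization after the rotation, then learned scalar rescaling before standard attention. The rotation convention, tensor shapes, and scalar shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate channel pairs (half-split convention).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def qk_norm_attention(q, k, v, cos, sin, q_scale, k_scale):
    """q, k, v: (batch, heads, seq, head_dim); q_scale/k_scale: learned (heads, 1, 1)."""
    q = apply_rope(q, cos, sin)
    k = apply_rope(k, cos, sin)
    # Per-head L2 normalization after RoPE, rescaled by learned scalars ...
    q = F.normalize(q, dim=-1) * q_scale
    k = F.normalize(k, dim=-1) * k_scale
    # ... then ordinary scaled dot-product attention.
    return F.scaled_dot_product_attention(q, k, v)
```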
SwiGLU
MLP uses SwiGLU with learned per-channel reparameterization scalars on hidden activations.
parameters: null
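A minimal sketch of a SwiGLU MLP with learned per-channel scalars on the hidden activations, in the spirit of the entry above; the scalar placement and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)
        # Learned per-channel reparameterization of the hidden activations.
        self.hidden_scale = nn.Parameter(torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = F.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(hidden * self.hidden_scale)
```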
other
Normalized GPT: hidden states and weight matrix rows are constrained to the unit hypersphere, with spherical interpolation-style residual updates.
parameters: null
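The residual rule could look like the sketch below: the hidden state is moved toward the sublayer output with learned per-dimension gates and re-projected onto the unit sphere (a lerp-then-normalize approximation of spherical interpolation). Names, the gate initialization, and the exact update form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphericalResidual(nn.Module):
    """Residual update that keeps the hidden state on the unit hypersphere."""

    def __init__(self, d_model: int, init_gate: float = 0.05):
        super().__init__()
        # Learned per-dimension gating scalars controlling the interpolation step.
        self.gate = nn.Parameter(torch.full((d_model,), init_gate))

    def forward(self, hidden: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed unit-norm along the feature dimension.
        sublayer_out = F.normalize(sublayer_out, dim=-1)
        # Step toward the sublayer output, then re-project to the sphere
        # (a linear step plus normalization, approximating SLERP).
        mixed = hidden + self.gate * (sublayer_out - hidden)
        return F.normalize(mixed, dim=-1)
```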
Regularization
logit softcap
parameters: {"activation":"tanh"}
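A short sketch of a tanh logit softcap; the cap value (30.0 here) is an assumption, since the listed parameters only name the activation.

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap) with a tanh.
    return cap * torch.tanh(logits / cap)
```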
Quantization
int8
bits: 8
scope: serialized model
Compression
zlib
level: null
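A hedged sketch of how the int8-plus-zlib artifact could be produced: per-tensor symmetric int8 quantization of the state dict, followed by zlib compression of the serialized bytes. The packing format, helper names, and use of zlib's default level are assumptions (the quantization scheme and compression level are not specified above).

```python
import io
import zlib
import torch

def quantize_state_dict(state_dict):
    """Per-tensor symmetric int8 quantization; the scale is stored alongside each tensor."""
    packed = {}
    for name, tensor in state_dict.items():
        t = tensor.float()
        scale = t.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
        packed[name] = {"q": q, "scale": scale}
    return packed

def save_artifact(state_dict, path):
    buffer = io.BytesIO()
    torch.save(quantize_state_dict(state_dict), buffer)
    with open(path, "wb") as f:
        # zlib default compression level, since no level is specified.
        f.write(zlib.compress(buffer.getvalue()))
```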
Novel Contributions
- Normalized GPT with all hidden states constrained to the unit hypersphere
- Unit-norm rows for all weight matrices with re-projection after each optimizer step (see the sketch after this list)
- Spherical interpolation residual updates using learned per-dimension gating scalars
- Per-head Q/K normalization with learned rescaling in attention
- SwiGLU MLP with learned per-channel activation reparameterization
- Tied unit-norm embeddings with learned per-vocab logit scaling and tanh softcap
- Int8 plus zlib compressed submission artifact
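The per-step weight re-projection mentioned above could be as simple as the sketch below, applied after `optimizer.step()`. Row-wise unit normalization of weight matrices is the stated constraint; the loop structure and the choice to touch only 2-D parameters are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reproject_weights(model: torch.nn.Module) -> None:
    """Re-normalize every 2-D weight matrix row-wise after an optimizer step."""
    for param in model.parameters():
        if param.dim() == 2:
            param.copy_(F.normalize(param, dim=-1))

# Usage inside the training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   reproject_weights(model)
```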