val_bpb: 1.2313
Architecture: Transformer
Optimizer: —
Artifact Size: 15987141 bytes
Training Techniques
Architecture
weight tying
Token embeddings and output embeddings are tied; embeddings are unit-norm and logits use a learned per-vocab scalar.
parameters: null
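A minimal sketch of how tied unit-norm embeddings with a learned per-vocab logit scalar could look in PyTorch; the module and attribute names (`TiedEmbedding`, `logit_scale`) are illustrative assumptions, not the submission's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedEmbedding(nn.Module):
    """Input embedding and output projection share one unit-norm weight matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, d_model))
        # One learned scalar per vocabulary entry, applied to the logits.
        self.logit_scale = nn.Parameter(torch.ones(vocab_size))

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Rows are kept unit-norm, so lookups return unit vectors.
        w = F.normalize(self.weight, dim=-1)
        return w[token_ids]

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # The output projection reuses the normalized embedding matrix,
        # then each vocab entry gets its own learned rescaling.
        w = F.normalize(self.weight, dim=-1)
        return (hidden @ w.t()) * self.logit_scale
```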
RoPE
Attention uses RoPE with per-head Q/K normalization and learned rescaling.
parameters: null
Gated Attention
Per-head Q/K vectors are L2-normalized after RoPE and rescaled by learned scalars before attention.
parameters: null
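A hedged sketch of the attention path described by the two entries above: RoPE on Q/K, per-head L2 normalization after the rotation, then learned scalar rescaling before standard attention. The rotation convention, tensor shapes, and scalar shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate channel pairs (half-split convention).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def qk_norm_attention(q, k, v, cos, sin, q_scale, k_scale):
    """q, k, v: (batch, heads, seq, head_dim); q_scale/k_scale: learned (heads, 1, 1)."""
    q = apply_rope(q, cos, sin)
    k = apply_rope(k, cos, sin)
    # Per-head L2 normalization after RoPE, rescaled by learned scalars ...
    q = F.normalize(q, dim=-1) * q_scale
    k = F.normalize(k, dim=-1) * k_scale
    # ... then ordinary scaled dot-product attention.
    return F.scaled_dot_product_attention(q, k, v)
```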
SwiGLU
MLP uses SwiGLU with learned per-channel reparameterization scalars on hidden activations.
parameters: null
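A minimal sketch of a SwiGLU MLP with learned per-channel scalars on the hidden activations, in the spirit of the entry above; the scalar placement and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)
        # Learned per-channel reparameterization of the hidden activations.
        self.hidden_scale = nn.Parameter(torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = F.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(hidden * self.hidden_scale)
```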
other
Normalized GPT: hidden states and weight matrix rows are constrained to the unit hypersphere, with spherical interpolation-style residual updates.
parameters: null
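The residual rule could look like the sketch below: the hidden state is moved toward the sublayer output with learned per-dimension gates and re-projected onto the unit sphere (a lerp-then-normalize approximation of spherical interpolation). Names, the gate initialization, and the exact update form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphericalResidual(nn.Module):
    """Residual update that keeps the hidden state on the unit hypersphere."""

    def __init__(self, d_model: int, init_gate: float = 0.05):
        super().__init__()
        # Learned per-dimension gating scalars controlling the interpolation step.
        self.gate = nn.Parameter(torch.full((d_model,), init_gate))

    def forward(self, hidden: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed unit-norm along the feature dimension.
        sublayer_out = F.normalize(sublayer_out, dim=-1)
        # Step toward the sublayer output, then re-project to the sphere
        # (a linear step plus normalization, approximating SLERP).
        mixed = hidden + self.gate * (sublayer_out - hidden)
        return F.normalize(mixed, dim=-1)
```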
Regularization
logit softcap
parameters: {"activation":"tanh"}
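A short sketch of a tanh logit softcap; the cap value (30.0 here) is an assumption, since the listed parameters only name the activation.

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap) with a tanh.
    return cap * torch.tanh(logits / cap)
```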
Quantization
int8
bits: 8
scope: serialized model
Compression
zlib
level: null
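A hedged sketch of how the int8-plus-zlib artifact could be produced: per-tensor symmetric int8 quantization of the state dict, followed by zlib compression of the serialized bytes. The packing format, helper names, and use of zlib's default level are assumptions (the quantization scheme and compression level are not specified above).

```python
import io
import zlib
import torch

def quantize_state_dict(state_dict):
    """Per-tensor symmetric int8 quantization; the scale is stored alongside each tensor."""
    packed = {}
    for name, tensor in state_dict.items():
        t = tensor.float()
        scale = t.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
        packed[name] = {"q": q, "scale": scale}
    return packed

def save_artifact(state_dict, path):
    buffer = io.BytesIO()
    torch.save(quantize_state_dict(state_dict), buffer)
    with open(path, "wb") as f:
        # zlib default compression level, since no level is specified.
        f.write(zlib.compress(buffer.getvalue()))
```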
Novel Contributions
- Normalized GPT with all hidden states constrained to the unit hypersphere
- Unit-norm rows for all weight matrices with re-projection after each optimizer step (see the sketch after this list)
- Spherical interpolation residual updates using learned per-dimension gating scalars
- Per-head Q/K normalization with learned rescaling in attention
- SwiGLU MLP with learned per-channel activation reparameterization
- Tied unit-norm embeddings with learned per-vocab logit scaling and tanh softcap
- Int8 plus zlib compressed submission artifact
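The per-step weight re-projection mentioned above could be as simple as the sketch below, applied after `optimizer.step()`. Row-wise unit normalization of weight matrices is the stated constraint; the loop structure and the choice to touch only 2-D parameters are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reproject_weights(model: torch.nn.Module) -> None:
    """Re-normalize every 2-D weight matrix row-wise after an optimizer step."""
    for param in model.parameters():
        if param.dim() == 2:
            param.copy_(F.normalize(param, dim=-1))

# Usage inside the training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   reproject_weights(model)
```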