PR #584 (closed)

Five novel architecture ablations on a SOTA baseline

val_bpb: 1.1233
Architecture: Transformer
Optimizer:
Artifact size: 16 MB

Training Techniques

Weight Averaging
EMA
parameters: null
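The EMA weight-averaging step can be sketched as follows. This is a minimal NumPy illustration; the function name and decay value are assumptions, not taken from the PR's scripts:

```python
import numpy as np

def ema_update(avg_params, params, decay=0.999):
    """One EMA step per tensor: avg <- decay * avg + (1 - decay) * current.

    avg_params / params: dicts mapping parameter names to arrays.
    """
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

In training loops the averaged copy is updated after each optimizer step and used only for evaluation.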
Quantization
GPTQ-lite
bits: null
scope: model weights
Architecture
SwiGLU
Replaces the relu² MLP with a SwiGLU MLP.
parameters: null
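A minimal NumPy sketch of a SwiGLU MLP of the kind that replaces the relu² MLP; weight names and shapes are illustrative, not the PR's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down-project silu(x @ w_gate) * (x @ w_up).

    Compare with the relu² MLP, which computes relu(x @ w_up)**2 @ w_down
    with no gate branch.
    """
    gate = x @ w_gate
    return (gate * sigmoid(gate) * (x @ w_up)) @ w_down
```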
register tokens
Adds learnable register/sink tokens to absorb attention sinks.
parameters: {"num_registers":4}
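A sketch of how learnable register tokens can be prepended to the input sequence before attention, so they are always available as attention sinks (NumPy; function name is illustrative):

```python
import numpy as np

def prepend_registers(x, registers):
    """Prepend learnable register tokens to every sequence in the batch.

    x: (batch, seq, d) token embeddings.
    registers: (num_registers, d) learned parameters, shared across the batch.
    Returns: (batch, num_registers + seq, d).
    """
    b = x.shape[0]
    reg = np.broadcast_to(registers, (b,) + registers.shape)
    return np.concatenate([reg, x], axis=1)
```

With parameters `{"num_registers": 4}`, four such tokens are added; their outputs are typically discarded before the loss.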
gated V-norm
Applies a learned RMS normalization to the values, so V is normalized alongside the already-normalized Q and K.
parameters: null
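One plausible form of the gated V-norm, assuming a per-channel learned gain applied to RMS-normalized values; the exact gating used in the PR is not specified here:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the last (channel) dimension."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gated_v_norm(v, gain):
    """Normalize attention values, then rescale with a learned per-channel gain.

    v: (..., d) value vectors; gain: (d,) learned parameter (assumption).
    """
    return rms_norm(v) * gain
```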
mixture of softmax
Uses a mixture of softmax heads/experts to break the softmax rank bottleneck.
parameters: {"num_experts":2}
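Mixture of softmax forms a convex combination of per-expert softmax distributions, which can have higher rank than any single softmax. A minimal NumPy sketch with assumed shapes; the mixing weights are shown as global for simplicity, whereas in practice they are typically computed per token:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmax(logits, mix_logits):
    """Convex combination of per-expert softmax distributions.

    logits: (num_experts, n, vocab) per-expert logits.
    mix_logits: (num_experts,) logits for the mixing weights (assumption:
    global weights; real implementations often condition these on the token).
    Returns: (n, vocab) mixed probabilities.
    """
    pi = softmax(mix_logits)            # mixing weights, sum to 1
    probs = softmax(logits, axis=-1)    # per-expert distributions
    return np.einsum('e,env->nv', pi, probs)
```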
Evaluation
sliding window eval
parameters: {"window_size":256,"num_layers":5}
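The sliding-window evaluation restricts each token to a fixed-size causal window; a sketch of the corresponding attention mask (NumPy, illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len, window_size):
    """Boolean mask: token i may attend to tokens in [i - window_size + 1, i].

    True entries are attendable; combine with attention logits by setting
    masked positions to -inf before the softmax.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window_size)
```

With `{"window_size": 256, "num_layers": 5}`, a mask like this would apply to the designated layers while the rest keep full causal attention.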

Novel Contributions

  • Provides five self-contained ablation training scripts built on the current SOTA baseline.
  • Introduces a SwiGLU MLP replacement for the relu² MLP.
  • Adds sliding-window attention to early layers to reduce FLOPs.
  • Adds learnable register/sink tokens to absorb attention sinks.
  • Introduces gated V-norm for values to potentially improve quantization robustness.
  • Explores mixture of softmax to address the softmax bottleneck and improve BPB.