PR #584 (closed)

Five novel architecture ablations on a SOTA baseline

val_bpb: 1.1233
Architecture: Transformer
Optimizer:
Artifact size: 16 MB

Training Techniques

Weight Averaging
EMA
parameters: null
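The EMA weight-averaging step can be sketched as follows. This is a minimal NumPy illustration; the function name and decay value are assumptions, not taken from the PR's scripts:

```python
import numpy as np

def ema_update(avg_params, params, decay=0.999):
    """One EMA step per tensor: avg <- decay * avg + (1 - decay) * current.

    avg_params / params: dicts mapping parameter names to arrays.
    """
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

In training loops the averaged copy is updated after each optimizer step and used only for evaluation.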
Quantization
GPTQ-lite
bits: null
scope: model weights
Architecture
SwiGLU
Replaces the relu² MLP with a SwiGLU MLP.
parameters: null
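A minimal NumPy sketch of a SwiGLU MLP of the kind that replaces the relu² MLP; weight names and shapes are illustrative, not the PR's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down-project silu(x @ w_gate) * (x @ w_up).

    Compare with the relu² MLP, which computes relu(x @ w_up)**2 @ w_down
    with no gate branch.
    """
    gate = x @ w_gate
    return (gate * sigmoid(gate) * (x @ w_up)) @ w_down
```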
register tokens
Adds learnable register/sink tokens to absorb attention sinks.
parameters: {"num_registers":4}
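A sketch of how learnable register tokens can be prepended to the input sequence before attention, so they are always available as attention sinks (NumPy; function name is illustrative):

```python
import numpy as np

def prepend_registers(x, registers):
    """Prepend learnable register tokens to every sequence in the batch.

    x: (batch, seq, d) token embeddings.
    registers: (num_registers, d) learned parameters, shared across the batch.
    Returns: (batch, num_registers + seq, d).
    """
    b = x.shape[0]
    reg = np.broadcast_to(registers, (b,) + registers.shape)
    return np.concatenate([reg, x], axis=1)
```

With parameters `{"num_registers": 4}`, four such tokens are added; their outputs are typically discarded before the loss.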
gated V-norm
Applies a learned RMS normalization to the values, so V is normalized alongside the already-normalized Q and K.
parameters: null
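One plausible form of the gated V-norm, assuming a per-channel learned gain applied to RMS-normalized values; the exact gating used in the PR is not specified here:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the last (channel) dimension."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gated_v_norm(v, gain):
    """Normalize attention values, then rescale with a learned per-channel gain.

    v: (..., d) value vectors; gain: (d,) learned parameter (assumption).
    """
    return rms_norm(v) * gain
```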
mixture of softmax
Uses a mixture of softmax heads/experts to break the softmax rank bottleneck.
parameters: {"num_experts":2}
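Mixture of softmax forms a convex combination of per-expert softmax distributions, which can have higher rank than any single softmax. A minimal NumPy sketch with assumed shapes; the mixing weights are shown as global for simplicity, whereas in practice they are typically computed per token:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmax(logits, mix_logits):
    """Convex combination of per-expert softmax distributions.

    logits: (num_experts, n, vocab) per-expert logits.
    mix_logits: (num_experts,) logits for the mixing weights (assumption:
    global weights; real implementations often condition these on the token).
    Returns: (n, vocab) mixed probabilities.
    """
    pi = softmax(mix_logits)            # mixing weights, sum to 1
    probs = softmax(logits, axis=-1)    # per-expert distributions
    return np.einsum('e,env->nv', pi, probs)
```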
Evaluation
sliding window eval
parameters: {"window_size":256,"num_layers":5}
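The sliding-window evaluation restricts each token to a fixed-size causal window; a sketch of the corresponding attention mask (NumPy, illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len, window_size):
    """Boolean mask: token i may attend to tokens in [i - window_size + 1, i].

    True entries are attendable; combine with attention logits by setting
    masked positions to -inf before the softmax.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window_size)
```

With `{"window_size": 256, "num_layers": 5}`, a mask like this would apply to the designated layers while the rest keep full causal attention.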

Novel Contributions

  • Provides five self-contained ablation training scripts built on the current SOTA baseline.
  • Introduces a SwiGLU MLP replacement for the relu² MLP.
  • Adds sliding-window attention to early layers to reduce FLOPs.
  • Adds learnable register/sink tokens to absorb attention sinks.
  • Introduces gated V-norm for values to potentially improve quantization robustness.
  • Explores mixture of softmax to address the softmax bottleneck and improve BPB.