## Summary

| Field | Value |
| --- | --- |
| val_bpb | 1.1233 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 16 MB |
## Training Techniques

- **Weight Averaging**: EMA (parameters: null)
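The card lists EMA weight averaging but leaves its parameters unspecified. A minimal PyTorch sketch of the technique, for illustration only (the class name and the 0.999 decay are assumptions, not the repo's actual code):

```python
import torch


class EMA:
    """Exponential moving average of model parameters.

    The decay of 0.999 is an assumed default; the card leaves it unspecified.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy of every parameter/buffer, updated after each step.
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(self.decay).add_(v, alpha=1.0 - self.decay)
```

The shadow weights are what would be evaluated or exported; the live model continues training unaffected.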
## Quantization

- **GPTQ-lite** (bits: null, scope: model weights)
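The card names GPTQ-lite but leaves `bits` unspecified. As a generic illustration of post-training weight quantization (not the GPTQ algorithm itself, which additionally uses second-order error compensation), here is a symmetric per-channel round-to-nearest sketch; the function names and the 4-bit default are assumptions:

```python
import torch


def quantize_per_channel(w: torch.Tensor, bits: int = 4):
    """Symmetric per-output-channel quantization of a weight matrix.

    `bits=4` is an assumed default; the card leaves `bits` unspecified.
    Returns int8 codes and one scale per output channel.
    """
    qmax = 2 ** (bits - 1) - 1
    # Scale each row so its largest magnitude maps to qmax.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight matrix from codes and scales."""
    return q.float() * scale
```

Round-to-nearest bounds the per-element error by half the channel's scale, which is what the assertion-style checks on such kernels typically verify.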
## Architecture Modifications

- **SwiGLU**: replaces the relu² MLP with a SwiGLU MLP. (parameters: null)
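A minimal PyTorch sketch of the SwiGLU MLP that replaces the relu² block (layer names and the hidden width are illustrative assumptions, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """SwiGLU MLP: out = W2 (silu(W1 x) * W3 x).

    The hidden width is a free choice; bias-free linears are a common convention.
    """

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Compared with relu²(W1 x), the gated form uses three weight matrices, so hidden width is often shrunk to keep the parameter count comparable.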
- **Register tokens**: adds learnable register/sink tokens to absorb attention sinks. (parameters: {"num_registers": 4})
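A sketch of the register-token idea in PyTorch: a small set of learnable vectors is prepended to the sequence so attention mass can sink there instead of onto real tokens. `num_registers=4` comes from the card; everything else (names, init scale) is an assumption:

```python
import torch
import torch.nn as nn


class RegisterTokens(nn.Module):
    """Prepends learnable register/sink tokens to a token sequence.

    num_registers=4 matches the card's parameters; the 0.02 init scale is assumed.
    """

    def __init__(self, dim: int, num_registers: int = 4):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> (batch, num_registers + seq, dim)
        regs = self.registers.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([regs, x], dim=1)
```

The register positions are sliced off before computing the loss; their only job is to give attention heads a harmless place to dump probability mass.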
- **Gated V-norm**: applies a learned RMS normalization to the attention values, complementing the normalization already applied to Q and K. (parameters: null)
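A minimal sketch of a learned RMS normalization for the value vectors, in PyTorch. The interpretation of the per-channel gain as the "gate" is an assumption, as are the class name and epsilon:

```python
import torch
import torch.nn as nn


class GatedVNorm(nn.Module):
    """Learned RMS normalization applied to attention values.

    Reading the per-channel gain as the 'gate' is an assumption; the card
    gives no parameters for this technique.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # Normalize each value vector to unit RMS, then rescale per channel.
        rms = v.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return v * rms * self.gain
```

Bounding the dynamic range of V this way is one plausible reason it could help quantization robustness, as the contributions list suggests.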
- **Mixture of softmax**: uses a mixture of softmax heads/experts to break the softmax rank bottleneck. (parameters: {"num_experts": 2})
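A sketch of a mixture-of-softmaxes output head in PyTorch: each expert produces its own softmax over the vocabulary and a learned prior mixes them, lifting the rank limit a single softmax imposes on the output distribution. `num_experts=2` comes from the card; the layer layout is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfSoftmax(nn.Module):
    """Mixture-of-softmaxes output head.

    num_experts=2 matches the card's parameters; projection shapes are assumed.
    """

    def __init__(self, dim: int, vocab: int, num_experts: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.proj = nn.Linear(dim, num_experts * dim, bias=False)   # per-expert contexts
        self.prior = nn.Linear(dim, num_experts, bias=False)        # mixture weights
        self.decoder = nn.Linear(dim, vocab, bias=False)            # shared decoder

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim) -> mixture of K softmaxes over the vocab.
        b = h.size(0)
        ctx = torch.tanh(self.proj(h)).view(b, self.num_experts, -1)
        probs = F.softmax(self.decoder(ctx), dim=-1)          # (b, K, vocab)
        pi = F.softmax(self.prior(h), dim=-1).unsqueeze(-1)   # (b, K, 1)
        return (pi * probs).sum(dim=1)                        # (b, vocab)
```

The mixing must happen in probability space, not logit space; averaging logits would collapse back into a single (rank-limited) softmax.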
## Evaluation

- **Sliding window eval** (parameters: {"window_size": 256, "num_layers": 5})
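A minimal sketch of the sliding-window causal mask implied by `window_size=256` (the helper name is an assumption, and the example uses a tiny window for readability):

```python
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks attendable positions.

    Each query attends to itself and the previous `window - 1` tokens,
    so attention stays causal and local. The card uses window_size=256.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query index
    j = torch.arange(seq_len).unsqueeze(0)  # key index
    return (j <= i) & (j > i - window)
```

Applying this mask in the first five layers (per `num_layers: 5`) turns their attention cost from quadratic to linear in sequence length, which is the FLOPs saving cited in the contributions below.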
## Novel Contributions
- Provides five self-contained ablation training scripts built on the current SOTA baseline.
- Introduces a SwiGLU MLP replacement for the relu² MLP.
- Adds sliding-window attention to early layers to reduce FLOPs.
- Adds learnable register/sink tokens to absorb attention sinks.
- Introduces a gated V-norm on attention values, which may improve quantization robustness.
- Explores mixture of softmax to address the softmax bottleneck and improve BPB.