| Field | Value |
| --- | --- |
| val_bpb | 1.3565 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | — |
Training Techniques
Architecture

- **depth recurrence**: Increased model depth from 9 to 11 layers (parameters: `{"layers": 11}`).
- **MLP3x**: Replaced the standard ReLU^2 MLP with a 3-matrix SwiGLU feedforward block.
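A minimal numpy sketch of what a 3-matrix SwiGLU feedforward block computes. The function and weight names (`w_gate`, `w_up`, `w_down`) and the dimensions are illustrative assumptions, not the run's actual module:

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # 3-matrix SwiGLU: down-project (SiLU(x @ w_gate) * (x @ w_up)).
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU activation on the gate branch
    return (silu * (x @ w_up)) @ w_down

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 8)
```

The "3-matrix" name reflects the three weight matrices: a gated branch, a linear branch, and a down-projection, versus the two matrices of a plain ReLU^2 MLP.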
- **Gated Attention**: Used parallel residual branches for attention and FFN, combined as `x = x + attn_out + ffwd_out`.
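The parallel-residual update above can be sketched in a few lines. The stand-in `attn` and `ffwd` callables below are placeholders for the real sublayers, purely to show the wiring:

```python
import numpy as np

def parallel_residual_block(x, attn, ffwd):
    # Both branches read the SAME input x (rather than the sequential
    # x -> attn -> ffwd order) and their outputs are summed into one update.
    return x + attn(x) + ffwd(x)

# Hypothetical stand-ins for the attention and FFN sublayers.
attn = lambda x: 0.1 * x
ffwd = lambda x: 0.2 * x
x = np.ones((2, 4))
y = parallel_residual_block(x, attn, ffwd)
print(y.shape)
```

Because the two branches no longer depend on each other, they can be computed concurrently, which is the usual motivation for this layout.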
Initialization

- **QK gain**: Adjusted the QK gain initialization to 2.5.
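One plausible reading of the QK gain change, sketched below: scale the standard-deviation of the query/key projection init by a gain factor. The base `1/sqrt(d_model)` scheme and the function name are assumptions; the source only states the gain value 2.5:

```python
import numpy as np

def init_qk(d_model, gain=2.5, seed=0):
    # Hypothetical scheme: scale a 1/sqrt(d_model) Gaussian init by `gain`.
    rng = np.random.default_rng(seed)
    std = gain / np.sqrt(d_model)
    w_q = rng.standard_normal((d_model, d_model)) * std
    w_k = rng.standard_normal((d_model, d_model)) * std
    return w_q, w_k

w_q, w_k = init_qk(16)
print(w_q.shape, w_k.shape)
```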
Quantization

- **int8** (bits: 8, scope: evaluation)
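Since the int8 quantization is scoped to evaluation, a simple post-hoc weight round-trip is enough to illustrate it. This is a generic symmetric per-tensor scheme, an assumption; the run's actual quantizer may differ:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 weight for evaluation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, err <= scale)
```

Quantizing only at evaluation time leaves training in full precision, so the reported val_bpb reflects the int8 artifact rather than the training-time weights.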
Novel Contributions
- Scaled the model from 9 to 11 layers
- Introduced parallel residual attention and FFN branches
- Replaced ReLU^2 MLP with SwiGLU
- Adjusted QK gain initialization to 2.5
- Reported improved H100 validation bpb versus the 9-layer H100 run