PR #1751

open

Parallel-Residual+SwiGLU+11layer

by Pravin-dev06
val_bpb
1.3565
Architecture: Transformer

Training Techniques

Architecture
Depth scaling
Increased model depth from 9 to 11 layers.
parameters: {"layers":11}
MLP3x
Replaced the standard ReLU^2 MLP with a 3-matrix SwiGLU feedforward block.
parameters: null
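A minimal sketch of the 3-matrix SwiGLU feedforward described above, in NumPy. The function and weight names are illustrative, not from the PR; the PR states only that the ReLU^2 MLP was replaced by a 3-matrix SwiGLU block.

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    """3-matrix SwiGLU: down( silu(x @ w_gate) * (x @ w_up) ).

    The gate path is passed through SiLU and multiplies the up path
    elementwise, replacing the single-activation ReLU^2 MLP.
    """
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU (swish) activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

Note the third matrix: a standard 2-matrix MLP has only up and down projections, while SwiGLU adds a separate gate projection.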
Parallel Residual
Used parallel residual branches for attention and FFN, combining them as x = x + attn_out + ffwd_out.
parameters: null
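The parallel residual combination stated above, x = x + attn_out + ffwd_out, can be sketched as a single function (a minimal sketch; the attention and FFN callables are placeholders):

```python
def parallel_block(x, attn, ffwd):
    # Both branches read the SAME residual input and are summed in one
    # residual add. A sequential block would instead compute
    # x = x + attn(x), then x = x + ffwd(x).
    return x + attn(x) + ffwd(x)
```

The key difference from a sequential block is that the FFN no longer sees the attention output, which lets the two branches run concurrently.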
Initialization
QK gain
Adjusted QK gain initialization to 2.5.
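The PR does not show how the QK gain of 2.5 is applied. One plausible reading, sketched here in NumPy, is that the gain scales the standard deviation of the query/key projection initializations (the function, shapes, and base 1/sqrt(d_model) scaling are assumptions):

```python
import numpy as np

def init_qk(d_model, d_head, gain=2.5, seed=0):
    # Assumed interpretation: multiply a standard 1/sqrt(d_model)
    # init scale by the QK gain (2.5 in this PR).
    rng = np.random.default_rng(seed)
    std = gain / np.sqrt(d_model)
    return rng.normal(0.0, std, size=(d_model, d_head))
```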
Quantization
int8
bits: 8
scope: evaluation
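The int8 quantization above is scoped to evaluation only. A minimal sketch of symmetric per-tensor int8 quantize/dequantize, assuming the common max-absolute-value scaling (the PR does not specify the scheme):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map max |w| onto 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for evaluation.
    return q.astype(np.float32) * scale
```

Because the scope is evaluation, weights would be quantized after training and dequantized (or used with int8 kernels) at inference time only.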

Novel Contributions

  • Scaled the model from 9 to 11 layers
  • Introduced parallel residual attention and FFN branches
  • Replaced ReLU^2 MLP with SwiGLU
  • Adjusted QK gain initialization to 2.5
  • Reported improved H100 validation bpb versus the 9-layer H100 run