| Field | Value |
| --- | --- |
| val_bpb | 1.3565 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | — |
Training Techniques
Architecture

- **depth recurrence**: Increased model depth from 9 to 11 layers (parameters: `{"layers": 11}`).
- **MLP3x**: Replaced the standard ReLU^2 MLP with a 3-matrix SwiGLU feedforward block.
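A minimal numpy sketch of what a 3-matrix SwiGLU feedforward block computes. The function and weight names (`w_gate`, `w_up`, `w_down`) and the dimensions are illustrative assumptions, not the run's actual module:

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # 3-matrix SwiGLU: down-project (SiLU(x @ w_gate) * (x @ w_up)).
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU activation on the gate branch
    return (silu * (x @ w_up)) @ w_down

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 8)
```

The "3-matrix" name reflects the three weight matrices: a gated branch, a linear branch, and a down-projection, versus the two matrices of a plain ReLU^2 MLP.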
- **Gated Attention**: Used parallel residual branches for attention and FFN, combined as `x = x + attn_out + ffwd_out`.
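The parallel-residual update above can be sketched in a few lines. The stand-in `attn` and `ffwd` callables below are placeholders for the real sublayers, purely to show the wiring:

```python
import numpy as np

def parallel_residual_block(x, attn, ffwd):
    # Both branches read the SAME input x (rather than the sequential
    # x -> attn -> ffwd order) and their outputs are summed into one update.
    return x + attn(x) + ffwd(x)

# Hypothetical stand-ins for the attention and FFN sublayers.
attn = lambda x: 0.1 * x
ffwd = lambda x: 0.2 * x
x = np.ones((2, 4))
y = parallel_residual_block(x, attn, ffwd)
print(y.shape)
```

Because the two branches no longer depend on each other, they can be computed concurrently, which is the usual motivation for this layout.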
Initialization

- **QK gain**: Adjusted the QK gain initialization to 2.5.
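One plausible reading of the QK gain change, sketched below: scale the standard-deviation of the query/key projection init by a gain factor. The base `1/sqrt(d_model)` scheme and the function name are assumptions; the source only states the gain value 2.5:

```python
import numpy as np

def init_qk(d_model, gain=2.5, seed=0):
    # Hypothetical scheme: scale a 1/sqrt(d_model) Gaussian init by `gain`.
    rng = np.random.default_rng(seed)
    std = gain / np.sqrt(d_model)
    w_q = rng.standard_normal((d_model, d_model)) * std
    w_k = rng.standard_normal((d_model, d_model)) * std
    return w_q, w_k

w_q, w_k = init_qk(16)
print(w_q.shape, w_k.shape)
```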
Quantization

- **int8** (bits: 8, scope: evaluation)
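Since the int8 quantization is scoped to evaluation, a simple post-hoc weight round-trip is enough to illustrate it. This is a generic symmetric per-tensor scheme, an assumption; the run's actual quantizer may differ:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 weight for evaluation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, err <= scale)
```

Quantizing only at evaluation time leaves training in full precision, so the reported val_bpb reflects the int8 artifact rather than the training-time weights.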
Novel Contributions
- Scaled the model from 9 to 11 layers
- Introduced parallel residual attention and FFN branches
- Replaced ReLU^2 MLP with SwiGLU
- Adjusted QK gain initialization to 2.5
- Reported improved H100 validation bpb versus the 9-layer H100 run