| Metric | Value |
| --- | --- |
| val_bpb | 1.2005 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,399,277 bytes |
Training Techniques
Architecture

- SwiGLU: Replaced the ReLU² MLP activation with SwiGLU gating in the MLP layers. (`parameters: null`)
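As a reference for the gating above, a minimal NumPy sketch of a SwiGLU MLP block (function names and shapes are illustrative, not the actual training code):

```python
import numpy as np

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: gate the up-projection with silu(x @ w_gate),
    then project back down to the model dimension."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Illustrative shapes: d_model=4, hidden=8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w_gate = rng.normal(size=(4, 8))
w_up = rng.normal(size=(4, 8))
w_down = rng.normal(size=(8, 4))
y = swiglu_mlp(x, w_gate, w_up, w_down)  # shape (2, 4)
```

Note that SwiGLU carries three weight matrices per MLP instead of two, which interacts with the hidden-size choice under a fixed parameter budget.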
- MLP3x: Expanded the MLP hidden size to 3x the baseline to better utilize the 16MB artifact budget. (`parameters: {"mlp_mult": 3}`)
- KV head count: Used fewer KV heads than query heads (grouped-query attention), shrinking the KV cache and the K/V projection parameter count. (`parameters: {"num_heads": 8, "num_kv_heads": 4}`)
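The KV-head sharing can be sketched as below: with 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache relative to full multi-head attention (function name and shapes are illustrative):

```python
import numpy as np

def expand_kv_heads(kv, num_heads=8, num_kv_heads=4):
    """Repeat each KV head so a group of query heads shares it.

    kv: (num_kv_heads, seq_len, head_dim)
    returns: (num_heads, seq_len, head_dim)
    """
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head
    return np.repeat(kv, group_size, axis=0)

# Illustrative cache: 4 KV heads, seq_len=6, head_dim=2
kv = np.arange(4 * 6 * 2, dtype=np.float64).reshape(4, 6, 2)
k_full = expand_kv_heads(kv)  # shape (8, 6, 2)
```

Only the compact `(4, seq, dim)` tensor needs to be stored; the expansion can happen on the fly at attention time.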
Sequence Length

- sequence_length: `train_length: 2048`, `eval_length: 2048`
LR Schedule

- Dynamic wallclock cosine warmdown: cosine decay over the final 40% of the 600-second wallclock budget, keyed to elapsed time rather than step count. (`parameters: {"max_wallclock_seconds": 600, "warmdown_fraction": 0.4}`)
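A sketch of what a wallclock-keyed cosine warmdown might look like, assuming the learning rate is flat before the warmdown window begins (the function name and the pre-warmdown behavior are assumptions, not taken from the training code):

```python
import math

def wallclock_lr(elapsed_s, base_lr,
                 max_wallclock_seconds=600.0, warmdown_fraction=0.4):
    """Hold base_lr, then cosine-decay to zero over the final
    warmdown_fraction of the wallclock budget."""
    warmdown_start = max_wallclock_seconds * (1.0 - warmdown_fraction)
    if elapsed_s <= warmdown_start:
        return base_lr
    # Progress through the warmdown window, clamped to [0, 1]
    t = min(1.0, (elapsed_s - warmdown_start)
                 / (max_wallclock_seconds - warmdown_start))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

Because the schedule reads the clock instead of the step counter, the decay always completes inside the 600-second budget regardless of throughput variation.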
Weight Averaging

- SWA: disabled. (`parameters: {"disabled": true}`)
Quantization

- STE QAT / post-quant 6-bit: quantization-aware training via a straight-through estimator, followed by 6-bit post-training quantization of all weights. (`bits: 6`, `scope: all`)
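A minimal sketch of symmetric 6-bit fake quantization; the straight-through-estimator part is shown only as a comment, since it needs an autograd framework (names and the exact scheme are assumptions, not the actual training code):

```python
import numpy as np

def fake_quant_6bit(w, bits=6):
    """Symmetric per-tensor fake quantization: round weights onto a
    signed 6-bit grid, then scale back to floats."""
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits
    scale = max(np.abs(w).max() / qmax, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Straight-through estimator (conceptually, in an autograd framework):
#   w_q = w + stop_gradient(fake_quant_6bit(w) - w)
# The forward pass sees quantized weights; the backward pass treats
# quantization as the identity, so gradients flow to the float weights.
```

Training against the quantized forward pass is what lets the final checkpoint survive the 6-bit post-quantization with little quality loss.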
Other

- Doubled the context length and enlarged the batch token budget for training under the 600-second hardware-bound run. (`parameters: {"train_batch_tokens": 524288, "context_length": 2048, "hardware": "8x H100 SXM"}`)
Novel Contributions
- Migrating the baseline to a SwiGLU-based MLP architecture
- Scaling the MLP to 3x width to fully utilize the 16MB artifact budget
- Using a hardware-clock-based dynamic cosine warmdown schedule
- Disabling SWA at the end of training to avoid degrading the final checkpoint
- Applying a straight-through estimator during training so the weights remain robust to 6-bit quantization