PR #799

open

Non-record Submission: SwiGLU 3x + Dynamic Wallclock Cosine

by yuvraajbains
val_bpb: 1.2005
Architecture: Transformer
Optimizer:
Artifact Size: 15,399,277 bytes

Training Techniques

Architecture
SwiGLU
Replaced the baseline's ReLU² activation with SwiGLU gating in the MLP layers.
parameters: null
MLP3x
Expanded MLP hidden size to 3x baseline to better utilize the 16MB artifact budget.
parameters: {"mlp_mult":3}
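A minimal numpy sketch of the SwiGLU MLP described above (shapes, weights, and the 3x hidden width follow the `mlp_mult` parameter; this is illustrative, not the submission's actual code):

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU-gated product of two projections, then down-projection
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 8
mlp_mult = 3                      # "mlp_mult": 3 from the entry above
d_hidden = mlp_mult * d_model

rng = np.random.default_rng(0)
x = rng.standard_normal((2, d_model))
y = swiglu_mlp(x,
               rng.standard_normal((d_model, d_hidden)),
               rng.standard_normal((d_model, d_hidden)),
               rng.standard_normal((d_hidden, d_model)))
```

Note that SwiGLU uses three weight matrices where ReLU² uses two, so the 3x width here trades off against the extra gate projection inside the 16MB budget.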
KV head count
Used fewer KV heads than query heads (grouped-query attention), shrinking the KV projections.
parameters: {"num_heads":8,"num_kv_heads":4}
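A grouped-query attention sketch in numpy, using the head counts from the parameters above (4 KV heads shared across 8 query heads; dimensions are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 32
group = num_heads // num_kv_heads   # each KV head serves 2 query heads

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand KV heads so each query head attends to its shared KV head
k = np.repeat(k, group, axis=0)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v
```

The KV projection matrices (and any KV cache) are half the size of full multi-head attention here, while the query side is unchanged.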
Sequence Length
sequence_length
parameters: {"train_length":2048,"eval_length":2048}
LR Schedule
dynamic wallclock cosine warmdown
parameters: {"max_wallclock_seconds":600,"warmdown_fraction":0.4}
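A sketch of the wallclock-pegged cosine warmdown with the parameters above: the LR stays flat for the first 60% of the 600-second budget, then cosine-decays to zero based on elapsed time rather than step count (function name and base LR are illustrative):

```python
import math

def wallclock_lr(elapsed_s, base_lr=1.0,
                 max_wallclock_seconds=600, warmdown_fraction=0.4):
    # Flat LR until the warmdown window opens, then cosine decay to 0,
    # keyed to wallclock time so the schedule always finishes on budget.
    warmdown_start = max_wallclock_seconds * (1 - warmdown_fraction)
    if elapsed_s <= warmdown_start:
        return base_lr
    t = min(1.0, (elapsed_s - warmdown_start)
                 / (max_wallclock_seconds - warmdown_start))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

Because the decay is driven by the clock, the schedule lands at zero exactly when the hardware budget expires, regardless of how many steps fit into the run.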
Weight Averaging
SWA
parameters: {"disabled":true}
Quantization
STE QAT / post-quant 6-bit
bits: 6
scope: all
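A minimal sketch of the 6-bit symmetric quantizer, with the straight-through estimator noted in a comment (per-tensor scaling is an assumption; the submission's actual granularity isn't specified):

```python
import numpy as np

def quantize_6bit(w):
    # Symmetric per-tensor 6-bit quantization: integer levels in [-31, 31]
    scale = np.abs(w).max() / 31.0
    if scale == 0.0:
        return w.copy()
    return np.clip(np.round(w / scale), -31, 31) * scale

# Straight-through estimator (STE) for QAT, in autograd pseudocode:
# forward uses quantized weights, backward treats rounding as identity:
#   w_q = w + stop_gradient(quantize_6bit(w) - w)

w = np.linspace(-1.0, 1.0, 100)
wq = quantize_6bit(w)
```

Training against the quantized forward pass keeps the weights in regions where the 6-bit rounding at artifact-export time costs little accuracy.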
Other
other
Doubled the context length and increased the per-step training token budget for the 600-second hardware-bound run.
parameters: {"train_batch_tokens":524288,"context_length":2048,"hardware":"8x H100 SXM"}

Novel Contributions

  • Migrating the baseline to a SwiGLU-based MLP architecture
  • Scaling the MLP to 3x width to fully utilize the 16MB artifact budget
  • Using a hardware-clock-based dynamic cosine warmdown schedule
  • Disabling SWA at the end of training to avoid degrading the final checkpoint
  • Applying a straight-through estimator so training stays robust to the 6-bit post-quantization