| Metric | Value |
| --- | --- |
| val_bpb | 1.2005 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,399,277 bytes |
Training Techniques
Architecture

- SwiGLU: Replaced the ReLU² MLP activation with SwiGLU gating in the MLP layers. (`parameters: null`)
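As a reference for the gating above, a minimal NumPy sketch of a SwiGLU MLP block (function names and shapes are illustrative, not the actual training code):

```python
import numpy as np

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: gate the up-projection with silu(x @ w_gate),
    then project back down to the model dimension."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Illustrative shapes: d_model=4, hidden=8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w_gate = rng.normal(size=(4, 8))
w_up = rng.normal(size=(4, 8))
w_down = rng.normal(size=(8, 4))
y = swiglu_mlp(x, w_gate, w_up, w_down)  # shape (2, 4)
```

Note that SwiGLU carries three weight matrices per MLP instead of two, which interacts with the hidden-size choice under a fixed parameter budget.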
- MLP3x: Expanded the MLP hidden size to 3x the baseline to better utilize the 16MB artifact budget. (`parameters: {"mlp_mult": 3}`)
- KV head count: Used fewer KV heads than query heads (grouped-query attention), shrinking the KV cache and the K/V projection parameter count. (`parameters: {"num_heads": 8, "num_kv_heads": 4}`)
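The KV-head sharing can be sketched as below: with 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache relative to full multi-head attention (function name and shapes are illustrative):

```python
import numpy as np

def expand_kv_heads(kv, num_heads=8, num_kv_heads=4):
    """Repeat each KV head so a group of query heads shares it.

    kv: (num_kv_heads, seq_len, head_dim)
    returns: (num_heads, seq_len, head_dim)
    """
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head
    return np.repeat(kv, group_size, axis=0)

# Illustrative cache: 4 KV heads, seq_len=6, head_dim=2
kv = np.arange(4 * 6 * 2, dtype=np.float64).reshape(4, 6, 2)
k_full = expand_kv_heads(kv)  # shape (8, 6, 2)
```

Only the compact `(4, seq, dim)` tensor needs to be stored; the expansion can happen on the fly at attention time.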
Sequence Length

- sequence_length: `train_length: 2048`, `eval_length: 2048`
LR Schedule

- Dynamic wallclock cosine warmdown: cosine decay over the final 40% of the 600-second wallclock budget, keyed to elapsed time rather than step count. (`parameters: {"max_wallclock_seconds": 600, "warmdown_fraction": 0.4}`)
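A sketch of what a wallclock-keyed cosine warmdown might look like, assuming the learning rate is flat before the warmdown window begins (the function name and the pre-warmdown behavior are assumptions, not taken from the training code):

```python
import math

def wallclock_lr(elapsed_s, base_lr,
                 max_wallclock_seconds=600.0, warmdown_fraction=0.4):
    """Hold base_lr, then cosine-decay to zero over the final
    warmdown_fraction of the wallclock budget."""
    warmdown_start = max_wallclock_seconds * (1.0 - warmdown_fraction)
    if elapsed_s <= warmdown_start:
        return base_lr
    # Progress through the warmdown window, clamped to [0, 1]
    t = min(1.0, (elapsed_s - warmdown_start)
                 / (max_wallclock_seconds - warmdown_start))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

Because the schedule reads the clock instead of the step counter, the decay always completes inside the 600-second budget regardless of throughput variation.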
Weight Averaging

- SWA: disabled. (`parameters: {"disabled": true}`)
Quantization

- STE QAT / post-quant 6-bit: quantization-aware training via a straight-through estimator, followed by 6-bit post-training quantization of all weights. (`bits: 6`, `scope: all`)
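A minimal sketch of symmetric 6-bit fake quantization; the straight-through-estimator part is shown only as a comment, since it needs an autograd framework (names and the exact scheme are assumptions, not the actual training code):

```python
import numpy as np

def fake_quant_6bit(w, bits=6):
    """Symmetric per-tensor fake quantization: round weights onto a
    signed 6-bit grid, then scale back to floats."""
    qmax = 2 ** (bits - 1) - 1              # 31 for 6 bits
    scale = max(np.abs(w).max() / qmax, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Straight-through estimator (conceptually, in an autograd framework):
#   w_q = w + stop_gradient(fake_quant_6bit(w) - w)
# The forward pass sees quantized weights; the backward pass treats
# quantization as the identity, so gradients flow to the float weights.
```

Training against the quantized forward pass is what lets the final checkpoint survive the 6-bit post-quantization with little quality loss.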
Other

- Doubled the context length and enlarged the batch token budget for training under the 600-second hardware-bound run. (`parameters: {"train_batch_tokens": 524288, "context_length": 2048, "hardware": "8x H100 SXM"}`)
Novel Contributions
- Migrating the baseline to a SwiGLU-based MLP architecture
- Scaling the MLP to 3x width to fully utilize the 16MB artifact budget
- Using a hardware-clock-based dynamic cosine warmdown schedule
- Disabling SWA at the end of training to avoid degrading the final checkpoint
- Applying a straight-through estimator during training so the weights remain robust to 6-bit quantization