PR #1505 (open)

Non-record: 11L 3x MLP Seq2048 — val_bpb 1.1791 (8xH100 SXM)

by Rohan-Abhilash
val_bpb: 1.1791
Architecture: Transformer
Optimizer:
Artifact Size: 24.5 MB

Training Techniques

Architecture
MLP3x
Increased the MLP expansion factor from 2x to 3x of the hidden dim (1536).
parameters: {"mlp_multiplier":3,"hidden_dim":1536}
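A quick sketch of what the 2x-to-3x MLP widening costs in parameters, assuming a standard two-matrix MLP block (up- and down-projection, no gating; the PR does not state the exact MLP variant):

```python
def mlp_params(hidden_dim: int, multiplier: int) -> int:
    """Parameters in a plain two-matrix MLP block (biases omitted)."""
    inner = multiplier * hidden_dim
    return hidden_dim * inner + inner * hidden_dim  # W_up + W_down

d = 1536
print(mlp_params(d, 2))                          # per-layer params at 2x
print(mlp_params(d, 3))                          # per-layer params at 3x
print(11 * (mlp_params(d, 3) - mlp_params(d, 2)))  # extra params over 11 layers
```

At hidden dim 1536, the move from 2x to 3x adds roughly 4.7M parameters per layer, about 52M across the 11 layers.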
weight tying
Tied input embeddings and output head.
parameters: null
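Weight tying makes the output head reuse the embedding matrix, so the (vocab_size x hidden_dim) head weights are never stored separately. A rough sketch of the artifact-size saving, assuming a byte-level vocabulary of 256 (consistent with reporting bits per byte, but not stated in the PR):

```python
hidden_dim = 1536
vocab_size = 256  # hypothetical byte-level vocab; not stated in the PR

# With tying, the (vocab_size x hidden_dim) output head is not stored.
saved_params = vocab_size * hidden_dim
saved_mb_int8 = saved_params / 2**20  # 1 byte per weight at int8
print(f"params saved: {saved_params:,} (~{saved_mb_int8:.3f} MB at int8)")
```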
KV head count
Used grouped KV heads in the transformer configuration.
parameters: {"num_heads":8,"num_kv_heads":4}
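With grouped KV heads, consecutive query heads share one KV head, shrinking the KV projections and KV cache by num_heads / num_kv_heads. A minimal sketch of the head mapping at the PR's settings:

```python
def kv_head_for(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Grouped-query attention: each group of query heads shares one KV head."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size

num_heads, num_kv_heads = 8, 4
mapping = [kv_head_for(h, num_heads, num_kv_heads) for h in range(num_heads)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(num_heads // num_kv_heads)  # KV cache / KV-weight shrink factor: 2
```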
Transformer
Scaled baseline transformer to 11 layers.
parameters: {"layers":11}
Sequence Length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":2000}
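A warmdown schedule holds the learning rate constant and then decays it over the final steps. The exact shape used in this PR is not stated; the sketch below assumes the common trapezoidal variant (constant, then linear decay to zero over the last 2000 steps), with a hypothetical total step count:

```python
def lr_at(step: int, total_steps: int, warmdown_steps: int, base_lr: float) -> float:
    """Constant LR, then linear warmdown to zero over the final steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 10000  # hypothetical; the PR does not state the total step count
print(lr_at(0, total, 2000, 1e-3))      # 0.001 (constant phase)
print(lr_at(9000, total, 2000, 1e-3))   # 0.0005 (halfway through warmdown)
print(lr_at(10000, total, 2000, 1e-3))  # 0.0 (end of training)
```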
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
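The reported val_bpb is measured after an int8 + zlib roundtrip of the weights. A minimal sketch of that pipeline, assuming symmetric per-tensor int8 quantization (the PR's exact scheme is not stated), using only the standard library:

```python
import random
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor int8: scale by max |w| / 127, round, clamp."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # toy weight tensor

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

raw = struct.pack(f"{len(q)}b", *q)       # 1 byte per weight
compressed = zlib.compress(raw, level=9)  # lossless stage of the artifact

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(raw), len(compressed))  # sizes before/after zlib
print(max_err <= scale / 2 + 1e-9)  # rounding error bounded by half a step
```

The zlib stage is lossless, so only the int8 rounding affects val_bpb; how much zlib shrinks the int8 bytes depends on how much structure the quantized weights have.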

Novel Contributions

  • Scaled the baseline transformer to 11 layers
  • Increased MLP capacity to 3x width
  • Extended training sequence length to 2048
  • Used a longer warmdown (2000 steps) for better convergence
  • Improved validation BPB to 1.1791 after int8+zlib roundtrip
  • Identified int6 QAT + GPTQ + LZMA as the path to fit under the 16MB limit