PR #716

open

Add non-record 4090 warmdown submission

by SHN2004
val_bpb
1.4239
Architecture
Transformer
Optimizer
Artifact Size
14,624,248 bytes

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
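Weight tying can be illustrated without any framework: one shared matrix serves as both the input lookup table and the output projection. This is a minimal pure-Python sketch with placeholder sizes, not the submission's actual vocabulary or model dimensions.

```python
# Tied-embedding sketch: one matrix used for both input lookup and output
# logits. Sizes here are illustrative placeholders.
vocab_size, d_model = 4, 3

# Single shared weight matrix (vocab_size x d_model).
W = [[0.1 * (i + j) for j in range(d_model)] for i in range(vocab_size)]

def embed(token_id, weight):
    # Input side: row lookup.
    return weight[token_id]

def logits(hidden, weight):
    # Output side: dot product against every embedding row (W^T projection).
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

h = embed(2, W)
scores = logits(h, W)

# Storing the matrix once instead of twice saves vocab_size * d_model
# parameters, which matters for a size-limited artifact.
saved = vocab_size * d_model
```

Because both directions reference the same object, any training update to `W` affects the input and output sides simultaneously.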
KV head count
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
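With 8 attention heads sharing 4 KV heads (the parameters listed above), consecutive query heads share one KV head. A small sketch of that grouping:

```python
# Grouped-query attention head mapping, using the listed parameters:
# 8 query heads, 4 KV heads.
heads, kv_heads = 8, 4
group_size = heads // kv_heads  # query heads per shared KV head

def kv_head_for(query_head):
    # Consecutive query heads map to the same KV head.
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(heads)]
# mapping -> [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV head count halves the K/V projection parameters and the KV cache, at some cost in attention expressivity.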
LeakyReLU2
Replaces the default MLP activation with a LeakyReLU-squared variant.
parameters: {"slope":0.5}
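One plausible reading of "LeakyReLU-squared" with slope 0.5 is a LeakyReLU followed by an elementwise square; the exact variant (e.g. whether the sign is preserved via `y * abs(y)`) is not specified in the listing, so this is an assumption.

```python
def leaky_relu2(x, slope=0.5):
    # LeakyReLU, then elementwise square -- one plausible interpretation of
    # "LeakyReLU-squared"; some variants keep the sign with y * abs(y).
    y = x if x > 0.0 else slope * x
    return y * y
```

Compared with plain squared-ReLU, the nonzero negative slope keeps gradient flowing for negative pre-activations.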
LR Schedule
warmdown
parameters: {"warmdown_iters":300}
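A common shape for a "warmdown" schedule is constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The sketch below uses the listed 300 warmdown iterations; `base_lr` and `total_iters` are placeholders, not the submission's values.

```python
def lr_at(step, total_iters, warmdown_iters=300, base_lr=1.0):
    # Constant LR, then linear decay to zero over the last warmdown_iters
    # steps. base_lr and total_iters are hypothetical placeholders.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```

Lengthening the warmdown (as this PR does) trades a few high-LR steps for a gentler final descent.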
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Quantization
int8
bits: 8
scope: all
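The listing says int8 over all weights but not the quantization scheme; a simple symmetric per-tensor scheme is sketched below as an assumption, scaling so the maximum magnitude maps to 127.

```python
def quantize_int8(values):
    # Symmetric per-tensor int8 quantization (assumed scheme, not confirmed
    # by the listing): scale max |v| to 127, round, clamp.
    scale = max(abs(v) for v in values) / 127.0  # assumes a nonzero tensor
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float values from int8 codes.
    return [x * scale for x in q]
```

Storing one byte per weight (plus a scale) is what brings the artifact near the ~14.6 MB size reported above.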
Compression
zlib
level: null
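Since the listing gives `level: null`, the sketch below uses zlib's default compression level; the input bytes are a stand-in for the real artifact.

```python
import zlib

# Compress stand-in artifact bytes at zlib's default level (the listing
# does not specify a level). Real weight bytes would compress far less
# than this all-zeros example.
raw = b"\x00" * 1024
packed = zlib.compress(raw)
restored = zlib.decompress(packed)
ratio = len(packed) / len(raw)
```

Lossless compression stacks on top of int8 quantization: quantized weights often have low-entropy byte patterns that zlib can exploit.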
Other
other
torch.compile enabled for training/evaluation speedup on a single RTX 4090.
parameters: {"hardware":"1x RTX 4090","wallclock_seconds":300}

Novel Contributions

  • Compiled execution with torch.compile
  • Quarter batch sizing to increase optimizer steps in fixed wall-clock time
  • Longer warmdown schedule (300 iterations)
  • LeakyReLU2 activation in the MLP
  • Single-GPU 4090 proxy search documenting a non-record 16MB submission
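The quarter-batch contribution can be put in back-of-envelope terms: under the listed 300-second wall-clock budget, if per-step time scales roughly linearly with batch size (an assumption that ignores fixed overheads), a quarter batch yields about 4x the optimizer steps.

```python
# Back-of-envelope for quarter batch sizing under a fixed wall clock.
# wallclock comes from the listing; t_full is a hypothetical seconds-per-
# step at full batch, and linear scaling with batch size is assumed.
wallclock = 300.0
t_full = 0.4
steps_full = int(wallclock / t_full)
steps_quarter = int(wallclock / (t_full / 4))
# steps_quarter is 4x steps_full under these assumptions.
```

In practice kernel launch and optimizer overheads make the gain somewhat less than 4x, which is part of what a proxy search like this one measures.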