PR #1434 (open)

Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada

by ranausmanai
val_bpb: 1.5207
Architecture: Transformer
Optimizer:
Artifact Size: ~14MB

Training Techniques

Compression
  • zlib (level: null; size-measurement sketch below)
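
The ~14MB artifact size is presumably the zlib-compressed checkpoint. A minimal sketch of that measurement, assuming the weights are a PyTorch state dict and zlib's default level (the PR leaves the level null):

```python
import io
import zlib

import torch

def compressed_size_mb(model: torch.nn.Module) -> float:
    # Serialize the state dict, zlib-compress the bytes, report megabytes.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return len(zlib.compress(buf.getvalue())) / 1e6
```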

LR Schedule
  • cosine decay (parameters: null)
  • warmup (parameters: {"warmup_fraction": 0.382}; sketched below)

Regularization
  • logit softcap (parameters: {"value": 15}; sketched below)
  • weight decay (parameters: null)
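
Assuming "logit softcap" here means the tanh-based soft capping popularized by Gemma 2, the listed value of 15 would be applied to the output logits as:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squashes logits into (-cap, cap) while staying differentiable.
    return cap * torch.tanh(logits / cap)
```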

Architecture
  • RoPE: adjusted rotary positional embedding base for sharper positional attention (parameters: {"base": 5000})
  • parallel blocks: PaLM-style parallel block structure (parameters: null)
  • MLP activation: SiLU-squared activation in the MLP (parameters: {"activation": "silu2"})
  • GQA: grouped-query attention with reduced KV heads (parameters: {"num_kv_heads": 2})
  • weight tying: untied (tied input/output embeddings disabled) (parameters: null)
  • decoder depth: pure decoder configuration with no encoder layers (parameters: {"encoder_layers": 0})
  • model width: 9-layer, 512-dimensional model (parameters: {"layers": 9, "dimensions": 512})

(Sketches of the RoPE change and of a full parallel block follow this list.)

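A hedged sketch combining the remaining architecture entries into one PaLM-style parallel block: attention and the MLP read the same normalized input and their outputs are summed into the residual, attention is GQA with 2 KV heads, and the MLP uses the SiLU-squared activation. The query-head count, norm choice, and MLP width are assumptions; the PR only fixes num_kv_heads=2, 9 layers, and 512 dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    # PaLM-style parallel structure: y = x + attn(norm(x)) + mlp(norm(x)).
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, mlp_mult=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.norm = nn.LayerNorm(dim)  # norm choice is an assumption
        # GQA: fewer KV heads than query heads (2 per the PR).
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wkv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.w1 = nn.Linear(dim, mlp_mult * dim, bias=False)
        self.w2 = nn.Linear(mlp_mult * dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.norm(x)  # one shared norm feeds both branches
        # --- attention branch (GQA) ---
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        kv = self.wkv(h).view(b, t, 2 * self.n_kv_heads, self.head_dim).transpose(1, 2)
        k, v = kv.chunk(2, dim=1)
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        # --- MLP branch with SiLU-squared ("silu2") activation ---
        mlp = self.w2(F.silu(self.w1(h)) ** 2)
        # Parallel residual: both branch outputs add to the same input.
        return x + attn + mlp
```

Stacking nine such blocks at dim=512 matches the listed 9-layer, 512-dimensional configuration; the untied embeddings and the logit softcap above sit outside the blocks.
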
Novel Contributions

  • 131 systematic experiments across 18 phases on a single RTX 4000 Ada GPU
  • Achieved 1.5207 BPB under a ~$5 compute budget
  • Identified parallel blocks, untied embeddings, a pure-decoder configuration, GQA with 2 KV heads, SiLU-squared activation, a logit softcap of 15, tidal LR, and a RoPE base of 5000 as beneficial
  • Demonstrated that several gradient tricks, regularizers, and dynamic architecture ideas were harmful or impractical on a single GPU
  • Provided a complete experiment log and reproducible training script