PR #493 (open)

Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

by parinzee
val_bpb: 1.1309
Architecture: Transformer
Optimizer:
Artifact Size: 15.8 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention on last 4 layers for better representation
parameters: {"layers":4}
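The PR does not spell out what "exclusive" means here; a plausible reading is that each token is excluded from attending to its own position, on top of the usual causal mask. A minimal sketch of that mask, under that assumption (True = may attend):

```python
def xsa_mask(seq_len):
    # Causal mask that also excludes the diagonal: token i may attend
    # only to strictly earlier positions j < i.  The diagonal exclusion
    # is an assumed interpretation of "exclusive" self attention.
    # Note row 0 ends up with no allowed positions, so a real
    # implementation would need a fallback (e.g. an attention sink).
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]

mask = xsa_mask(4)
```

With seq_len=4, row 2 is [True, True, False, False]: position 2 sees positions 0 and 1 but not itself.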
LeakyReLU
Squared leaky ReLU with 0.5 negative slope
parameters: {"negative_slope":0.5,"power":2}
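A sketch of this activation, assuming the square is sign-preserving (y * |y|); a plain square would fold the negative branch onto positive values and discard the sign the leaky slope exists to keep, but the PR does not state which variant it uses:

```python
def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU first: pass positives through, scale negatives by the slope
    y = x if x >= 0 else negative_slope * x
    # Then square.  y * abs(y) keeps the sign (an assumption; see lead-in).
    return y * abs(y)
```

For example, squared_leaky_relu(2.0) gives 4.0, and squared_leaky_relu(-2.0) gives -1.0 (leaky branch: -1.0, then sign-preserving square).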
Partial RoPE
Only 16/64 dims use rotary embeddings
parameters: {"dims_used":16,"total_dims":64}
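A sketch of partial RoPE on a single 64-dim head vector: only the first 16 dims are rotated, the remaining 48 pass through unchanged. The pairing scheme and frequency base follow the standard RoPE recipe and are assumptions; only the 16/64 split comes from the PR:

```python
import math

def partial_rope(x, pos, dims_used=16, base=10000.0):
    # x: per-head vector (list of floats); rotate the first dims_used dims
    # in adjacent pairs, leave dims [dims_used:] untouched.
    out = list(x)
    for i in range(dims_used // 2):
        theta = pos / (base ** (2 * i / dims_used))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

At pos=0 every rotation angle is zero, so the vector is unchanged; at any other position only the first 16 entries move.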
BigramHash
BigramHash token embeddings
parameters: {"hash_size":2048,"dim":128}
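A sketch of hashed bigram embeddings with the PR's table shape (2048 buckets, 128 dims): each (previous token, current token) pair is hashed into a bucket, and the bucket's vector is looked up (typically added to the regular token embedding). The mixing constant is an assumption; only the table shape comes from the PR:

```python
import random

HASH_SIZE, DIM = 2048, 128
random.seed(0)
# Hypothetical learned table: HASH_SIZE rows of DIM-dim vectors
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(HASH_SIZE)]

def bigram_hash(prev_tok, tok, hash_size=HASH_SIZE):
    # Mix the two token ids into one bucket index.  The multiplier is an
    # arbitrary illustrative choice, not the PR's actual hash.
    return ((prev_tok * 1000003) ^ tok) % hash_size

def bigram_embedding(prev_tok, tok):
    return table[bigram_hash(prev_tok, tok)]
```

Collisions are expected and tolerated: 2048 buckets cannot distinguish all bigrams, but the hashed table stays small while still injecting local-context signal.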
KV head count
8 heads / 4 KV heads (GQA)
parameters: {"heads":8,"kv_heads":4}
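With 8 query heads and 4 KV heads, GQA shares each KV head across a group of query heads. The index mapping is just integer division by the group size:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    # Each group of n_heads // n_kv_heads query heads shares one KV head;
    # here that is 2 query heads per KV head.
    group = n_heads // n_kv_heads
    return query_head // group
```

So query heads 0-1 read KV head 0, heads 2-3 read KV head 1, and so on, halving KV cache size versus full multi-head attention.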
layers
Increased number of layers from 10 to 11
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.997}
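The EMA update itself is one line per parameter; with decay 0.997 the averaged weights track the live weights with an effective window of roughly 1/(1-0.997) ≈ 333 steps. A minimal sketch on flat parameter lists:

```python
def ema_update(avg, params, decay=0.997):
    # In-place exponential moving average: avg <- decay*avg + (1-decay)*p.
    # At evaluation time the averaged copy is used instead of the live weights.
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p
    return avg
```

Starting from avg=[0.0] and one update with p=1.0, the average moves to 0.003, i.e. (1 - 0.997) of the way toward the new value.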
Quantization
int6
bits: 6
scope: all large weight matrices
Compression
zstd
level: 22
Other
other
Scale clamping fix with clamp_min(1/clip_range) to improve quantization quality
parameters: null
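A sketch combining the int6 quantization above with the scale-clamping fix mentioned here: symmetric 6-bit levels in [-31, 31], with the per-tensor scale floored at 1/clip_range so that near-zero weight groups cannot produce a degenerate scale. clip_range=32 is an assumed illustrative default; only the clamp_min(1/clip_range) form comes from the PR:

```python
def quantize_int6(w, clip_range=32.0):
    # Symmetric int6: signed levels in [-31, 31] (2**(6-1) - 1).
    qmax = 31
    # Scale clamp mirroring the PR's fix: never let the scale fall
    # below 1/clip_range, even for an all-near-zero tensor.
    scale = max(max(abs(x) for x in w) / qmax, 1.0 / clip_range)
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```

Without the clamp, a tensor of tiny weights would get a tiny scale and the round-trip error blows up relative to the rest of the model; with it, such tensors just quantize to zeros at a sane scale.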
other
Smaller batch size (524288 tokens) to fit more training steps (~8200 steps in 600s)
parameters: {"batch_size_tokens":524288,"training_steps":8200,"training_time_seconds":600}
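The throughput implied by these parameters is easy to check; all numbers below come from the entry above:

```python
# Rough throughput implied by the PR's batch/step/time parameters
tokens_per_step = 524288
steps = 8200
seconds = 600
total_tokens = tokens_per_step * steps      # about 4.3e9 tokens seen
tokens_per_sec = total_tokens / seconds     # about 7.2e6 tokens/s sustained
```

So the smaller batch trades per-step tokens for step count while holding total token throughput to what the 600 s budget allows.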
other
Higher learning rates for matrix and scalar parameters
parameters: {"matrix_lr":0.025,"scalar_lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_iters":4500}
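A sketch of the warmdown schedule: hold the base LR, then decay over the final 4500 iterations. The 0.025 base LR and ~8200 total steps come from the entries above; the exact decay shape (linear to zero) is an assumption:

```python
def lr_at(step, base_lr=0.025, total_steps=8200, warmdown_iters=4500):
    # Constant LR until the warmdown window, then linear decay to 0
    # over the final warmdown_iters steps (assumed linear shape).
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```

Halfway through the warmdown window (step 5950 here) the LR has dropped to half the base value, reaching zero exactly at the final step.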

Novel Contributions

  • Use of Exclusive Self Attention (XSA) on last 4 layers
  • LeakyReLU(0.5) squared activation function
  • Partial RoPE with rotary embeddings applied to only 16/64 dimensions
  • EMA weight averaging with decay=0.997
  • Int6 quantization applied to all large weight matrices
  • Scale clamping fix to improve quantization quality
  • Smaller batch size to enable more training steps within time limit
  • BigramHash token embeddings with hash size 2048 and dimension 128
  • Warmdown learning rate schedule with 4500 iterations
  • Higher learning rates for matrix and scalar parameters