PR #493 (open)

Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309)

by parinzee
val_bpb: 1.1309
Architecture: Transformer
Optimizer:
Artifact Size: 15.8 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention on last 4 layers for better representation
parameters: {"layers":4}
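The PR does not spell out what "exclusive" means here; a plausible reading is that each token is excluded from attending to its own position, on top of the usual causal mask. A minimal sketch of that mask, under that assumption (True = may attend):

```python
def xsa_mask(seq_len):
    # Causal mask that also excludes the diagonal: token i may attend
    # only to strictly earlier positions j < i.  The diagonal exclusion
    # is an assumed interpretation of "exclusive" self attention.
    # Note row 0 ends up with no allowed positions, so a real
    # implementation would need a fallback (e.g. an attention sink).
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]

mask = xsa_mask(4)
```

With seq_len=4, row 2 is [True, True, False, False]: position 2 sees positions 0 and 1 but not itself.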
LeakyReLU
Squared leaky ReLU with 0.5 negative slope
parameters: {"negative_slope":0.5,"power":2}
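A sketch of this activation, assuming the square is sign-preserving (y * |y|); a plain square would fold the negative branch onto positive values and discard the sign the leaky slope exists to keep, but the PR does not state which variant it uses:

```python
def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU first: pass positives through, scale negatives by the slope
    y = x if x >= 0 else negative_slope * x
    # Then square.  y * abs(y) keeps the sign (an assumption; see lead-in).
    return y * abs(y)
```

For example, squared_leaky_relu(2.0) gives 4.0, and squared_leaky_relu(-2.0) gives -1.0 (leaky branch: -1.0, then sign-preserving square).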
Partial RoPE
Only 16/64 dims use rotary embeddings
parameters: {"dims_used":16,"total_dims":64}
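A sketch of partial RoPE on a single 64-dim head vector: only the first 16 dims are rotated, the remaining 48 pass through unchanged. The pairing scheme and frequency base follow the standard RoPE recipe and are assumptions; only the 16/64 split comes from the PR:

```python
import math

def partial_rope(x, pos, dims_used=16, base=10000.0):
    # x: per-head vector (list of floats); rotate the first dims_used dims
    # in adjacent pairs, leave dims [dims_used:] untouched.
    out = list(x)
    for i in range(dims_used // 2):
        theta = pos / (base ** (2 * i / dims_used))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

At pos=0 every rotation angle is zero, so the vector is unchanged; at any other position only the first 16 entries move.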
BigramHash
BigramHash token embeddings
parameters: {"hash_size":2048,"dim":128}
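A sketch of hashed bigram embeddings with the PR's table shape (2048 buckets, 128 dims): each (previous token, current token) pair is hashed into a bucket, and the bucket's vector is looked up (typically added to the regular token embedding). The mixing constant is an assumption; only the table shape comes from the PR:

```python
import random

HASH_SIZE, DIM = 2048, 128
random.seed(0)
# Hypothetical learned table: HASH_SIZE rows of DIM-dim vectors
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(HASH_SIZE)]

def bigram_hash(prev_tok, tok, hash_size=HASH_SIZE):
    # Mix the two token ids into one bucket index.  The multiplier is an
    # arbitrary illustrative choice, not the PR's actual hash.
    return ((prev_tok * 1000003) ^ tok) % hash_size

def bigram_embedding(prev_tok, tok):
    return table[bigram_hash(prev_tok, tok)]
```

Collisions are expected and tolerated: 2048 buckets cannot distinguish all bigrams, but the hashed table stays small while still injecting local-context signal.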
KV head count
8 heads / 4 KV heads (GQA)
parameters: {"heads":8,"kv_heads":4}
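With 8 query heads and 4 KV heads, GQA shares each KV head across a group of query heads. The index mapping is just integer division by the group size:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    # Each group of n_heads // n_kv_heads query heads shares one KV head;
    # here that is 2 query heads per KV head.
    group = n_heads // n_kv_heads
    return query_head // group
```

So query heads 0-1 read KV head 0, heads 2-3 read KV head 1, and so on, halving KV cache size versus full multi-head attention.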
layers
Increased number of layers from 10 to 11
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.997}
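The EMA update itself is one line per parameter; with decay 0.997 the averaged weights track the live weights with an effective window of roughly 1/(1-0.997) ≈ 333 steps. A minimal sketch on flat parameter lists:

```python
def ema_update(avg, params, decay=0.997):
    # In-place exponential moving average: avg <- decay*avg + (1-decay)*p.
    # At evaluation time the averaged copy is used instead of the live weights.
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p
    return avg
```

Starting from avg=[0.0] and one update with p=1.0, the average moves to 0.003, i.e. (1 - 0.997) of the way toward the new value.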
Quantization
int6
bits: 6
scope: all large weight matrices
Compression
zstd
level: 22
Other
other
Scale clamping fix with clamp_min(1/clip_range) to improve quantization quality
parameters: null
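A sketch combining the int6 quantization above with the scale-clamping fix mentioned here: symmetric 6-bit levels in [-31, 31], with the per-tensor scale floored at 1/clip_range so that near-zero weight groups cannot produce a degenerate scale. clip_range=32 is an assumed illustrative default; only the clamp_min(1/clip_range) form comes from the PR:

```python
def quantize_int6(w, clip_range=32.0):
    # Symmetric int6: signed levels in [-31, 31] (2**(6-1) - 1).
    qmax = 31
    # Scale clamp mirroring the PR's fix: never let the scale fall
    # below 1/clip_range, even for an all-near-zero tensor.
    scale = max(max(abs(x) for x in w) / qmax, 1.0 / clip_range)
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]
```

Without the clamp, a tensor of tiny weights would get a tiny scale and the round-trip error blows up relative to the rest of the model; with it, such tensors just quantize to zeros at a sane scale.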
other
Smaller batch size (524288 tokens) to fit more training steps (~8200 steps in 600s)
parameters: {"batch_size_tokens":524288,"training_steps":8200,"training_time_seconds":600}
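The throughput implied by these parameters is easy to check; all numbers below come from the entry above:

```python
# Rough throughput implied by the PR's batch/step/time parameters
tokens_per_step = 524288
steps = 8200
seconds = 600
total_tokens = tokens_per_step * steps      # about 4.3e9 tokens seen
tokens_per_sec = total_tokens / seconds     # about 7.2e6 tokens/s sustained
```

So the smaller batch trades per-step tokens for step count while holding total token throughput to what the 600 s budget allows.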
other
Higher learning rates for matrix and scalar parameters
parameters: {"matrix_lr":0.025,"scalar_lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_iters":4500}
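A sketch of the warmdown schedule: hold the base LR, then decay over the final 4500 iterations. The 0.025 base LR and ~8200 total steps come from the entries above; the exact decay shape (linear to zero) is an assumption:

```python
def lr_at(step, base_lr=0.025, total_steps=8200, warmdown_iters=4500):
    # Constant LR until the warmdown window, then linear decay to 0
    # over the final warmdown_iters steps (assumed linear shape).
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```

Halfway through the warmdown window (step 5950 here) the LR has dropped to half the base value, reaching zero exactly at the final step.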

Novel Contributions

  • Use of Exclusive Self Attention (XSA) on last 4 layers
  • LeakyReLU(0.5) squared activation function
  • Partial RoPE with rotary embeddings applied to only 16/64 dimensions
  • EMA weight averaging with decay=0.997
  • Int6 quantization applied to all large weight matrices
  • Scale clamping fix to improve quantization quality
  • Smaller batch size to enable more training steps within time limit
  • BigramHash token embeddings with hash size 2048 and dimension 128
  • Warmdown learning rate schedule with 4500 iterations
  • Higher learning rates for matrix and scalar parameters