PR #1253

open

[non_record_16mb] 12L dim=448 LeakyReLU^2 BGVOCAB=2048 GH200 proxy (val_bpb=1.2326)

by Okropniak

val_bpb

1.2326

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.58 MB

Training Techniques

Architecture

LeakyReLU

Uses LeakyReLU squared activation instead of ReLU squared.

parameters: {"slope":0.1}
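A minimal sketch of the activation, assuming it is simply the elementwise square of LeakyReLU with the listed slope of 0.1. The function name is hypothetical, and whether the PR preserves the sign of the negative branch after squaring is not stated here; this version does not.

```python
def leaky_relu_squared(x: float, slope: float = 0.1) -> float:
    """Return LeakyReLU(x) ** 2 for a single scalar.

    Assumption: the negative branch is scaled by `slope` and then squared,
    so the output is always non-negative.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

Unlike ReLU squared, negative inputs still produce a small non-zero output and gradient, which is the usual motivation for the leaky variant.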

BigramHash

Adds a bigram vocabulary and a bigram embedding pathway.

parameters: {"bigram_vocab_size":2048,"bigram_dim":1024}
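One way a hashed bigram vocabulary can be indexed is sketched below. The mixing constant, the function names, and the choice of bucket 0 for the first position are all assumptions for illustration; only the bucket count of 2048 comes from the parameters above.

```python
BIGRAM_VOCAB_SIZE = 2048  # from the listed bigram_vocab_size


def bigram_bucket(prev_token: int, cur_token: int,
                  buckets: int = BIGRAM_VOCAB_SIZE) -> int:
    """Map a (previous, current) token pair to a bucket index.

    Assumption: a simple multiplicative hash; the PR's actual hash may differ.
    """
    return (prev_token * 1000003 + cur_token) % buckets


def bigram_ids(tokens: list[int]) -> list[int]:
    """Return one bigram bucket per position.

    Assumption: the first position has no predecessor and uses bucket 0.
    """
    pairs = zip(tokens, tokens[1:])
    return [0] + [bigram_bucket(a, b) for a, b in pairs] if tokens else []
```

Each bucket index would then select a row of a bigram embedding table (here of width `bigram_dim` = 1024) that is combined with the regular token embedding.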

GQA

Uses grouped query attention with fewer KV heads than attention heads.

parameters: {"num_heads":8,"num_kv_heads":4}
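With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below shows the common contiguous grouping; the helper name is hypothetical and the PR may assign groups differently.

```python
def kv_head_for(query_head: int, num_heads: int = 8,
                num_kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with.

    Assumption: query heads are grouped contiguously, so heads 0 and 1
    share KV head 0, heads 2 and 3 share KV head 1, and so on.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    return query_head // group_size
```

Halving the number of KV heads halves the key and value projection parameters, which matters when the artifact has to fit a fixed size budget.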

XSA

Applies XSA in the last 4 layers.

parameters: {"last_n_layers":4}

VE128

Uses value residual / VE layers in layers 10 and 11, with dynamic placement.

parameters: {"layers":"10,11"}

Weight Averaging

EMA

parameters: {"decay":0.995}
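The EMA update with the listed decay can be sketched as follows, using plain lists of floats in place of model tensors. The function name is hypothetical, and how often the PR applies the update is not stated.

```python
def ema_update(shadow: list[float], weights: list[float],
               decay: float = 0.995) -> list[float]:
    """Return the new EMA ("shadow") weights after one update.

    Each shadow value moves a fraction (1 - decay) toward the live weight.
    """
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

With a decay of 0.995, the average has an effective horizon of roughly 1 / (1 - 0.995) = 200 updates.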

Quantization

int6

bits: 6

scope: all
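A minimal sketch of symmetric 6-bit quantization, assuming a signed range of [-31, 31] and one shared scale per group of values. The scale granularity, the rounding mode, and the function names are assumptions; the listing above only states 6 bits applied to all tensors.

```python
def quantize_int6(values: list[float]) -> tuple[list[int], float]:
    """Quantize floats to integers in [-31, 31] with a single shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 31.0 if max_abs > 0.0 else 1.0
    # Round to the nearest level, then clamp to guard against float error.
    q = [max(-31, min(31, round(v / scale))) for v in values]
    return q, scale


def dequantize_int6(q: list[int], scale: float) -> list[float]:
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]
```

Stored densely, 6 bits per weight is 0.75 bytes, so the entropy coder applied afterwards works on an already compact representation.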

Compression

zstd

level: 22

Evaluation

stride-based eval

parameters: {"stride":64}
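Stride-based evaluation slides a fixed-length context window over the validation tokens and scores only the tokens that have not been scored yet, so most tokens are predicted with a long context. The sketch below assumes this standard sliding-window scheme; `context_len` is a hypothetical parameter, since the context length is not given above.

```python
def stride_windows(num_tokens: int, context_len: int,
                   stride: int = 64) -> list[tuple[int, int, int]]:
    """Return (start, end, first_scored) triples covering all tokens.

    The model sees tokens[start:end] but only tokens[first_scored:end]
    contribute to the loss, so every token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == num_tokens:
            break
        start += stride
    return windows
```

A smaller stride gives each scored token more context at the cost of more forward passes.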

LR Schedule

warmdown

parameters: {"warmdown_steps":500}
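A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 500 steps; `total_steps` is a hypothetical argument, and the PR's actual decay shape is not stated.

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 500) -> float:
    """Return the multiplier applied to the base learning rate at `step`.

    Assumption: constant at 1.0, then linear decay to 0.0 during the
    last `warmdown_steps` steps.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

For a hypothetical 2000-step run, the multiplier stays at 1.0 through step 1500 and reaches 0.5 at step 1750.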

Optimizer

Muon

weight_decay: 0.025

momentum: 0.947

other_params: {"adam_wd":0.0014,"matrix_lr":0.068,"scalar_lr":0.042,"grad_clip_norm":0.308,"beta2":0.986,"momentum_warmup_steps":1644}
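The `grad_clip_norm` value of 0.308 suggests global-norm gradient clipping. A minimal sketch under that assumption, using a flat list of floats in place of parameter gradients (the function name is hypothetical):

```python
import math


def clip_by_global_norm(grads: list[float],
                        max_norm: float = 0.308) -> list[float]:
    """Rescale gradients so their L2 norm does not exceed `max_norm`.

    Gradients whose norm is already within the limit are returned unchanged.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    factor = max_norm / total
    return [g * factor for g in grads]
```

Clipping rescales the whole gradient vector by one factor, so its direction is preserved and only its magnitude is bounded.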

Novel Contributions

LeakyReLU squared activation replacing ReLU squared
Bigram vocabulary and embedding expansion to 2048
12-layer, 448-dimension proxy-scale architecture
EMA decay calibrated to 0.995
Optuna v1 TPE hyperparameter tuning
int6 quantization with zstd-22 compression
Stride-64 evaluation on GH200 proxy hardware