PR #1253

open

[non_record_16mb] 12L dim=448 LeakyReLU^2 BGVOCAB=2048 GH200 proxy (val_bpb=1.2326)

by Okropniak
val_bpb: 1.2326
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.58 MB
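
For readers unfamiliar with the headline metric, the sketch below shows how a bits-per-byte score is commonly derived from a mean cross-entropy loss. It assumes the loss is measured in nats per token; the function name and arguments are hypothetical and not taken from this PR.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte.

    Assumption: the loss is averaged over n_tokens tokens that together
    encode n_bytes bytes of raw validation text.
    """
    bits_per_token = mean_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token * n_tokens / n_bytes     # per token -> per byte
```

Because the score is normalized by bytes rather than tokens, runs with different tokenizers remain comparable.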

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU squared activation instead of ReLU squared.
parameters: {"slope":0.1}
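
A minimal scalar sketch of the activation described above, assuming it is simply LeakyReLU followed by squaring. Whether the PR preserves the sign of negative inputs is not stated, so this version does not.

```python
def relu_squared(x: float) -> float:
    """Baseline activation: ReLU followed by squaring."""
    y = max(x, 0.0)
    return y * y


def leaky_relu_squared(x: float, slope: float = 0.1) -> float:
    """LeakyReLU with the given negative slope, followed by squaring.

    Assumption: the output is the plain square, so negative inputs
    map to small positive values instead of zero.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

With slope 0.1, a negative input contributes (0.1 * x)^2 rather than being zeroed, so some gradient still flows through the negative half.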
BigramHash
Adds bigram vocabulary and bigram embedding pathway.
parameters: {"bigram_vocab_size":2048,"bigram_dim":1024}
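
The sketch below illustrates one way a bigram hash can map adjacent token pairs into a 2048-entry bigram vocabulary. The hash function and its multiplier are hypothetical; the PR summary does not say which hash is used.

```python
def bigram_bucket(prev_token: int, token: int, bigram_vocab_size: int = 2048) -> int:
    """Map a (previous token, current token) pair to a bucket index.

    Assumption: a simple multiplicative hash; the real one may differ.
    """
    return (prev_token * 1000003 + token) % bigram_vocab_size


def bigram_ids(tokens: list[int], bigram_vocab_size: int = 2048) -> list[int]:
    """Return one bucket index per adjacent token pair in the sequence."""
    return [
        bigram_bucket(tokens[i - 1], tokens[i], bigram_vocab_size)
        for i in range(1, len(tokens))
    ]
```

Each bucket index would then select a row of a bigram embedding table (here 1024-dimensional, per `bigram_dim`). Distinct pairs that collide share a row.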
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
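
With 8 query heads and 4 KV heads, each KV head serves 2 query heads. The sketch below shows the usual contiguous grouping; the function name is hypothetical and the grouping order is an assumption.

```python
def kv_head_for_query_head(q_head: int, num_heads: int = 8, num_kv_heads: int = 4) -> int:
    """Return the index of the KV head shared by the given query head.

    Assumption: query heads are grouped contiguously, so heads 0..g-1
    share KV head 0, heads g..2g-1 share KV head 1, and so on.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size
```

Halving the number of KV heads halves the key and value projection parameters, which matters under a fixed artifact-size budget.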
XSA
Applies XSA in the last 4 layers.
parameters: {"last_n_layers":4}
VE128
Uses value residual / VE layers in the last layers (10 and 11) with dynamic placement.
parameters: {"layers":"10,11"}
Weight Averaging
EMA
parameters: {"decay":0.995}
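
A minimal sketch of the EMA weight update with decay 0.995, written over plain Python lists for clarity. The function name is hypothetical.

```python
def ema_update(ema: list[float], weights: list[float], decay: float = 0.995) -> list[float]:
    """Blend the current weights into the running average.

    Each averaged value keeps `decay` of its old value and takes
    (1 - decay) of the freshly trained weight.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

A decay of 0.995 corresponds to an effective averaging horizon of roughly 1 / (1 - 0.995) = 200 steps.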
Quantization
int6
bits: 6
scope: all
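
The sketch below shows symmetric 6-bit quantization of a single row of weights. The per-row scale and the choice of 31 as the scaling target are assumptions; the PR summary only states 6 bits applied to all tensors.

```python
def quantize_int6(row: list[float]) -> tuple[list[int], float]:
    """Quantize a row of floats to signed 6-bit integers in [-32, 31].

    Assumption: one symmetric scale per row, chosen so the largest
    magnitude maps to 31.
    """
    max_abs = max((abs(v) for v in row), default=0.0)
    scale = max_abs / 31.0 if max_abs > 0.0 else 1.0
    quantized = [max(-32, min(31, round(v / scale))) for v in row]
    return quantized, scale


def dequantize_int6(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the 6-bit codes."""
    return [q * scale for q in quantized]
```

The rounding error per weight is at most half a quantization step, that is `scale / 2`.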
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
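
Stride-based evaluation slides a context window over the validation tokens and scores only the tokens not already scored by an earlier window. The sketch below computes the window boundaries; `context_len` is a hypothetical parameter, since the PR summary gives only the stride.

```python
def strided_windows(n_tokens: int, context_len: int, stride: int = 64) -> list[tuple[int, int, int]]:
    """Return (begin, end, n_scored) for each evaluation window.

    n_scored counts the tokens at the tail of the window that are new,
    so every token is scored exactly once.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes.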
LR Schedule
warmdown
parameters: {"warmdown_steps":500}
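
A minimal sketch of a warmdown schedule: the learning rate stays at its base value and then decays over the final 500 steps. Linear decay to zero and the `total_steps` argument are assumptions; the PR summary specifies only `warmdown_steps`.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 500) -> float:
    """Return the factor applied to the base learning rate at `step`.

    Assumption: constant at 1.0, then linear decay to 0.0 across the
    last `warmdown_steps` steps of training.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

For a hypothetical 2000-step run, the multiplier is 1.0 through step 1500, 0.5 at step 1750, and 0.0 at step 2000.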
Optimizer
Muon
weight_decay: 0.025
momentum: 0.947
other_params: {"adam_wd":0.0014,"matrix_lr":0.068,"scalar_lr":0.042,"grad_clip_norm":0.308,"beta2":0.986,"momentum_warmup_steps":1644}
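
The `grad_clip_norm` setting above can be illustrated with global-norm clipping over a flat list of gradient values. This is a generic sketch, not the PR's training code, and the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float = 0.308) -> list[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm.

    Gradients already within the limit are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    factor = max_norm / total_norm
    return [g * factor for g in grads]
```

Clipping preserves the gradient direction and only shrinks its magnitude.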

Novel Contributions

  • LeakyReLU squared activation replacing ReLU squared
  • Bigram vocabulary and embedding expansion to 2048
  • 12-layer, 448-dimension proxy-scale architecture
  • EMA decay calibrated to 0.995
  • Optuna v1 TPE hyperparameter tuning
  • int6 quantization with zstd-22 compression
  • Stride-64 evaluation on GH200 proxy hardware