PR #1253
open [non_record_16mb] 12L dim=448 LeakyReLU^2 BGVOCAB=2048 GH200 proxy (val_bpb=1.2326)
by OkropniakView on GitHub
val_bpb
1.2326
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.58 MB
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU activation (negative slope 0.1) in place of squared ReLU.
parameters: {"slope":0.1}
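A minimal sketch of what a squared LeakyReLU with slope 0.1 could look like on a single scalar. The function name `leaky_relu_squared` is hypothetical, and the sketch assumes a plain square (so the sign of negative inputs is not preserved); the PR's actual implementation may differ.

```python
def leaky_relu_squared(x: float, slope: float = 0.1) -> float:
    """Apply LeakyReLU with the given negative slope, then square the result.

    Assumption: a plain square, so negative inputs yield small positive
    outputs, (slope * x) ** 2, instead of the exact zero that squared ReLU gives.
    """
    y = x if x >= 0.0 else slope * x
    return y * y


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(value, leaky_relu_squared(value))
```

Unlike squared ReLU, this variant keeps a nonzero gradient for negative inputs, which is the usual motivation for a leaky activation.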
BigramHash
Adds a bigram vocabulary and a bigram embedding pathway.
parameters: {"bigram_vocab_size":2048,"bigram_dim":1024}
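One way a hashed bigram vocabulary of size 2048 might be indexed is sketched below. The mixing function, the multiplier `1_000_003`, the names `bigram_bucket` and `bigram_ids`, and the choice of bucket 0 for the first position are all assumptions for illustration; the PR does not state its hash.

```python
BIGRAM_VOCAB_SIZE = 2048  # from the listed parameters


def bigram_bucket(prev_token: int, token: int,
                  vocab_size: int = BIGRAM_VOCAB_SIZE) -> int:
    """Map a (previous token, current token) pair to a bucket index.

    Hypothetical hash: multiply-and-add, reduced modulo the bigram vocabulary.
    """
    return (prev_token * 1_000_003 + token) % vocab_size


def bigram_ids(tokens: list[int]) -> list[int]:
    """Return one bigram bucket per position in the sequence.

    Assumption: the first token has no predecessor and is assigned bucket 0.
    """
    if not tokens:
        return []
    ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur))
    return ids


if __name__ == "__main__":
    print(bigram_ids([17, 903, 17, 903]))
```

Each bucket index would then select a row of a bigram embedding table (2048 rows of width 1024, per the listed parameters) that is combined with the regular token embedding.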
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
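With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below shows the common contiguous grouping; the function name `kv_head_for` is hypothetical, and contiguous grouping is an assumption about this PR.

```python
def kv_head_for(query_head: int, num_heads: int = 8, num_kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with.

    Assumption: query heads are grouped contiguously, so heads 0-1 share
    KV head 0, heads 2-3 share KV head 1, and so on.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_heads:
        raise ValueError("query_head out of range")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


if __name__ == "__main__":
    print([kv_head_for(h) for h in range(8)])
```

Halving the number of KV heads halves the key and value projection parameters, which matters under a 16 MB artifact budget.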
XSA
Applies XSA in the last 4 layers.
parameters: {"last_n_layers":4}
VE128
Uses value residual / VE layers in layers 10 and 11, with dynamic placement.
parameters: {"layers":"10,11"}
Weight Averaging
EMA
parameters: {"decay":0.995}
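The standard exponential moving average update with decay 0.995 can be sketched as follows. Weights are represented as a flat list of floats for simplicity, and `ema_update` is a hypothetical name; how often the PR applies the update is not stated.

```python
def ema_update(ema: list[float], weights: list[float],
               decay: float = 0.995) -> list[float]:
    """Return the new EMA: decay * old_ema + (1 - decay) * current_weights."""
    if len(ema) != len(weights):
        raise ValueError("ema and weights must have the same length")
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


if __name__ == "__main__":
    ema = [0.0, 0.0]
    for _ in range(200):  # roughly one averaging horizon: 1 / (1 - 0.995) = 200
        ema = ema_update(ema, [1.0, -1.0])
    print(ema)
```

A decay of 0.995 corresponds to an effective averaging window of about 200 updates.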
Quantization
int6
bits: 6
scope: all
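As an illustration of 6-bit quantization, the sketch below uses symmetric per-row scaling into the integer range [-31, 31]. The scheme (symmetric, per-row, round-to-nearest) and the names `quantize_int6` and `dequantize_int6` are assumptions; the PR only states 6 bits applied to all weights.

```python
INT6_MAX = 31  # symmetric range [-31, 31] fits in 6 signed bits


def quantize_int6(row: list[float]) -> tuple[list[int], float]:
    """Quantize one row of weights to 6-bit integers plus a float scale."""
    max_abs = max((abs(v) for v in row), default=0.0)
    scale = max_abs / INT6_MAX if max_abs > 0.0 else 1.0
    q = [max(-INT6_MAX, min(INT6_MAX, round(v / scale))) for v in row]
    return q, scale


def dequantize_int6(q: list[int], scale: float) -> list[float]:
    """Reconstruct approximate weights from integers and the row scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    q, scale = quantize_int6([0.5, -1.0, 0.25, 0.0])
    print(q, scale, dequantize_int6(q, scale))
```

Under this scheme the reconstruction error per weight is at most half the row scale.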
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
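Stride-based evaluation is commonly implemented as a sliding window: each window advances by the stride, and only tokens not yet scored contribute to the loss, so every token is counted once with as much left context as available. The sketch below assumes that scheme; `eval_windows` and `context_len` are hypothetical, since the PR does not list the evaluation context length.

```python
def eval_windows(num_tokens: int, context_len: int,
                 stride: int = 64) -> list[tuple[int, int, int]]:
    """Return (start, end, score_from) triples for sliding-window evaluation.

    The model sees tokens[start:end]; only tokens[score_from:end] are scored.
    """
    if stride <= 0 or stride > context_len:
        raise ValueError("stride must be in 1..context_len")
    windows = []
    start = 0
    scored_upto = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


if __name__ == "__main__":
    for window in eval_windows(num_tokens=300, context_len=128):
        print(window)
```

A smaller stride gives each scored token more context at the cost of more forward passes.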
LR Schedule
warmdown
parameters: {"warmdown_steps":500}
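A warmdown of 500 steps can be read as holding the learning rate constant and then decaying it over the final 500 steps. The sketch below assumes a linear decay to zero; `lr_scale` and `total_steps` are hypothetical, as the PR lists neither the decay shape nor the total step count.

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 500) -> float:
    """Return a multiplier for the base learning rate at the given step.

    Assumption: constant at 1.0, then linear decay to 0.0 over the last
    warmdown_steps steps.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


if __name__ == "__main__":
    for step in (0, 1500, 1750, 2000):
        print(step, lr_scale(step, total_steps=2000))
```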
Optimizer
Muon
weight_decay: 0.025
momentum: 0.947
other_params: {"adam_wd":0.0014,"matrix_lr":0.068,"scalar_lr":0.042,"grad_clip_norm":0.308,"beta2":0.986,"momentum_warmup_steps":1644}
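The `grad_clip_norm` value of 0.308 suggests gradient clipping by global L2 norm. A minimal sketch of that operation on a flat list of gradient values is shown below; `clip_by_global_norm` is a hypothetical name, and whether the PR clips globally or per parameter group is an assumption.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float = 0.308) -> list[float]:
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit; leave unchanged
    factor = max_norm / total_norm
    return [g * factor for g in grads]


if __name__ == "__main__":
    clipped = clip_by_global_norm([3.0, 4.0])
    print(clipped, math.sqrt(sum(g * g for g in clipped)))
```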
Novel Contributions
- LeakyReLU squared activation replacing ReLU squared
- Bigram vocabulary and embedding expansion to 2048
- 12-layer, 448-dimension proxy-scale architecture
- EMA decay calibrated to 0.995
- Optuna v1 TPE hyperparameter tuning
- int6 quantization with zstd-22 compression
- Stride-64 evaluation on GH200 proxy hardware