PR #1174 (open)

[Non-Record] 5L MLP×4 EMA=0.97 Optuna — GH200 proxy, val_bpb=1.3069 (int6+zlib)

by Okropniak
val_bpb: 1.3069
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.6 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
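The int6 step can be sketched as symmetric per-tensor quantization into the signed 6-bit range [-31, 31]; the PR only states bits=6 and scope=all, so the scaling scheme below is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization sketch: floats map to
    integers in [-31, 31]. Hypothetical helper, not the PR's code."""
    qmax = 2 ** (6 - 1) - 1                          # 31 for signed 6-bit
    scale = max(float(np.max(np.abs(w))) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the round-trip error is bounded by half a quantization step (scale / 2).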
Architecture
MLP3x
Increased MLP multiplier to 4.0 (hidden size 2048) in a 5-layer Transformer with GQA and bigram features.
parameters: {"num_layers":5,"model_dim":512,"mlp_mult":4,"num_heads":8,"num_kv_heads":4,"bigram_vocab_size":4096,"bigram_dim":1024}
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
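With 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving the KV cache. A minimal numpy sketch of that sharing (causal mask omitted; shapes are illustrative, not the PR's kernel):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch.
    q: (T, num_heads, d); k, v: (T, num_kv_heads, d) with
    num_kv_heads dividing num_heads (8 and 4 in the PR)."""
    group = q.shape[1] // k.shape[1]
    k = np.repeat(k, group, axis=1)   # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)       # stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', p, v)
```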
BigramHash
Added bigram vocabulary/dimension features.
parameters: {"bigram_vocab_size":4096,"bigram_dim":1024}
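Bigram hashing maps each (previous token, current token) pair into a fixed 4096-entry vocabulary whose ids index an auxiliary embedding table. The PR does not spell out the hash, so the mixing constant and the position-0 placeholder below are illustrative:

```python
def bigram_hash_ids(tokens, bigram_vocab_size=4096):
    """Hash consecutive token pairs into a fixed bigram vocabulary.
    The multiplier 1000003 and the initial prev=0 are assumptions."""
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % bigram_vocab_size)
        prev = t
    return ids
```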
VE128
Fixed VE layer targeting so the last two layers of the 5-layer model are used.
parameters: {"ve_layers":[3,4]}
Weight Averaging
EMA
parameters: {"decay":0.97}
Compression
zlib
level: null
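The artifact pipeline packs the int6 codes and zlib-compresses the result. One way to pack four 6-bit values into three bytes (the packing layout is an assumption, and level=9 is a guess since the PR leaves level null):

```python
import zlib
import numpy as np

def pack_compress_int6(q, level=9):
    """Pack int6 values in [-31, 31] four-per-three-bytes, then zlib."""
    u = (q.astype(np.int16) + 32).astype(np.uint8)        # shift to [1, 63]
    u = np.pad(u.ravel(), (0, (-u.size) % 4))             # pad to multiple of 4
    a, b, c, d = u[0::4], u[1::4], u[2::4], u[3::4]
    packed = np.empty(3 * a.size, dtype=np.uint8)
    packed[0::3] = (a << 2) | (b >> 4)
    packed[1::3] = ((b & 0xF) << 4) | (c >> 2)
    packed[2::3] = ((c & 0x3) << 6) | d
    return zlib.compress(packed.tobytes(), level)

def unpack_int6(blob, n):
    """Invert pack_compress_int6, returning the first n int6 values."""
    p = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    a = p[0::3] >> 2
    b = ((p[0::3] & 0x3) << 4) | (p[1::3] >> 4)
    c = ((p[1::3] & 0xF) << 2) | (p[2::3] >> 6)
    d = p[2::3] & 0x3F
    u = np.stack([a, b, c, d], axis=1).ravel()[:n]
    return u.astype(np.int16) - 32
```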
Evaluation
stride-based eval
parameters: {"stride":64}
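With stride equal to eval_length (both 64 here), strided evaluation reduces to tiling the validation stream into disjoint windows so each token is scored exactly once. An indexing sketch:

```python
def stride_eval_windows(n_tokens, stride=64):
    """Disjoint stride-64 eval windows; the last window may be short.
    Indexing only, assuming stride == eval_length as in this PR."""
    return [(s, min(s + stride, n_tokens)) for s in range(0, n_tokens, stride)]
```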
LR Schedule
warmdown
parameters: {"warmdown_steps":100}
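A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps (shortened from 3500 to 100 in this PR). A sketch, with base_lr illustrative:

```python
def warmdown_lr(step, total_steps, warmdown_steps=100, base_lr=1.0):
    """Constant LR until the final warmdown_steps, then linear decay to 0."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```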
Optimizer
Muon
weight_decay: 0.025
momentum: 0.947
other_params: {"adam_wd":0.0014,"matrix_lr":0.068,"scalar_lr":0.042,"grad_clip_norm":0.308,"muon_beta2":0.986,"muon_momentum_warmup_steps":1644}
Sequence Length
sequence_length
train_length: null
eval_length: 64

Novel Contributions

  • Proxy-scale submission run on a GH200 MIG slice, demonstrating the research methodology ahead of H100 access
  • Optuna v1 TPE hyperparameter search with 25 trials
  • Warmdown schedule shortened from 3500 to 100 iterations
  • Fixed VE layer targeting for a 5-layer model
  • Calibrated EMA decay to 0.97 for short proxy runs
  • Used int6 quantization with zlib compression to fit artifact constraints
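The 25-trial search can be sketched as a loop over sampled hyperparameter dicts that keeps the lowest val_bpb. The parameter names mirror the PR's tuned values, but the ranges are illustrative assumptions, and plain random sampling stands in for Optuna's TPE sampler to keep the sketch dependency-free:

```python
import random

def sample_trial(rng):
    """One hyperparameter draw; ranges are illustrative, not the PR's."""
    return {
        "matrix_lr": rng.uniform(0.01, 0.1),
        "scalar_lr": rng.uniform(0.01, 0.1),
        "momentum": rng.uniform(0.90, 0.99),
        "ema_decay": rng.uniform(0.90, 0.99),
    }

def search(objective, n_trials=25, seed=0):
    """Keep the trial minimizing objective (val_bpb: lower is better).
    Random sampling stands in for Optuna's TPE sampler here."""
    rng = random.Random(seed)
    trials = [sample_trial(rng) for _ in range(n_trials)]
    return min(trials, key=objective)
```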