PR #1174 (open)

[Non-Record] 5L MLP×4 EMA=0.97 Optuna — GH200 proxy, val_bpb=1.3069 (int6+zlib)

by Okropniak
val_bpb: 1.3069
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.6 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
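The int6 step can be sketched as symmetric per-tensor quantization into the signed 6-bit range [-31, 31]; the PR only states bits=6 and scope=all, so the scaling scheme below is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization sketch: floats map to
    integers in [-31, 31]. Hypothetical helper, not the PR's code."""
    qmax = 2 ** (6 - 1) - 1                          # 31 for signed 6-bit
    scale = max(float(np.max(np.abs(w))) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the round-trip error is bounded by half a quantization step (scale / 2).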
Architecture
MLP3x
Increased MLP multiplier to 4.0 (hidden size 2048) in a 5-layer Transformer with GQA and bigram features.
parameters: {"num_layers":5,"model_dim":512,"mlp_mult":4,"num_heads":8,"num_kv_heads":4,"bigram_vocab_size":4096,"bigram_dim":1024}
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
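With 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving the KV cache. A minimal numpy sketch of that sharing (causal mask omitted; shapes are illustrative, not the PR's kernel):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch.
    q: (T, num_heads, d); k, v: (T, num_kv_heads, d) with
    num_kv_heads dividing num_heads (8 and 4 in the PR)."""
    group = q.shape[1] // k.shape[1]
    k = np.repeat(k, group, axis=1)   # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)       # stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', p, v)
```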
BigramHash
Added bigram vocabulary/dimension features.
parameters: {"bigram_vocab_size":4096,"bigram_dim":1024}
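Bigram hashing maps each (previous token, current token) pair into a fixed 4096-entry vocabulary whose ids index an auxiliary embedding table. The PR does not spell out the hash, so the mixing constant and the position-0 placeholder below are illustrative:

```python
def bigram_hash_ids(tokens, bigram_vocab_size=4096):
    """Hash consecutive token pairs into a fixed bigram vocabulary.
    The multiplier 1000003 and the initial prev=0 are assumptions."""
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % bigram_vocab_size)
        prev = t
    return ids
```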
VE128
Fixed VE layer targeting so the last two layers of the 5-layer model are used.
parameters: {"ve_layers":[3,4]}
Weight Averaging
EMA
parameters: {"decay":0.97}
Compression
zlib
level: null
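The artifact pipeline packs the int6 codes and zlib-compresses the result. One way to pack four 6-bit values into three bytes (the packing layout is an assumption, and level=9 is a guess since the PR leaves level null):

```python
import zlib
import numpy as np

def pack_compress_int6(q, level=9):
    """Pack int6 values in [-31, 31] four-per-three-bytes, then zlib."""
    u = (q.astype(np.int16) + 32).astype(np.uint8)        # shift to [1, 63]
    u = np.pad(u.ravel(), (0, (-u.size) % 4))             # pad to multiple of 4
    a, b, c, d = u[0::4], u[1::4], u[2::4], u[3::4]
    packed = np.empty(3 * a.size, dtype=np.uint8)
    packed[0::3] = (a << 2) | (b >> 4)
    packed[1::3] = ((b & 0xF) << 4) | (c >> 2)
    packed[2::3] = ((c & 0x3) << 6) | d
    return zlib.compress(packed.tobytes(), level)

def unpack_int6(blob, n):
    """Invert pack_compress_int6, returning the first n int6 values."""
    p = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    a = p[0::3] >> 2
    b = ((p[0::3] & 0x3) << 4) | (p[1::3] >> 4)
    c = ((p[1::3] & 0xF) << 2) | (p[2::3] >> 6)
    d = p[2::3] & 0x3F
    u = np.stack([a, b, c, d], axis=1).ravel()[:n]
    return u.astype(np.int16) - 32
```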
Evaluation
stride-based eval
parameters: {"stride":64}
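With stride equal to eval_length (both 64 here), strided evaluation reduces to tiling the validation stream into disjoint windows so each token is scored exactly once. An indexing sketch:

```python
def stride_eval_windows(n_tokens, stride=64):
    """Disjoint stride-64 eval windows; the last window may be short.
    Indexing only, assuming stride == eval_length as in this PR."""
    return [(s, min(s + stride, n_tokens)) for s in range(0, n_tokens, stride)]
```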
LR Schedule
warmdown
parameters: {"warmdown_steps":100}
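A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps (shortened from 3500 to 100 in this PR). A sketch, with base_lr illustrative:

```python
def warmdown_lr(step, total_steps, warmdown_steps=100, base_lr=1.0):
    """Constant LR until the final warmdown_steps, then linear decay to 0."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```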
Optimizer
Muon
weight_decay: 0.025
momentum: 0.947
other_params: {"adam_wd":0.0014,"matrix_lr":0.068,"scalar_lr":0.042,"grad_clip_norm":0.308,"muon_beta2":0.986,"muon_momentum_warmup_steps":1644}
Sequence Length
sequence_length
train_length: null
eval_length: 64

Novel Contributions

  • Proxy-scale submission run on a GH200 MIG slice, demonstrating the research methodology ahead of H100 access
  • Optuna v1 TPE hyperparameter search with 25 trials
  • Warmdown schedule shortened from 3500 to 100 iterations
  • Fixed VE layer targeting for a 5-layer model
  • Calibrated EMA decay to 0.97 for short proxy runs
  • Used int6 quantization with zlib compression to fit artifact constraints
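The 25-trial search can be sketched as a loop over sampled hyperparameter dicts that keeps the lowest val_bpb. The parameter names mirror the PR's tuned values, but the ranges are illustrative assumptions, and plain random sampling stands in for Optuna's TPE sampler to keep the sketch dependency-free:

```python
import random

def sample_trial(rng):
    """One hyperparameter draw; ranges are illustrative, not the PR's."""
    return {
        "matrix_lr": rng.uniform(0.01, 0.1),
        "scalar_lr": rng.uniform(0.01, 0.1),
        "momentum": rng.uniform(0.90, 0.99),
        "ema_decay": rng.uniform(0.90, 0.99),
    }

def search(objective, n_trials=25, seed=0):
    """Keep the trial minimizing objective (val_bpb: lower is better).
    Random sampling stands in for Optuna's TPE sampler here."""
    rng = random.Random(seed)
    trials = [sample_trial(rng) for _ in range(n_trials)]
    return min(trials, key=objective)
```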