PR #436
Non-Record: 8L + BigramHash(12288) + Systematic HyperOpt (val_bpb=1.2392, 1xH100, 129 experiments)
Status: open, by CrimsonSithria
val_bpb: 1.2392
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash
Uses BigramHash with a 12288-bucket embedding and a 128-dimensional linear projection to reduce artifact size while preserving quality.
parameters: {"buckets":12288,"dim":128}
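A minimal sketch of how such a bigram-hash embedding could work, assuming a simple multiplicative hash over (previous token, current token) pairs; the mixing constant, model width, and vocabulary size below are illustrative, not from the PR:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (prev, cur) token pair into one of `buckets` slots,
    embed at a small dimension, then project up to the model width.
    The 12288 x 128 table plus a 128 -> d_model projection is far
    smaller than a dense vocab^2 bigram table."""
    def __init__(self, buckets=12288, dim=128, d_model=768):
        super().__init__()
        self.buckets = buckets
        self.embed = nn.Embedding(buckets, dim)          # 12288 x 128
        self.proj = nn.Linear(dim, d_model, bias=False)  # 128 -> d_model

    def forward(self, tokens):  # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real bigram at the first position
        # illustrative hash; the PR's exact mixing scheme is not given
        h = (prev * 1000003 + tokens) % self.buckets
        return self.proj(self.embed(h))

x = torch.randint(0, 50257, (2, 16))
out = BigramHashEmbedding()(x)
print(out.shape)  # torch.Size([2, 16, 768])
```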
RoPE
Uses rotary positional embeddings with an optimized base for 2048 context.
parameters: {"base":50000}
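A sketch of rotary embeddings with the PR's base of 50000 (vs. the common default of 10000); the pairwise-rotation layout below is one standard formulation, not necessarily the PR's exact code:

```python
import torch

def rope_freqs(dim, seq_len, base=50000.0):
    """Precompute rotary angle tables; base=50000 is the tuned value
    reported for the 2048-token context."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)  # (seq_len, dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate consecutive channel pairs of x: (..., seq_len, dim)."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

cos, sin = rope_freqs(64, 2048)
q = torch.randn(1, 2048, 64)
q_rot = apply_rope(q, cos, sin)
```

Since each position only rotates channel pairs, the per-position norm is preserved, which is a quick sanity check on an implementation.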
tied embeddings
Ties the input embedding and output projection to one shared matrix, stored in FP16.
parameters: null
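Weight tying is a one-line sharing of the embedding matrix with the output head; a minimal sketch (the FP16 cast is omitted here for a CPU-friendly demo, and the vocab/width values are placeholders):

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Tied embeddings: the LM head reuses the input embedding matrix,
    halving embedding parameter count and artifact size."""
    def __init__(self, vocab_size=50304, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share one Parameter

m = TiedLMHead()
```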
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
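With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A sketch of that grouping, using PyTorch's fused attention kernel (head dim and shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q has `heads` heads, k/v have `kv_heads`
    heads that are repeated to cover heads // kv_heads query heads each."""
    B, T, _ = q.shape
    hd = q.shape[-1] // heads
    q = q.view(B, T, heads, hd).transpose(1, 2)     # (B, 8, T, hd)
    k = k.view(B, T, kv_heads, hd).transpose(1, 2)  # (B, 4, T, hd)
    v = v.view(B, T, kv_heads, hd).transpose(1, 2)
    rep = heads // kv_heads                          # 2 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)              # (B, 8, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, heads * hd)

q = torch.randn(2, 16, 8 * 64)
k = torch.randn(2, 16, 4 * 64)  # KV projections are half the size
v = torch.randn(2, 16, 4 * 64)
out = gqa(q, k, v)
```

The practical win is that the KV cache and KV projection weights shrink by heads/kv_heads = 2x.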
MLP
3x/4x MLP
Uses 4x MLP expansion with relu^2 activation for throughput-limited training.
parameters: {"mlp_mult":4}
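The MLP block itself is compact; a sketch with the 4x expansion and relu^2 activation (relu(x)**2 is cheap and kernel-friendly relative to GELU, which is why it suits throughput-limited runs). The model width is a placeholder:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """d_model -> 4*d_model -> d_model MLP with relu^2 activation."""
    def __init__(self, d_model=512, mlp_mult=4):
        super().__init__()
        self.fc_in = nn.Linear(d_model, mlp_mult * d_model, bias=False)
        self.fc_out = nn.Linear(mlp_mult * d_model, d_model, bias=False)

    def forward(self, x):
        return self.fc_out(torch.relu(self.fc_in(x)) ** 2)

y = ReluSquaredMLP()(torch.randn(2, 8, 512))
```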
Optimizer
Muon
weight_decay: 0.048
momentum: null
other_params: {"matrix_lr":0.03,"scalar_lr":0.03,"tied_embed_lr":0.08,"grad_clip_norm":0.3,"muon_backend_steps":5}
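The `muon_backend_steps: 5` setting refers to Muon's Newton-Schulz iteration count. A sketch of that orthogonalization step, using the quintic coefficients from the standard public Muon implementation (the surrounding momentum/update logic is omitted):

```python
import torch

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a momentum matrix G via the quintic
    Newton-Schulz iteration used by Muon. steps=5 matches the PR's
    muon_backend_steps; coefficients are from the reference Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # scale so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

torch.manual_seed(0)
G = torch.randn(64, 32)
U = newton_schulz(G, steps=5)
```

After 5 steps the singular values of the output cluster near 1, which is the point: the update direction is normalized across all directions of the weight matrix.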
Regularization
weight decay
parameters: {"weight_decay":0.048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
stride-based eval
parameters: {"stride":512}
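Stride-based evaluation slides the 2048-token context window forward 512 tokens at a time and scores only the tokens not yet covered, so most tokens are evaluated with at least 2048 - 512 = 1536 tokens of history. A sketch of the window bookkeeping (the exact loop in the PR may differ):

```python
def strided_eval_windows(n_tokens, context=2048, stride=512):
    """Return (start, end, score_from) triples: each window spans
    [start, end) and only positions [start + score_from, end) are
    scored, so every token is scored exactly once."""
    windows, scored_to = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context, n_tokens)
        windows.append((start, end, scored_to - start))
        scored_to = end
        if end == n_tokens:
            break
    return windows

wins = strided_eval_windows(4096)
# first window scores all 2048 positions; later windows score 512 new ones
```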
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
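A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final steps. A sketch, assuming the flat-then-linear shape and using the PR's matrix_lr=0.03 and warmdown_iters=3000 as defaults (the total iteration count is a placeholder):

```python
def warmdown_lr(step, total_iters, warmdown_iters=3000, base_lr=0.03):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_iters steps."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

# e.g. with 10000 total iters: flat until 7000, zero at 10000
```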
Compression
zlib
level: null
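The int8 + zlib artifact path ties back to the weight-decay observation in the contributions: stronger decay shrinks weight magnitudes, narrowing the int8 histogram so zlib compresses further. A sketch assuming symmetric per-tensor quantization (the PR's exact scaling granularity is not stated):

```python
import zlib
import numpy as np

def int8_zlib(w):
    """Symmetric int8 quantization of a weight tensor, then zlib on the
    raw bytes. Returns the compressed blob and the dequantization scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    blob = zlib.compress(q.tobytes(), level=9)
    return blob, scale

rng = np.random.default_rng(0)
w = (rng.standard_normal((512, 512)) * 0.02).astype(np.float32)
blob, scale = int8_zlib(w)
print(len(blob) / w.nbytes)  # fraction of the original fp32 size
```

Round-tripping the blob and comparing against the original bounds the quantization error at half a quantization step.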
Initialization
overtone embedding init
Uses overtone embedding initialization with phase-transition residual mixing.
Other
other
Systematic hyperparameter optimization across 129 experiments to map scaling laws for learning rate, weight decay, batch size, and depth under single-GPU throughput constraints.
parameters: {"experiments":129,"total_compute_usd":19.47}
Novel Contributions
- Systematic hyperparameter optimization across 129 experiments on a single H100
- Mapped scaling laws for learning rate, weight decay, batch size, and model depth under throughput constraints
- BigramHash with 128-dimensional projection to reduce artifact size with minimal BPB loss
- Weight decay as a compression knob controlling int8+zlib artifact size
- Batch-size scaling on H100 showing a 131K-token batch outperforming a 65K-token batch