PR #393

closed

Non-record: 7L + BigramHash Projection + Batch Scaling (val_bpb=1.2417, 1xH100)

by CrimsonSithriaView on GitHub
val_bpb
1.2417
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5MB

Training Techniques

Architecture
BigramHash
BigramHash embedding with a linear projection to reduce artifact size while preserving quality.
parameters: {"buckets":8192,"projection_dim":128}
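A minimal sketch of a hashed bigram embedding with a low-rank projection, using the bucket count and projection dim above; the hashing scheme and model width are assumptions, not taken from the PR:

```python
import numpy as np

BUCKETS, PROJ_DIM, MODEL_DIM = 8192, 128, 768  # MODEL_DIM is an assumption

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, PROJ_DIM)).astype(np.float32) * 0.02
up_proj = rng.standard_normal((PROJ_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing hash; the PR does not specify the exact scheme.
    return ((prev_tok * 1000003) ^ tok) % BUCKETS

def bigram_embed(tokens: list) -> np.ndarray:
    # Look up a 128-dim vector per bigram, then project up to the model width.
    idx = [bigram_hash(p, t) for p, t in zip(tokens[:-1], tokens[1:])]
    return bigram_table[idx] @ up_proj  # shape: (len(tokens)-1, MODEL_DIM)

out = bigram_embed([5, 17, 42, 9])
```

The artifact saving comes from storing BUCKETS x 128 plus a 128 x MODEL_DIM projection instead of a full-width BUCKETS x MODEL_DIM table.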
RoPE
Rotary positional embeddings with optimized base for the target context length.
parameters: {"base":50000}
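A sketch of rotary embeddings with the base of 50000 listed above; the head dimension is an assumption:

```python
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 50000.0) -> np.ndarray:
    # x: (seq_len, head_dim) with head_dim even (head_dim is an assumption).
    seq_len, head_dim = x.shape
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # rotate each even/odd pair by a
    out[:, 1::2] = x1 * sin + x2 * cos  # position- and frequency-dependent angle
    return out

q = rope_rotate(np.ones((16, 64)))
```

A larger base stretches the lowest rotation frequencies, which is the usual reason to raise it for longer target context lengths.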
tied embeddings
Input and output embedding matrices are tied (shared) and stored in FP16.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
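With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A minimal sketch of the KV expansion step in grouped-query attention:

```python
import numpy as np

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS  # each KV head serves 2 query heads

def expand_kv(kv: np.ndarray) -> np.ndarray:
    # kv: (kv_heads, seq, head_dim) -> (heads, seq, head_dim),
    # repeating each KV head for the query heads in its group.
    return np.repeat(kv, GROUP, axis=0)

k = np.arange(4 * 3 * 2, dtype=np.float32).reshape(KV_HEADS, 3, 2)
k_full = expand_kv(k)
```

The KV cache and KV projection weights shrink by the group factor while the query side stays at full head count.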
MLP3x/MLP4x
Uses a 4x MLP expansion with relu^2 activation for throughput-constrained training.
parameters: {"mlp_multiplier":4}
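A sketch of the 4x MLP with relu^2 activation; the toy width and init scale are placeholders:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    # Squared ReLU: max(x, 0)^2
    return np.square(np.maximum(x, 0.0))

def mlp(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    # 4x expansion: w_in maps d -> 4d, w_out maps 4d -> d.
    return relu2(x @ w_in) @ w_out

d = 8  # toy model width
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 4 * d)).astype(np.float32) * 0.02
w_out = rng.standard_normal((4 * d, d)).astype(np.float32) * 0.02
y = mlp(rng.standard_normal((2, d)).astype(np.float32), w_in, w_out)
```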
Optimizer
Muon
weight_decay: 0.025
momentum: null
other_params: {"matrix_lr":0.035,"scalar_lr":0.035,"embed_lr":0.09,"grad_clip":0.3}
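Muon orthogonalizes each matrix gradient with a Newton-Schulz iteration before applying the update. A numpy sketch of the quintic iteration from the public Muon implementation (coefficients are from that reference code, not from this PR):

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Approximately orthogonalize G: push all singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # sv -> a*s + b*s^3 + c*s^5
    return X.T if transposed else X

rng = np.random.default_rng(0)
O = newton_schulz5(rng.standard_normal((16, 32)))
```

After a few steps the singular values land near 1 (the quintic only converges approximately, which is sufficient for the optimizer).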
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
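The warmdown schedule can be sketched as a constant LR followed by a linear decay to zero over the final 3500 steps; the peak LR and total step count below are placeholders, only `warmdown_iters` comes from the PR:

```python
def lr_at(step: int, total_steps: int, peak_lr: float,
          warmdown_iters: int = 3500) -> float:
    # Constant LR, then linear warmdown to 0 over the last warmdown_iters steps.
    if step < total_steps - warmdown_iters:
        return peak_lr
    frac = (total_steps - step) / warmdown_iters
    return peak_lr * frac

schedule = [lr_at(s, 10000, 0.035) for s in range(10000)]
```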
Regularization
weight decay
parameters: {"weight_decay":0.025}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
stride-based eval
parameters: {"stride":512}
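Stride-based eval slides a fixed-length window over the token stream and scores only the tokens not covered by the previous window, so each token is evaluated once with substantial left context. A sketch of the span bookkeeping, assuming the common sliding-window perplexity setup:

```python
def strided_eval_spans(n_tokens: int, max_length: int = 2048, stride: int = 512):
    # Returns (begin, end, trg_len) per window; only the last trg_len tokens
    # of each window contribute to the loss, the rest are context.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = strided_eval_spans(5000)
```

With stride 512 and window 2048, every scored token after the first window sees at least 1536 tokens of context.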
Initialization
overtone embedding init
Non-standard embedding initialization combined with phase-transition residual mixing.
phase-transition residual mixing
Residual mixing strategy used alongside overtone embedding initialization.
Compression
zlib
level: null
Other
other
Systematic hyperparameter optimization across 111 experiments to tune LR, WD, and batch size for single-GPU throughput-constrained training.
parameters: {"experiments":111}
other
Increased batch size to 131K tokens per step to improve performance on H100.
parameters: {"train_batch_tokens":131072}
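At the 2048-token sequence length above, 131072 tokens per step decomposes as follows; the micro-batch size is an assumption (tuned to fit VRAM), not stated in the PR:

```python
TRAIN_BATCH_TOKENS = 131072  # tokens per optimizer step (from the PR)
SEQ_LEN = 2048               # training sequence length (from the PR)
MICRO_BATCH = 16             # sequences per forward pass; an assumption

seqs_per_step = TRAIN_BATCH_TOKENS // SEQ_LEN    # 64 sequences per step
grad_accum_steps = seqs_per_step // MICRO_BATCH  # 4 accumulation steps
assert seqs_per_step * SEQ_LEN == TRAIN_BATCH_TOKENS
```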

Novel Contributions

  • Systematic hyperparameter optimization across 111 experiments on a single GPU
  • Hyperparameter scaling laws showing LR, weight decay, and batch size must co-scale with GPU speed and step count
  • Using 131K tokens per step as a major lever on fast GPUs
  • BigramHash dimension-128 projection to save artifact space with minimal BPB loss
  • Observation that higher weight decay improves int8+zlib compression by shrinking weight magnitudes
  • Identification of negative results for EMA, SWA, SmearGate, orthogonal initialization, and magnitude pruning in the short-training regime
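The compression observation above can be illustrated under one plausible mechanism: with a fixed quantization scale, smaller-magnitude weights map to a narrower band of int8 codes, which deflate encodes more compactly. The fixed-scale scheme below is an assumption (per-tensor absmax quantization would be invariant to uniform shrinkage):

```python
import zlib
import numpy as np

def int8_zlib_size(w: np.ndarray, scale: float) -> int:
    # Quantize with a fixed scale (an assumption), then deflate.
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes(), level=9))

rng = np.random.default_rng(0)
scale = 0.02 / 32.0  # fixed scale sized for ~0.02-std weights
w_base = rng.standard_normal(65536).astype(np.float32) * 0.02
w_decayed = w_base * 0.5  # as if trained with stronger weight decay

size_base = int8_zlib_size(w_base, scale)
size_decayed = int8_zlib_size(w_decayed, scale)  # smaller compressed artifact
```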