PR #393

closed

Non-record: 7L + BigramHash Projection + Batch Scaling (val_bpb=1.2417, 1xH100)

by CrimsonSithriaView on GitHub
val_bpb
1.2417
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5MB

Training Techniques

Architecture
BigramHash
BigramHash embedding with a linear projection to reduce artifact size while preserving quality.
parameters: {"buckets":8192,"projection_dim":128}
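A minimal sketch of a hashed bigram embedding with a low-rank projection, using the bucket count and projection dim above; the hashing scheme and model width are assumptions, not taken from the PR:

```python
import numpy as np

BUCKETS, PROJ_DIM, MODEL_DIM = 8192, 128, 768  # MODEL_DIM is an assumption

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, PROJ_DIM)).astype(np.float32) * 0.02
up_proj = rng.standard_normal((PROJ_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing hash; the PR does not specify the exact scheme.
    return ((prev_tok * 1000003) ^ tok) % BUCKETS

def bigram_embed(tokens: list) -> np.ndarray:
    # Look up a 128-dim vector per bigram, then project up to the model width.
    idx = [bigram_hash(p, t) for p, t in zip(tokens[:-1], tokens[1:])]
    return bigram_table[idx] @ up_proj  # shape: (len(tokens)-1, MODEL_DIM)

out = bigram_embed([5, 17, 42, 9])
```

The artifact saving comes from storing BUCKETS x 128 plus a 128 x MODEL_DIM projection instead of a full-width BUCKETS x MODEL_DIM table.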
RoPE
Rotary positional embeddings with optimized base for the target context length.
parameters: {"base":50000}
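A sketch of rotary embeddings with the base of 50000 listed above; the head dimension is an assumption:

```python
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 50000.0) -> np.ndarray:
    # x: (seq_len, head_dim) with head_dim even (head_dim is an assumption).
    seq_len, head_dim = x.shape
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # rotate each even/odd pair by a
    out[:, 1::2] = x1 * sin + x2 * cos  # position- and frequency-dependent angle
    return out

q = rope_rotate(np.ones((16, 64)))
```

A larger base stretches the lowest rotation frequencies, which is the usual reason to raise it for longer target context lengths.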
tied embeddings
Input and output embedding matrices are tied (shared) and stored in FP16.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
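With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A minimal sketch of the KV expansion step in grouped-query attention:

```python
import numpy as np

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS  # each KV head serves 2 query heads

def expand_kv(kv: np.ndarray) -> np.ndarray:
    # kv: (kv_heads, seq, head_dim) -> (heads, seq, head_dim),
    # repeating each KV head for the query heads in its group.
    return np.repeat(kv, GROUP, axis=0)

k = np.arange(4 * 3 * 2, dtype=np.float32).reshape(KV_HEADS, 3, 2)
k_full = expand_kv(k)
```

The KV cache and KV projection weights shrink by the group factor while the query side stays at full head count.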
MLP3x/MLP4x
Uses a 4x MLP expansion with relu^2 activation for throughput-constrained training.
parameters: {"mlp_multiplier":4}
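A sketch of the 4x MLP with relu^2 activation; the toy width and init scale are placeholders:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    # Squared ReLU: max(x, 0)^2
    return np.square(np.maximum(x, 0.0))

def mlp(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    # 4x expansion: w_in maps d -> 4d, w_out maps 4d -> d.
    return relu2(x @ w_in) @ w_out

d = 8  # toy model width
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 4 * d)).astype(np.float32) * 0.02
w_out = rng.standard_normal((4 * d, d)).astype(np.float32) * 0.02
y = mlp(rng.standard_normal((2, d)).astype(np.float32), w_in, w_out)
```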
Optimizer
Muon
weight_decay: 0.025
momentum: null
other_params: {"matrix_lr":0.035,"scalar_lr":0.035,"embed_lr":0.09,"grad_clip":0.3}
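Muon orthogonalizes each matrix gradient with a Newton-Schulz iteration before applying the update. A numpy sketch of the quintic iteration from the public Muon implementation (coefficients are from that reference code, not from this PR):

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Approximately orthogonalize G: push all singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # sv -> a*s + b*s^3 + c*s^5
    return X.T if transposed else X

rng = np.random.default_rng(0)
O = newton_schulz5(rng.standard_normal((16, 32)))
```

After a few steps the singular values land near 1 (the quintic only converges approximately, which is sufficient for the optimizer).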
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
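The warmdown schedule can be sketched as a constant LR followed by a linear decay to zero over the final 3500 steps; the peak LR and total step count below are placeholders, only `warmdown_iters` comes from the PR:

```python
def lr_at(step: int, total_steps: int, peak_lr: float,
          warmdown_iters: int = 3500) -> float:
    # Constant LR, then linear warmdown to 0 over the last warmdown_iters steps.
    if step < total_steps - warmdown_iters:
        return peak_lr
    frac = (total_steps - step) / warmdown_iters
    return peak_lr * frac

schedule = [lr_at(s, 10000, 0.035) for s in range(10000)]
```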
Regularization
weight decay
parameters: {"weight_decay":0.025}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
stride-based eval
parameters: {"stride":512}
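Stride-based eval slides a fixed-length window over the token stream and scores only the tokens not covered by the previous window, so each token is evaluated once with substantial left context. A sketch of the span bookkeeping, assuming the common sliding-window perplexity setup:

```python
def strided_eval_spans(n_tokens: int, max_length: int = 2048, stride: int = 512):
    # Returns (begin, end, trg_len) per window; only the last trg_len tokens
    # of each window contribute to the loss, the rest are context.
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = strided_eval_spans(5000)
```

With stride 512 and window 2048, every scored token after the first window sees at least 1536 tokens of context.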
Initialization
overtone embedding init
Non-standard embedding initialization combined with phase-transition residual mixing.
phase-transition residual mixing
Residual mixing strategy used alongside overtone embedding initialization.
Compression
zlib
level: null
Other
other
Systematic hyperparameter optimization across 111 experiments to tune LR, WD, and batch size for single-GPU throughput-constrained training.
parameters: {"experiments":111}
other
Increased batch size to 131K tokens per step to improve performance on H100.
parameters: {"train_batch_tokens":131072}
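At the 2048-token sequence length above, 131072 tokens per step decomposes as follows; the micro-batch size is an assumption (tuned to fit VRAM), not stated in the PR:

```python
TRAIN_BATCH_TOKENS = 131072  # tokens per optimizer step (from the PR)
SEQ_LEN = 2048               # training sequence length (from the PR)
MICRO_BATCH = 16             # sequences per forward pass; an assumption

seqs_per_step = TRAIN_BATCH_TOKENS // SEQ_LEN    # 64 sequences per step
grad_accum_steps = seqs_per_step // MICRO_BATCH  # 4 accumulation steps
assert seqs_per_step * SEQ_LEN == TRAIN_BATCH_TOKENS
```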

Novel Contributions

  • Systematic hyperparameter optimization across 111 experiments on a single GPU
  • Hyperparameter scaling laws showing LR, weight decay, and batch size must co-scale with GPU speed and step count
  • Using 131K tokens per step as a major lever on fast GPUs
  • BigramHash dimension-128 projection to save artifact space with minimal BPB loss
  • Observation that higher weight decay improves int8+zlib compression by shrinking weight magnitudes
  • Identification of negative results for EMA, SWA, SmearGate, orthogonal initialization, and magnitude pruning in the short-training regime
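The compression observation above can be illustrated under one plausible mechanism: with a fixed quantization scale, smaller-magnitude weights map to a narrower band of int8 codes, which deflate encodes more compactly. The fixed-scale scheme below is an assumption (per-tensor absmax quantization would be invariant to uniform shrinkage):

```python
import zlib
import numpy as np

def int8_zlib_size(w: np.ndarray, scale: float) -> int:
    # Quantize with a fixed scale (an assumption), then deflate.
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes(), level=9))

rng = np.random.default_rng(0)
scale = 0.02 / 32.0  # fixed scale sized for ~0.02-std weights
w_base = rng.standard_normal(65536).astype(np.float32) * 0.02
w_decayed = w_base * 0.5  # as if trained with stronger weight decay

size_base = int8_zlib_size(w_base, scale)
size_decayed = int8_zlib_size(w_decayed, scale)  # smaller compressed artifact
```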