PR #436
Non-Record: 8L + BigramHash(12288) + Systematic HyperOpt (val_bpb=1.2392, 1xH100, 129 experiments)
Status: open, by CrimsonSithria
val_bpb: 1.2392
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB
Training Techniques
Architecture
BigramHash
Uses BigramHash with a 12288-bucket embedding and a 128-dimensional linear projection to reduce artifact size while preserving quality.
parameters: {"buckets":12288,"dim":128}
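A minimal sketch of how such a bigram-hash embedding could work, assuming a simple multiplicative hash over (previous token, current token) pairs; the mixing constant, model width, and vocabulary size below are illustrative, not from the PR:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (prev, cur) token pair into one of `buckets` slots,
    embed at a small dimension, then project up to the model width.
    The 12288 x 128 table plus a 128 -> d_model projection is far
    smaller than a dense vocab^2 bigram table."""
    def __init__(self, buckets=12288, dim=128, d_model=768):
        super().__init__()
        self.buckets = buckets
        self.embed = nn.Embedding(buckets, dim)          # 12288 x 128
        self.proj = nn.Linear(dim, d_model, bias=False)  # 128 -> d_model

    def forward(self, tokens):  # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real bigram at the first position
        # illustrative hash; the PR's exact mixing scheme is not given
        h = (prev * 1000003 + tokens) % self.buckets
        return self.proj(self.embed(h))

x = torch.randint(0, 50257, (2, 16))
out = BigramHashEmbedding()(x)
print(out.shape)  # torch.Size([2, 16, 768])
```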
RoPE
Uses rotary positional embeddings with an optimized base for 2048 context.
parameters: {"base":50000}
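A sketch of rotary embeddings with the PR's base of 50000 (vs. the common default of 10000); the pairwise-rotation layout below is one standard formulation, not necessarily the PR's exact code:

```python
import torch

def rope_freqs(dim, seq_len, base=50000.0):
    """Precompute rotary angle tables; base=50000 is the tuned value
    reported for the 2048-token context."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)  # (seq_len, dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate consecutive channel pairs of x: (..., seq_len, dim)."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

cos, sin = rope_freqs(64, 2048)
q = torch.randn(1, 2048, 64)
q_rot = apply_rope(q, cos, sin)
```

Since each position only rotates channel pairs, the per-position norm is preserved, which is a quick sanity check on an implementation.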
tied embeddings
Ties the input embedding and output projection to one shared matrix, stored in FP16.
parameters: null
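Weight tying is a one-line sharing of the embedding matrix with the output head; a minimal sketch (the FP16 cast is omitted here for a CPU-friendly demo, and the vocab/width values are placeholders):

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Tied embeddings: the LM head reuses the input embedding matrix,
    halving embedding parameter count and artifact size."""
    def __init__(self, vocab_size=50304, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share one Parameter

m = TiedLMHead()
```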
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
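With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A sketch of that grouping, using PyTorch's fused attention kernel (head dim and shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: q has `heads` heads, k/v have `kv_heads`
    heads that are repeated to cover heads // kv_heads query heads each."""
    B, T, _ = q.shape
    hd = q.shape[-1] // heads
    q = q.view(B, T, heads, hd).transpose(1, 2)     # (B, 8, T, hd)
    k = k.view(B, T, kv_heads, hd).transpose(1, 2)  # (B, 4, T, hd)
    v = v.view(B, T, kv_heads, hd).transpose(1, 2)
    rep = heads // kv_heads                          # 2 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)              # (B, 8, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, heads * hd)

q = torch.randn(2, 16, 8 * 64)
k = torch.randn(2, 16, 4 * 64)  # KV projections are half the size
v = torch.randn(2, 16, 4 * 64)
out = gqa(q, k, v)
```

The practical win is that the KV cache and KV projection weights shrink by heads/kv_heads = 2x.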
MLP
3x/4x MLP
Uses 4x MLP expansion with relu^2 activation for throughput-limited training.
parameters: {"mlp_mult":4}
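The MLP block itself is compact; a sketch with the 4x expansion and relu^2 activation (relu(x)**2 is cheap and kernel-friendly relative to GELU, which is why it suits throughput-limited runs). The model width is a placeholder:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """d_model -> 4*d_model -> d_model MLP with relu^2 activation."""
    def __init__(self, d_model=512, mlp_mult=4):
        super().__init__()
        self.fc_in = nn.Linear(d_model, mlp_mult * d_model, bias=False)
        self.fc_out = nn.Linear(mlp_mult * d_model, d_model, bias=False)

    def forward(self, x):
        return self.fc_out(torch.relu(self.fc_in(x)) ** 2)

y = ReluSquaredMLP()(torch.randn(2, 8, 512))
```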
Optimizer
Muon
weight_decay: 0.048
momentum: null
other_params: {"matrix_lr":0.03,"scalar_lr":0.03,"tied_embed_lr":0.08,"grad_clip_norm":0.3,"muon_backend_steps":5}
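The `muon_backend_steps: 5` setting refers to Muon's Newton-Schulz iteration count. A sketch of that orthogonalization step, using the quintic coefficients from the standard public Muon implementation (the surrounding momentum/update logic is omitted):

```python
import torch

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a momentum matrix G via the quintic
    Newton-Schulz iteration used by Muon. steps=5 matches the PR's
    muon_backend_steps; coefficients are from the reference Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # scale so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

torch.manual_seed(0)
G = torch.randn(64, 32)
U = newton_schulz(G, steps=5)
```

After 5 steps the singular values of the output cluster near 1, which is the point: the update direction is normalized across all directions of the weight matrix.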
Regularization
weight decay
parameters: {"weight_decay":0.048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
stride-based eval
parameters: {"stride":512}
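Stride-based evaluation slides the 2048-token context window forward 512 tokens at a time and scores only the tokens not yet covered, so most tokens are evaluated with at least 2048 - 512 = 1536 tokens of history. A sketch of the window bookkeeping (the exact loop in the PR may differ):

```python
def strided_eval_windows(n_tokens, context=2048, stride=512):
    """Return (start, end, score_from) triples: each window spans
    [start, end) and only positions [start + score_from, end) are
    scored, so every token is scored exactly once."""
    windows, scored_to = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + context, n_tokens)
        windows.append((start, end, scored_to - start))
        scored_to = end
        if end == n_tokens:
            break
    return windows

wins = strided_eval_windows(4096)
# first window scores all 2048 positions; later windows score 512 new ones
```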
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
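A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final steps. A sketch, assuming the flat-then-linear shape and using the PR's matrix_lr=0.03 and warmdown_iters=3000 as defaults (the total iteration count is a placeholder):

```python
def warmdown_lr(step, total_iters, warmdown_iters=3000, base_lr=0.03):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_iters steps."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

# e.g. with 10000 total iters: flat until 7000, zero at 10000
```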
Compression
zlib
level: null
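The int8 + zlib artifact path ties back to the weight-decay observation in the contributions: stronger decay shrinks weight magnitudes, narrowing the int8 histogram so zlib compresses further. A sketch assuming symmetric per-tensor quantization (the PR's exact scaling granularity is not stated):

```python
import zlib
import numpy as np

def int8_zlib(w):
    """Symmetric int8 quantization of a weight tensor, then zlib on the
    raw bytes. Returns the compressed blob and the dequantization scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    blob = zlib.compress(q.tobytes(), level=9)
    return blob, scale

rng = np.random.default_rng(0)
w = (rng.standard_normal((512, 512)) * 0.02).astype(np.float32)
blob, scale = int8_zlib(w)
print(len(blob) / w.nbytes)  # fraction of the original fp32 size
```

Round-tripping the blob and comparing against the original bounds the quantization error at half a quantization step.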
Initialization
overtone embedding init
Uses overtone embedding initialization with phase-transition residual mixing.
Other
other
Systematic hyperparameter optimization across 129 experiments to map scaling laws for learning rate, weight decay, batch size, and depth under single-GPU throughput constraints.
parameters: {"experiments":129,"total_compute_usd":19.47}
Novel Contributions
- Systematic hyperparameter optimization across 129 experiments on a single H100
- Mapped scaling laws for learning rate, weight decay, batch size, and model depth under throughput constraints
- BigramHash with 128-dimensional projection to reduce artifact size with minimal BPB loss
- Weight decay as a compression knob controlling int8+zlib artifact size
- Batch-size scaling on H100 showing a 131K-token batch outperforming a 65K-token batch