PR #373 (closed)

Record: SwiGLU + BigramHash + SWA, val_bpb=1.1634 (8xH100 verified)

by JoeProAI

val_bpb: 1.1634
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.1 MB

Training Techniques

Architecture
  • SwiGLU — replaced the relu(x).square() FFN activation with SwiGLU.
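The swap is from a squared-ReLU FFN to a gated SiLU FFN. A minimal sketch, assuming the standard SwiGLU form silu(x @ W_gate) * (x @ W_up); the two projections are represented here as precomputed vectors, and the names W_gate/W_up are illustrative, not from the PR:

```python
import math

def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Elementwise SwiGLU: silu(gate) * up.

    `gate` and `up` stand in for the two FFN input projections
    x @ W_gate and x @ W_up (illustrative names).
    """
    return [silu(g) * u for g, u in zip(gate, up)]
```

Unlike relu(x).square(), the SiLU gate keeps a smooth, nonzero gradient for mildly negative pre-activations.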
  • BigramHash — used BigramHash embeddings instead of standard token embeddings (buckets: 4096, dim: 128).
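The PR gives only buckets=4096 and dim=128. A plausible sketch, assuming adjacent token-id pairs are hashed into buckets that index a shared embedding table; the blake2b hash and the BOS id of 0 for the first position are assumptions, not details from the PR:

```python
import hashlib

BUCKETS, DIM = 4096, 128  # from the PR's parameters

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Hash a (previous, current) token-id pair into one of BUCKETS buckets."""
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big") % BUCKETS

def bigram_embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up the bucket embedding for each position's bigram.

    `table` is a BUCKETS x DIM table; position 0 pairs the first token
    with an assumed BOS id of 0.
    """
    out, prev = [], 0
    for t in token_ids:
        out.append(table[bigram_bucket(prev, t)])
        prev = t
    return out
```

Hashing bigrams into a fixed bucket count keeps the table small (4096 x 128) regardless of vocabulary size, at the cost of collisions.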
  • Tied embeddings — kept token embedding weights tied to the output head.
  • KV head count — 10 layers, dim 512, 8 attention heads, 4 KV heads.
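With 8 query heads over 4 KV heads, pairs of adjacent query heads share one KV head (grouped-query attention). A tiny sketch of the assumed head mapping; the PR does not spell out the grouping:

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to the KV head it shares under grouped-query attention."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size
```

Halving the KV heads halves the KV cache while leaving the query-side capacity at 8 heads of dim 512 / 8 = 64 each.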
Optimizer
  • Muon — weight_decay: 0.02 (momentum and other parameters not reported).
Weight Averaging
  • SWA — every_steps: 50, start_fraction: 0.5.
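A sketch of the averaging rule implied by the parameters: starting at 50% of training, fold the current weights into a running average every 50 steps. The equal-weight average is an assumption, and weights are plain Python lists for illustration:

```python
def swa_step(step: int, total_steps: int,
             every_steps: int = 50, start_fraction: float = 0.5) -> bool:
    """True on steps where the SWA average should absorb the current weights."""
    start = int(total_steps * start_fraction)
    return step >= start and (step - start) % every_steps == 0

class RunningAverage:
    """Equal-weight running average of flat parameter vectors."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params: list[float]) -> None:
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # Incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]
```

The averaged weights, not the last-step weights, would then be the exported artifact.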
Quantization
  • int6 — bits: 6, scope: all.
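The PR states only 6-bit quantization over all weights. A sketch assuming symmetric per-tensor quantization with an absmax scale (the scaling scheme is an assumption); signed 6-bit values span [-32, 31]:

```python
def quantize_int6(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int6 quantization: q = clamp(round(w / scale))."""
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid 0 scale for all-zero tensors
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```

At 6 bits plus a per-tensor scale, the raw weight payload shrinks to roughly 6/16 of an FP16 checkpoint before zstd.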
Compression
  • zstd — level: 22.
Evaluation
  • Sliding window eval — stride: 64.
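A sketch of strided evaluation, assuming the common scheme: each window slides forward by the stride and only tokens not yet scored are counted, so every token after the first window is evaluated with near-full context. window=960 here matches eval_length and stride=64 matches the PR's eval stride:

```python
def sliding_windows(n_tokens: int, window: int = 960, stride: int = 64):
    """Return (begin, end, score_from) spans; tokens [score_from, end) are scored.

    The first window scores everything it sees; later windows re-read
    window - stride context tokens and score only the new ones.
    """
    spans, prev_end, begin = [], 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

Every token is scored exactly once, but most are scored with up to 896 tokens of context instead of the short context a disjoint-chunk eval would give.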
Sequence Length
  • train_length: not reported; eval_length: 960.
LR Schedule
  • Warmdown — warmdown_iters: 3600.
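The PR gives only warmdown_iters=3600. A sketch assuming the usual shape: hold the base learning rate, then decay linearly to zero over the final warmdown steps (the linear shape and `base_lr` are assumptions):

```python
def warmdown_lr(step: int, total_iters: int, base_lr: float,
                warmdown_iters: int = 3600) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```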
Regularization
  • Weight decay — weight_decay: 0.02.
Other
  • FP16 embedding passthrough during quantization to reduce post-quantization degradation.
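A sketch of the passthrough idea: leave embedding tensors in FP16 while quantizing everything else. Selecting parameters by an "embed" name substring and the `quantize_fn` hook are assumptions about how the passthrough is wired up:

```python
def pack_state_dict(named_params: dict, quantize_fn) -> dict:
    """Tag each tensor with its storage format: embeddings pass through in FP16."""
    out = {}
    for name, tensor in named_params.items():
        if "embed" in name:
            out[name] = ("fp16", tensor)            # passthrough, no quantization
        else:
            out[name] = ("int6", quantize_fn(tensor))
    return out
```

Keeping embeddings at full precision spends a few extra bits where 6-bit rounding hurts most, which is consistent with the stated goal of reducing post-quantization degradation.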

Novel Contributions

  • SwiGLU FFN activation discovered via automated search/GEPA
  • BigramHash embeddings with 4096 buckets and 128-dimensional embeddings
  • Stochastic Weight Averaging every 50 steps starting from 50% of training
  • FP16 embedding passthrough during quantization to reduce degradation
  • Sliding window evaluation with stride 64 for richer validation context
  • Warmdown and learning-rate tuning for the 10-minute wall-clock budget