PR #373 (closed)

Record: SwiGLU + BigramHash + SWA, val_bpb=1.1634 (8xH100 verified)

by JoeProAI

val_bpb: 1.1634
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.1 MB

Training Techniques

Architecture
  • SwiGLU — replaced the relu(x).square() FFN activation with SwiGLU.
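The swap is from a squared-ReLU FFN to a gated SiLU FFN. A minimal sketch, assuming the standard SwiGLU form silu(x @ W_gate) * (x @ W_up); the two projections are represented here as precomputed vectors, and the names W_gate/W_up are illustrative, not from the PR:

```python
import math

def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Elementwise SwiGLU: silu(gate) * up.

    `gate` and `up` stand in for the two FFN input projections
    x @ W_gate and x @ W_up (illustrative names).
    """
    return [silu(g) * u for g, u in zip(gate, up)]
```

Unlike relu(x).square(), the SiLU gate keeps a smooth, nonzero gradient for mildly negative pre-activations.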
  • BigramHash — used BigramHash embeddings instead of standard token embeddings (buckets: 4096, dim: 128).
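The PR gives only buckets=4096 and dim=128. A plausible sketch, assuming adjacent token-id pairs are hashed into buckets that index a shared embedding table; the blake2b hash and the BOS id of 0 for the first position are assumptions, not details from the PR:

```python
import hashlib

BUCKETS, DIM = 4096, 128  # from the PR's parameters

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Hash a (previous, current) token-id pair into one of BUCKETS buckets."""
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big") % BUCKETS

def bigram_embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up the bucket embedding for each position's bigram.

    `table` is a BUCKETS x DIM table; position 0 pairs the first token
    with an assumed BOS id of 0.
    """
    out, prev = [], 0
    for t in token_ids:
        out.append(table[bigram_bucket(prev, t)])
        prev = t
    return out
```

Hashing bigrams into a fixed bucket count keeps the table small (4096 x 128) regardless of vocabulary size, at the cost of collisions.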
  • Tied embeddings — kept token embedding weights tied to the output head.
  • KV head count — 10 layers, dim 512, 8 attention heads, 4 KV heads.
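With 8 query heads over 4 KV heads, pairs of adjacent query heads share one KV head (grouped-query attention). A tiny sketch of the assumed head mapping; the PR does not spell out the grouping:

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to the KV head it shares under grouped-query attention."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size
```

Halving the KV heads halves the KV cache while leaving the query-side capacity at 8 heads of dim 512 / 8 = 64 each.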
Optimizer
  • Muon — weight_decay: 0.02 (momentum and other parameters not reported).
Weight Averaging
  • SWA — every_steps: 50, start_fraction: 0.5.
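A sketch of the averaging rule implied by the parameters: starting at 50% of training, fold the current weights into a running average every 50 steps. The equal-weight average is an assumption, and weights are plain Python lists for illustration:

```python
def swa_step(step: int, total_steps: int,
             every_steps: int = 50, start_fraction: float = 0.5) -> bool:
    """True on steps where the SWA average should absorb the current weights."""
    start = int(total_steps * start_fraction)
    return step >= start and (step - start) % every_steps == 0

class RunningAverage:
    """Equal-weight running average of flat parameter vectors."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params: list[float]) -> None:
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # Incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]
```

The averaged weights, not the last-step weights, would then be the exported artifact.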
Quantization
  • int6 — bits: 6, scope: all.
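The PR states only 6-bit quantization over all weights. A sketch assuming symmetric per-tensor quantization with an absmax scale (the scaling scheme is an assumption); signed 6-bit values span [-32, 31]:

```python
def quantize_int6(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int6 quantization: q = clamp(round(w / scale))."""
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid 0 scale for all-zero tensors
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```

At 6 bits plus a per-tensor scale, the raw weight payload shrinks to roughly 6/16 of an FP16 checkpoint before zstd.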
Compression
  • zstd — level: 22.
Evaluation
  • Sliding window eval — stride: 64.
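A sketch of strided evaluation, assuming the common scheme: each window slides forward by the stride and only tokens not yet scored are counted, so every token after the first window is evaluated with near-full context. window=960 here matches eval_length and stride=64 matches the PR's eval stride:

```python
def sliding_windows(n_tokens: int, window: int = 960, stride: int = 64):
    """Return (begin, end, score_from) spans; tokens [score_from, end) are scored.

    The first window scores everything it sees; later windows re-read
    window - stride context tokens and score only the new ones.
    """
    spans, prev_end, begin = [], 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

Every token is scored exactly once, but most are scored with up to 896 tokens of context instead of the short context a disjoint-chunk eval would give.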
Sequence Length
  • train_length: not reported; eval_length: 960.
LR Schedule
  • Warmdown — warmdown_iters: 3600.
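The PR gives only warmdown_iters=3600. A sketch assuming the usual shape: hold the base learning rate, then decay linearly to zero over the final warmdown steps (the linear shape and `base_lr` are assumptions):

```python
def warmdown_lr(step: int, total_iters: int, base_lr: float,
                warmdown_iters: int = 3600) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```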
Regularization
  • Weight decay — weight_decay: 0.02.
Other
  • FP16 embedding passthrough during quantization to reduce post-quantization degradation.
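A sketch of the passthrough idea: leave embedding tensors in FP16 while quantizing everything else. Selecting parameters by an "embed" name substring and the `quantize_fn` hook are assumptions about how the passthrough is wired up:

```python
def pack_state_dict(named_params: dict, quantize_fn) -> dict:
    """Tag each tensor with its storage format: embeddings pass through in FP16."""
    out = {}
    for name, tensor in named_params.items():
        if "embed" in name:
            out[name] = ("fp16", tensor)            # passthrough, no quantization
        else:
            out[name] = ("int6", quantize_fn(tensor))
    return out
```

Keeping embeddings at full precision spends a few extra bits where 6-bit rounding hurts most, which is consistent with the stated goal of reducing post-quantization degradation.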

Novel Contributions

  • SwiGLU FFN activation discovered via automated search/GEPA
  • BigramHash embeddings with 4096 buckets and 128-dimensional embeddings
  • Stochastic Weight Averaging every 50 steps starting from 50% of training
  • FP16 embedding passthrough during quantization to reduce degradation
  • Sliding window evaluation with stride 64 for richer validation context
  • Warmdown and learning-rate tuning for the 10-minute wall-clock budget