val_bpb: 1.1448
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.6 MB
Training Techniques
Architecture
TrigramHash
Hash-based trigram embedding that XOR-hashes 3 consecutive token IDs into 2048 buckets and projects the bucket embedding to the model dimension.
parameters: {"vocab_size":2048,"trigram_dim":48,"project_dim":512}
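The bucketing step can be sketched as follows. Only the XOR combination and the 2048-bucket target come from the card; the specific mixing primes and hash layout are illustrative assumptions:

```python
def trigram_bucket(t0: int, t1: int, t2: int, num_buckets: int = 2048) -> int:
    """Map a trigram of token IDs to one of `num_buckets` embedding slots.

    Each token ID is mixed with a distinct prime multiplier, the three
    results are XOR-combined, and the hash is reduced modulo the bucket
    count. The primes here are placeholders, not the trained model's.
    """
    h = (t0 * 0x9E3779B1) ^ (t1 * 0x85EBCA77) ^ (t2 * 0xC2B2AE3D)
    return h % num_buckets
```

Each bucket would then own a learned 48-dim vector (trigram_dim) that a linear layer projects to the 512-dim model width (project_dim).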
LeakyReLU
Uses a squared LeakyReLU, LeakyReLU(x; 0.5)^2, as the MLP activation; the 0.5 negative slope preserves gradient flow for negative inputs.
parameters: {"negative_slope":0.5}
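A minimal scalar sketch of this activation (the card specifies only the negative slope of 0.5 and the squaring):

```python
def sq_leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with slope 0.5 on the negative side, then squared.

    Squaring folds negative pre-activations back to positive values,
    but unlike plain ReLU^2 the nonzero slope keeps the gradient
    alive for x < 0.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```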
Quantization
GPTQ-lite
bits: 6
scope: model weights
Compression
lzma
level: null
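The compression stage is likely just Python's stdlib `lzma` over the packed weight bytes; `level: null` presumably means the library's default preset, which is what this sketch uses:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the packed quantized-weight stream with LZMA at the
    default preset (an assumption, since the card lists level: null)."""
    return lzma.compress(raw)
```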
Evaluation
sliding window eval
parameters: {"stride":64}
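With stride 64 and a 2048-token window, evaluation windows advance 64 tokens at a time, so each scored position sees close to the full context. A sketch of the window enumeration (the exact scoring bookkeeping is an assumption):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Enumerate (start, end) evaluation windows over a token stream.

    Each window is `window` tokens long and advances by `stride`;
    typically only the final `stride` tokens of each window are scored,
    so every token is evaluated with near-full left context.
    """
    return [(start, start + window)
            for start in range(0, n_tokens - window + 1, stride)]
```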
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997,"swa_interval_steps":50}
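A scalar-parameter sketch of the combined scheme. The decay of 0.997 and the 50-step SWA interval come from the card; averaging EMA snapshots (rather than raw weights) and the class shape are assumptions:

```python
class EmaSwa:
    """Maintain an EMA of the parameters and, every `swa_interval`
    steps, fold the current EMA into a running SWA average."""

    def __init__(self, params, decay=0.997, swa_interval=50):
        self.decay = decay
        self.swa_interval = swa_interval
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.swa_interval == 0:
            self.swa_sum = [s + e for s, e in zip(self.swa_sum, self.ema)]
            self.swa_count += 1

    def averaged(self):
        """SWA average of EMA snapshots; falls back to the EMA if no
        snapshot has been taken yet."""
        if self.swa_count == 0:
            return list(self.ema)
        return [s / self.swa_count for s in self.swa_sum]
```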
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
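The momentum warmup can be sketched as below. The card gives only the endpoints (0.92 to 0.99) and the 1500-step length; linear interpolation between them is an assumption:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Ramp Muon's momentum from `start` to `final` over the first
    `warmup_steps` optimizer steps, then hold it constant."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```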
Regularization
LN scale
parameters: {"schedule":"1/sqrt(layer+1)"}
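The per-layer scale schedule from the card, as a one-liner; whether the scale is a fixed multiplier or an initialization for the LayerNorm gain is not specified, so this only shows the formula:

```python
import math

def ln_scale(layer_index: int) -> float:
    """LayerNorm scale 1/sqrt(layer+1): layer 0 gets 1.0 and deeper
    layers get progressively smaller scales, damping their residual
    contributions."""
    return 1.0 / math.sqrt(layer_index + 1)
```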
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
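A warmdown schedule holds the LR constant and decays it only at the end of training. The 3500-step warmdown length is from the card; the linear shape of the decay is an assumption:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Return the LR multiplier: 1.0 until the final `warmdown_steps`,
    then a linear ramp down to 0.0 at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```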
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Uses gradient accumulation scaled by world size to keep effective batch size constant across 1-GPU and 8-GPU runs.
parameters: {"grad_accum_formula":"8 // world_size"}
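The formula from the card, spelled out: with a fixed per-GPU micro-batch, accumulating `8 // world_size` micro-batches per step keeps the effective batch size (per-GPU batch × world_size × accumulation steps) identical on 1 and 8 GPUs:

```python
def grad_accum_steps(world_size: int, base: int = 8) -> int:
    """Gradient accumulation steps per optimizer step, scaled so that
    world_size * accum_steps is constant (= base) across GPU counts."""
    return base // world_size
```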
Novel Contributions
- TrigramHashEmbedding extending BigramHash to 3-token context
- XOR prime hashing of trigrams into 2048 buckets
- LeakyReLU(0.5)^2 MLP activation
- Proportional wallclock validation on 1×H100 to match 8×H100 training trajectory
- EMA + Tight SWA with GPTQ-lite int6 and LZMA compression