PR #1256
Status: open
Non-record: LeakyReLU(0.5)^2 on SmearGate + BigramHash + Int6 stack (1.1444 bpb)
by oidebrett
val_bpb
1.1444
Architecture
Transformer
Optimizer
Muon
Artifact Size
~16.0 MB
Training Techniques
Architecture
SmearGate
Uses SmearGate in the model stack.
parameters: null
BigramHash
Adds BigramHash embeddings.
parameters: {"dimensions":128,"hash_size":4096}
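A minimal sketch of what a BigramHash embedding lookup could look like. The dimensions (128) and table size (4096) come from the parameters above; the hashing scheme and mixing constant are illustrative assumptions, not taken from the PR.

```python
def bigram_hash(prev_tok: int, cur_tok: int, hash_size: int = 4096) -> int:
    # Hash the (previous, current) token pair into one of hash_size
    # buckets; each bucket would index a learned 128-dim embedding that
    # is added to the usual token embedding. The multiplier 1000003 is
    # an arbitrary mixing prime, not a value from the PR.
    return (prev_tok * 1000003 + cur_tok) % hash_size
```

The table lets the model memorize frequent token bigrams cheaply, at the cost of hash collisions between rare pairs.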
MLP3x
Uses a 3x MLP expansion.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of ReLU^2 in the MLP activation.
parameters: {"negative_slope":0.5}
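A minimal sketch of the activation, assuming it is the plain square of LeakyReLU with slope 0.5 (the PR does not show the exact formulation). Note that squaring makes the output non-negative, so negative inputs contribute 0.25·x² rather than 0 as in ReLU².

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU(0.5) followed by squaring: positive inputs give x^2,
    # negative inputs give (0.5 * x)^2 = 0.25 * x^2 instead of the
    # hard zero that ReLU^2 would produce.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```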
Quantization
QAT
bits: 6
scope: all
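A minimal sketch of the forward pass of 6-bit quantization-aware training, assuming a symmetric per-tensor scheme (the PR does not state the exact quantizer). In real QAT the backward pass would use a straight-through estimator, which is not shown here.

```python
def fake_quant(w: list[float], bits: int = 6) -> list[float]:
    # Symmetric per-tensor fake quantization: snap each value to a
    # signed 6-bit grid (levels -31..31) in the forward pass.
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    amax = max(abs(x) for x in w)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(x / scale) * scale for x in w]
```

With `scope: all`, a scheme like this would be applied to every weight tensor, which is what makes the ~16 MB artifact size possible.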
Weight Averaging
SWA
parameters: {"checkpoints_averaged":30}
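A minimal sketch of the checkpoint averaging step, assuming a uniform (equal-weight) average over the tail checkpoints; the PR's parameters say 30 checkpoints are averaged.

```python
def average_checkpoints(checkpoints: list[dict]) -> dict:
    # Stochastic Weight Averaging: take the element-wise mean of each
    # parameter across the saved checkpoints (here plain floats stand
    # in for weight tensors).
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```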
Compression
zstd
level: 22
Initialization
OrthoInit
Orthogonal initialization.
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64}
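A minimal sketch of how sliding-window evaluation with stride 64 could partition a token stream. The window length (here tied to the 2048 training length) and the convention of scoring only tokens not covered by a previous window are assumptions; only the stride comes from the parameters above.

```python
def sliding_windows(n_tokens: int, max_length: int = 2048, stride: int = 64):
    # Feed overlapping windows of up to max_length tokens, advancing
    # stride tokens per step; loss is computed only on the tokens each
    # window sees for the first time, so every token gets near-full
    # left context.
    spans = []          # (window_start, window_end, n_new_targets)
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```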
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Replaced ReLU^2 with LeakyReLU(0.5)^2 in the MLP activation
- Built on the SmearGate + BigramHash + Int6 QAT + SWA stack
- Reported a small improvement over the base stack (val_bpb 1.1459 → 1.1444)