val_bpb: 0.5588
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,302,060 bytes
Training Techniques
Architecture
LeakyReLU
Replaced ReLU(x)^2 with LeakyReLU(x, 0.5)^2 in all MLP blocks to avoid dead neurons while keeping outputs non-negative through squaring.
parameters: {"negative_slope":0.5}
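A minimal sketch of the activation swap described above (pure Python; the model presumably applies this elementwise over tensors):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU: non-negative like ReLU(x)**2, but negative
    inputs still produce output (and gradient), avoiding dead neurons."""
    y = x if x > 0 else negative_slope * x
    return y * y
```

For x = -2.0 this gives (0.5 * -2.0)^2 = 1.0, whereas ReLU(-2.0)^2 would be exactly 0 with zero gradient.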
BigramHash
Expanded BigramHashEmbedding capacity to reduce hash collisions.
parameters: {"buckets":3072}
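A sketch of how a bigram hash embedding might bucket token pairs; the hash constants here are illustrative, not the submission's actual hash function:

```python
BUCKETS = 3072  # expanded from 2048 to reduce collisions

def bigram_bucket(prev_id: int, cur_id: int, buckets: int = BUCKETS) -> int:
    """Hash a (previous token, current token) pair into an embedding bucket."""
    # Multiplicative mixing of the two ids (illustrative constants).
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return (h & 0xFFFFFFFF) % buckets
```

With more buckets, fewer distinct bigrams share an embedding row, at the cost of a larger embedding table in the artifact.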
Sequence Length
sequence_length
parameters: {"train_length":2048,"eval_length":2048}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
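A sketch of a linear warmdown schedule consistent with the entry above: the learning rate stays constant, then decays linearly to zero over the final 3,500 steps. The total step count is hypothetical (not stated in this summary):

```python
TOTAL_STEPS = 20000    # hypothetical run length, not given in the summary
WARMDOWN_STEPS = 3500  # extended from 3000

def lr_multiplier(step: int) -> float:
    """Constant LR, then linear decay to 0 over the last WARMDOWN_STEPS."""
    warmdown_start = TOTAL_STEPS - WARMDOWN_STEPS
    if step < warmdown_start:
        return 1.0
    return max(0.0, (TOTAL_STEPS - step) / WARMDOWN_STEPS)
```

Lengthening the warmdown trades a few high-LR steps for a longer annealing tail, which often lowers final validation loss slightly.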
Test-Time Training
LoRA TTT
parameters: {"epochs":8}
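A minimal sketch of the LoRA adapter math on the K projection, assuming standard LoRA (frozen base weight plus a trainable low-rank delta); dimensions and rank are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and LoRA rank (illustrative)
W_k = rng.standard_normal((d, d))   # frozen K-projection weight
A = rng.standard_normal((r, d)) * 0.01  # LoRA down-projection (trained at test time)
B = np.zeros((d, r))                # LoRA up-projection, zero-initialized

def k_proj(x: np.ndarray) -> np.ndarray:
    """K projection with the LoRA delta: (W_k + B @ A) @ x."""
    return W_k @ x + B @ (A @ x)

# Min-NLL epoch selection (described above, not implemented here): after
# each of the 8 TTT epochs, record the NLL and keep the (A, B) snapshot
# with the lowest value.
```

Because B starts at zero, the adapted projection initially equals the frozen one, so test-time training can only move away from the base model as the NLL improves.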
Weight Averaging
SWA
parameters: null
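SWA maintains a running mean of checkpoint weights; a minimal sketch over flat parameter lists (the real implementation would operate on model tensors):

```python
def swa_update(avg: list[float], new: list[float], n_averaged: int) -> list[float]:
    """Incremental mean of checkpoints: avg <- avg + (new - avg) / (n + 1)."""
    return [a + (p - a) / (n_averaged + 1) for a, p in zip(avg, new)]

avg = [0.0, 0.0]
for i, ckpt in enumerate([[1.0, 2.0], [3.0, 4.0]]):
    avg = swa_update(avg, ckpt, i)
# avg is now the elementwise mean of the two checkpoints: [2.0, 3.0]
```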
Quantization
int6
bits: 6
scope: all
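A sketch of symmetric per-tensor int6 quantization: signed 6-bit values cover [-32, 31], and using the symmetric range [-31, 31] keeps the scale a single scalar. This is one plausible scheme, not necessarily the submission's exact one:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric int6 quantization (assumes w is not all zeros)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Each weight incurs at most half a quantization step of error, and the int6 codes pack well under the zlib stage below.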
Compression
zlib
level: null
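The final artifact stage is standard zlib compression of the serialized weights; a round-trip sketch using Python's stdlib (the payload is a placeholder, and `level: null` above suggests the default compression level):

```python
import zlib

payload = b"serialized quantized weights"  # placeholder bytes
compressed = zlib.compress(payload)        # default level; summary gives none
restored = zlib.decompress(compressed)
```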
Novel Contributions
- LeakyReLU(0.5)^2 activation replacement in MLP blocks
- Increased training and TTT context length from 1024 to 2048
- Expanded BigramHashEmbedding capacity from 2048 to 3072 buckets
- Extended warmdown schedule from 3000 to 3500 iterations
- LoRA-based test-time training with K-projection and Min-NLL epoch selection