val_bpb: 1.3540
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.15 MB
Training Techniques

Architecture
- BigramHash: specialist MoE with 128 bigram-gating specialists (FastClusterGating) for clustered token routing. parameters: {"specialists":128}
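The report does not spell out how FastClusterGating works; a minimal sketch of bigram-based specialist routing, assuming each position is hashed from its (previous token, current token) pair into one of the 128 specialists. Function names and the hash mixing constants here are hypothetical, not the submission's actual code:

```python
NUM_SPECIALISTS = 128  # matches the report's {"specialists":128}

def bigram_specialist(prev_token: int, cur_token: int) -> int:
    """Map a token bigram to a specialist index in [0, 128)."""
    # Simple multiplicative hash of the bigram; constants are illustrative,
    # any well-mixing integer hash would do.
    h = (prev_token * 1_000_003 + cur_token) * 2_654_435_761
    return (h >> 16) % NUM_SPECIALISTS

def route(tokens: list[int]) -> list[int]:
    """Assign each position (after the first) to a specialist
    based on the bigram ending at that position."""
    return [bigram_specialist(p, c) for p, c in zip(tokens, tokens[1:])]
```

Because routing is a fixed hash rather than a learned gate, it is cheap and deterministic, which is presumably the point of "Fast" in the name.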
- Transformer: 800-dimensional Transformer with 6 layers and 10 attention heads per layer. parameters: {"dimensions":800,"layers":6,"heads":10}
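The stated dimensions imply a per-head dimension of 800 / 10 = 80. A small config sketch (the class name is illustrative, not from the submission):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    dim: int = 800     # model width, from the report
    layers: int = 6    # transformer blocks, from the report
    heads: int = 10    # attention heads per layer, from the report

    @property
    def head_dim(self) -> int:
        # Width must divide evenly across heads.
        assert self.dim % self.heads == 0
        return self.dim // self.heads
```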
Optimizer
- Muon: used for internal representations. weight_decay: null, momentum: null
- Adam: used for the cluster parameters. weight_decay: null, momentum: null
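The report does not say how parameters are assigned to each optimizer. One plausible split, assuming cluster/gating parameters are identifiable by name (the name-matching rule here is an assumption for illustration):

```python
def split_param_groups(named_params):
    """Partition (name, param) pairs into a Muon group for internal
    representations and an Adam group for cluster/gating parameters."""
    muon_group, adam_group = [], []
    for name, param in named_params:
        # Assumed convention: cluster/gating tensors carry these substrings.
        if "gate" in name or "cluster" in name:
            adam_group.append(param)
        else:
            muon_group.append(param)
    return muon_group, adam_group
```

Muon's orthogonalized updates suit dense 2-D weight matrices, while Adam is the safer default for the small, irregular gating tensors, which is one reason such a split is common.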
Weight Averaging
- SWA (Stochastic Weight Averaging). parameters: null
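SWA keeps a running average of weights sampled along the training trajectory and evaluates with the averaged weights. A minimal pure-Python sketch of the running average (a real run would average model tensors, typically only over the late phase of training):

```python
class SWAAverager:
    """Maintain an equal-weight running average of sampled weight vectors."""

    def __init__(self):
        self.n = 0      # number of snapshots averaged so far
        self.avg = None # current average, same shape as the weights

    def update(self, weights):
        # Incremental mean: avg += (w - avg) / n
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```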
Quantization
- mixed int5/int6: 5-bit quantization for the MLP weights, 6-bit for the remaining parameters. bits: 5, scope: MLP/rest
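A symmetric k-bit quantizer illustrates the scheme: 5 bits for MLP weights (integer range -16..15) and 6 bits for the rest (-32..31). This sketch quantizes a flat list with a single per-tensor scale; the submission's exact grouping and rounding are not specified:

```python
def quantize(values, bits):
    """Symmetric k-bit quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 15 for int5, 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid scale == 0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from quantized integers."""
    return [v * scale for v in q]
```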
Compression
- zlib. level: null
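The quantized integers can then be serialized and zlib-compressed to produce the final artifact. A sketch using one int8 container byte per value (the real artifact presumably bit-packs to 5/6 bits; `level: null` is read here as zlib's default level):

```python
import struct
import zlib

def pack_and_compress(q_values, scale):
    """Serialize a scale (float32) plus int8 quantized values, then zlib-compress."""
    raw = struct.pack(f"<f{len(q_values)}b", scale, *q_values)
    return zlib.compress(raw)  # default compression level

def decompress_and_unpack(blob, n):
    """Inverse of pack_and_compress for n quantized values."""
    raw = zlib.decompress(blob)
    vals = struct.unpack(f"<f{n}b", raw)
    return list(vals[1:]), vals[0]
```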
Other
- Hard 600-second wallclock limit enforced in the training script. parameters: {"max_wallclock_seconds":600}
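A wallclock guard of this kind can be a monotonic-clock check consulted between training steps, stopping the loop before the 600-second budget is exceeded (the class name is hypothetical):

```python
import time

class WallclockGuard:
    """Track elapsed wallclock time against a hard budget."""

    def __init__(self, limit_seconds=600.0):
        self.limit = limit_seconds
        self.start = time.monotonic()  # monotonic: immune to clock adjustments

    def expired(self):
        """True once the budget is used up; checked between training steps."""
        return time.monotonic() - self.start >= self.limit
```

A training loop would typically check `guard.expired()` each step and break out early, leaving a margin for saving the artifact.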
Sequence Length
- train_length: 1024, eval_length: null
Novel Contributions
- 128-cluster specialist MoE architecture with FastClusterGating
- Hybrid 5-bit MLP / 6-bit rest quantization scheme
- Stochastic Weight Averaging for improved generalization
- Muon optimizer for internal representations with Adam for clusters
- Hard 600-second wallclock compliance guard
- Artifact size kept under the 16 MB limit