PR #985

open

Add 128-cluster baseline submission files

by danielweidinger2299-debug
val_bpb
1.3540
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.15 MB

Training Techniques

Architecture
BigramHash
Specialist MoE with 128 bigram-gating specialists (FastClusterGating) for clustered token routing.
parameters: {"specialists":128}
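The FastClusterGating code itself is not shown in this summary; a minimal sketch of bigram-hash routing to 128 specialists, where the function names and the hash choice are assumptions rather than the PR's actual implementation, might look like:

```python
# Hypothetical sketch of bigram-hash specialist routing (names assumed).
NUM_SPECIALISTS = 128  # matches parameters: {"specialists":128}

def bigram_specialist(prev_token: int, cur_token: int) -> int:
    """Route a token to one of 128 specialists by hashing its bigram."""
    # Simple multiplicative hash of the (prev, cur) token-id pair;
    # the real FastClusterGating hash is not specified in this PR.
    h = (prev_token * 1000003 + cur_token) & 0xFFFFFFFF
    return h % NUM_SPECIALISTS

def route_sequence(token_ids):
    """Return one specialist index per position (first position pairs with id 0)."""
    prev = 0
    routes = []
    for t in token_ids:
        routes.append(bigram_specialist(prev, t))
        prev = t
    return routes
```

Because the routing is a pure function of the token bigram, it needs no learned gating network, which is presumably where the "Fast" in FastClusterGating comes from.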
Transformer
800-dimensional Transformer with 6 layers and 10 attention heads per layer.
parameters: {"dimensions":800,"layers":6,"heads":10}
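As a rough sanity check on the stated shape (800-dim, 6 layers), a standard per-layer estimate counts 4·d² attention weights plus the MLP matrices; the 4x FFN multiplier is an assumption, and embeddings are excluded:

```python
def transformer_param_count(d=800, layers=6, ffn_mult=4):
    """Approximate non-embedding parameter count for a standard block:
    attention projections (Q, K, V, O = 4 * d^2) plus a two-matrix MLP
    (up + down = 2 * ffn_mult * d^2). The 4x FFN multiplier is assumed,
    not stated in the PR."""
    per_layer = 4 * d * d + 2 * ffn_mult * d * d
    return layers * per_layer
```

With these assumptions the block stack alone is on the order of 46M parameters, which is why sub-byte quantization and compression are needed to meet the artifact limit.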
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"use_for":"internal representations"}
Adam
weight_decay: null
momentum: null
other_params: {"use_for":"clusters"}
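The metadata says Muon updates the internal representations while Adam handles the cluster parameters. A sketch of that split, assuming parameters are distinguishable by name (the PR does not state the actual selection criterion), could be:

```python
# Hypothetical parameter split: Muon for internal (matrix) weights,
# Adam for cluster/gating parameters, as the PR metadata describes.
def split_param_groups(named_params):
    """Partition (name, param) pairs into Muon and Adam groups by name."""
    muon_group, adam_group = [], []
    for name, p in named_params:
        # Assumption: cluster/gating tensors are identifiable by substring;
        # the real selection rule is not given in this summary.
        if "cluster" in name or "gate" in name:
            adam_group.append((name, p))
        else:
            muon_group.append((name, p))
    return muon_group, adam_group
```

Each group would then be handed to its respective optimizer; Muon is typically applied only to 2D weight matrices, which is consistent with "use_for": "internal representations".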
Weight Averaging
SWA
parameters: null
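Plain SWA keeps an equal-weight running average of model snapshots taken during training; since the PR lists no SWA parameters, a generic running-average sketch is all that can be inferred:

```python
class SWAAverager:
    """Equal-weight running average of model snapshots (plain SWA).
    Update schedule and snapshot frequency are not stated in the PR."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, weights):
        """Fold one snapshot (a flat list of floats) into the average."""
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

The averaged weights, not the final iterate, would be the ones quantized and shipped in the artifact.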
Quantization
mixed int5/int6
bits: 5 (MLP) / 6 (rest)
scope: MLP weights at 5-bit, remaining weights at 6-bit
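A hybrid scheme like this usually just runs the same symmetric linear quantizer at two different bit widths per tensor group; the quantizer below is a generic sketch of that idea, not the PR's actual code:

```python
def quantize_symmetric(values, bits):
    """Symmetric linear quantization of floats to signed `bits`-bit ints.
    bits=5 gives range [-15, 15]; bits=6 gives [-31, 31]."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized ints and their scale."""
    return [x * scale for x in q]
```

Under this scheme MLP tensors would call `quantize_symmetric(w, bits=5)` and everything else `bits=6`, trading a little MLP precision for artifact size.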
Compression
zlib
level: null
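Since the compression level is listed as null, the exact setting is unknown; a minimal round-trip sketch of zlib-compressing packed float32 weights (level 9 here is an assumption) would be:

```python
import struct
import zlib

def compress_weights(floats, level=9):
    """Pack float32 weights little-endian and zlib-compress the bytes.
    The PR leaves the compression level null; 9 is assumed here."""
    raw = struct.pack(f"<{len(floats)}f", *floats)
    return zlib.compress(raw, level)

def decompress_weights(blob):
    """Inverse: decompress and unpack back to a list of floats."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

In practice the quantized int5/int6 payload, not raw float32, is what would be compressed; zlib then squeezes out the remaining redundancy to stay under the 16MB artifact cap.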
Other
other
Hard 600-second wallclock limit enforced in the training script.
parameters: {"max_wallclock_seconds":600}
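A hard wallclock guard typically checks elapsed time before each step and refuses to start a step past the budget; this sketch assumes that check-before-step structure, which the summary does not spell out:

```python
import time

MAX_WALLCLOCK_SECONDS = 600  # matches {"max_wallclock_seconds":600}

def train(step_fn, num_steps, budget=MAX_WALLCLOCK_SECONDS):
    """Run training steps, hard-stopping once the wallclock budget is spent.

    Uses time.monotonic() so clock adjustments cannot extend the run.
    Returns the number of steps actually completed.
    """
    start = time.monotonic()
    done = 0
    for _ in range(num_steps):
        if time.monotonic() - start >= budget:
            break  # hard stop: never begin a step past the budget
        step_fn()
        done += 1
    return done
```

Checking before the step (rather than after) means the guard can still overshoot by at most one step's duration, which is the usual compromise when steps cannot be interrupted mid-flight.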
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • 128-cluster specialist MoE architecture with FastClusterGating
  • Hybrid 5-bit MLP / 6-bit rest quantization scheme
  • Stochastic Weight Averaging for improved generalization
  • Muon optimizer for internal representations with Adam for clusters
  • Hard 600-second wallclock compliance guard
  • Artifact size kept under the 16MB limit