PR #466

Status: open

Record: 11L EMA + BigramHash(12288) + Mixed Int5 + FA3 (1.1354)

by simonbissonnette
val_bpb
1.1354
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,967,704 bytes

Training Techniques

Architecture
BigramHash
Adds a bigram hashing component to the model.
parameters: {"buckets":12288,"dim":128}
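The BigramHash entry above maps each (previous token, current token) pair into one of 12288 buckets and adds a 128-dimensional learned embedding to the hidden stream. A minimal sketch follows; the bucket count and dimension come from the listed parameters, but the specific hash-mixing function is an assumption:

```python
import numpy as np

BUCKETS, DIM = 12288, 128  # from parameters {"buckets":12288,"dim":128}

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_hash(prev_ids, cur_ids):
    # Mix the two token ids into a bucket index; the exact mixing
    # constant is an assumption -- any cheap pair hash works here.
    return (prev_ids * 1000003 + cur_ids) % BUCKETS

def bigram_features(token_ids):
    # token_ids: (seq,) int array; position 0 has no predecessor, pad with 0.
    prev = np.concatenate([[0], token_ids[:-1]])
    idx = bigram_hash(prev, token_ids)
    return bigram_table[idx]  # (seq, DIM), added to the token embeddings

feats = bigram_features(np.array([5, 17, 17, 42]))
```

Collisions are tolerated by design: with 12288 buckets, frequent bigrams dominate their bucket's gradient signal.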
tied embeddings
Uses tied input and output embeddings.
parameters: null
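Tied embeddings use a single matrix both to look up input token vectors and, transposed, to project hidden states back to vocabulary logits, halving embedding-related parameter count. A sketch with illustrative sizes (the vocabulary size here is an assumption):

```python
import numpy as np

VOCAB, DIM = 1000, 512  # VOCAB is illustrative; DIM matches the 512-d model

rng = np.random.default_rng(0)
W = rng.standard_normal((VOCAB, DIM)).astype(np.float32) * 0.02  # one shared matrix

def embed(token_ids):
    return W[token_ids]      # input embedding: row lookup

def logits(hidden):
    return hidden @ W.T      # output head: same matrix, transposed

h = embed(np.array([1, 2, 3]))
out = logits(h)
```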
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
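With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, shrinking the KV cache. A sketch of the score computation, assuming a head dimension of 64 (not stated in the card):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64  # heads/kv_heads from parameters; HEAD_DIM assumed
GROUP = HEADS // KV_HEADS             # 2 query heads share each KV head

def gqa_scores(q, k):
    # q: (HEADS, seq, HEAD_DIM); k: (KV_HEADS, seq, HEAD_DIM)
    k_rep = np.repeat(k, GROUP, axis=0)  # expand KV heads to match query heads
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)

rng = np.random.default_rng(0)
q = rng.standard_normal((HEADS, 16, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, 16, HEAD_DIM))
scores = gqa_scores(q, k)
```

In a real kernel the KV heads are not materialized per query head; the repeat above is only for clarity.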
MLP3x
Uses an MLP multiplier of 3.0.
parameters: {"multiplier":3}
Weight Averaging
EMA
Maintains an exponential moving average of the model weights for evaluation.
parameters: {"decay":0.997}
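The EMA keeps a shadow copy of the weights updated as `ema ← decay·ema + (1−decay)·weights` after each step, and the averaged copy is used at evaluation time. A minimal sketch using the listed decay:

```python
DECAY = 0.997  # from parameters {"decay":0.997}

def ema_update(ema_params, params, decay=DECAY):
    # Shadow copy: ema <- decay * ema + (1 - decay) * current weights.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: track a single scalar "weight" over three steps (values 1, 2, 3).
ema = [0.0]
for step in range(1, 4):
    ema = ema_update(ema, [float(step)])
```

With decay 0.997 the average has an effective horizon of roughly 1/(1−0.997) ≈ 333 steps.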
Quantization
mixed int5
bits: 5
scope: mixed low-bit quantization
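The 5-bit quantization can be illustrated with symmetric per-tensor rounding into the signed int5 range [−16, 15]; which tensors get 5 bits under the "mixed" scope is not specified here, and the per-tensor scale scheme is an assumption:

```python
import numpy as np

BITS = 5
QMAX = 2 ** (BITS - 1) - 1  # 15 for signed int5

def quantize_int5(w):
    # Symmetric per-tensor quantization; stored in int8, values fit in 5 bits.
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)  # round-trip error bounded by scale / 2
```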
Evaluation
stride-based sliding window eval
parameters: {"stride":64}
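Stride-based sliding-window evaluation advances the context window 64 tokens at a time and scores only the new tokens in each step, so every token is evaluated exactly once with ample left context. A sketch of the window bookkeeping, assuming the window size matches the 2048-token train length (the scoring call itself is omitted):

```python
WINDOW, STRIDE = 2048, 64  # stride from parameters {"stride":64}; window assumed

def eval_windows(seq_len, window=WINDOW, stride=STRIDE):
    spans = []
    pos = 0
    while pos < seq_len:
        begin = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        # tokens [pos, end) are scored using context [begin, end)
        spans.append((begin, pos, end))
        pos = end
    return spans

spans = eval_windows(5000)
```

Smaller strides cost more forward passes but give each scored token a longer effective context.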
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025}
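Muon's core idea is to accumulate a momentum buffer per weight matrix and orthogonalize the update with a few Newton-Schulz iterations before applying it. The sketch below follows the public Muon reference (its quintic coefficients included); the exact form of weight decay used in this submission is an assumption:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration driving singular values toward 1;
    # coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    if G.shape[0] > G.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(param, grad, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    buf = momentum * buf + grad                # momentum accumulation
    update = newton_schulz(buf)                # orthogonalized direction
    param = param * (1.0 - lr * weight_decay)  # decoupled decay (assumed form)
    return param - lr * update, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
g = rng.standard_normal((32, 64))
W2, buf = muon_step(W, g, np.zeros_like(W))
```

Muon applies only to 2-D weight matrices; scalars and embeddings are handled by a separate optimizer path, which matches the distinct `matrix_lr`/`scalar_lr` entries above.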
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.48}
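The warmdown schedule reads as: linear warmup over 20 steps, a constant plateau, then a linear decay over the final 48% of training. A sketch with an illustrative total step count (the plateau-then-linear-decay shape is the usual warmdown convention, assumed here):

```python
WARMUP_STEPS = 20     # from parameters {"warmup_steps":20}
WARMDOWN_FRAC = 0.48  # from parameters {"warmdown_frac":0.48}

def lr_scale(step, total_steps):
    warmdown_start = int(total_steps * (1.0 - WARMDOWN_FRAC))
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS           # linear warmup
    if step < warmdown_start:
        return 1.0                                  # constant plateau
    # linear decay to 0 over the warmdown window
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))

scales = [lr_scale(s, 1000) for s in range(1000)]  # total_steps illustrative
```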
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Uses FlashAttention-3 via kernels-community/flash-attn3, which fetches the FA3 kernel package at runtime.
parameters: null

Novel Contributions

  • 11-layer, 512-dimensional GQA Transformer submission
  • BigramHash with 12288 buckets and 128-dimensional embeddings
  • EMA with decay 0.997
  • Mixed low-bit quantization, with 5-bit quantization applied to the attention and bigram weights
  • Stride-64 sliding evaluation
  • FlashAttention-3 runtime path via kernels-community/flash-attn3