PR #483
Closed
Track 10min_16mb: PR #287 family rerun at 585s wallclock (mean val_bpb=1.1346)
by tmustier
val_bpb
1.1346
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,000,000 bytes
Training Techniques
Architecture
XSA
Applies XSA to the last 4 layers, as configured for the rerun family.
parameters: {"last_n":4}
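The XSA variant itself is not specified in this summary; a minimal, hypothetical sketch of the `{"last_n": 4}` selection logic (which layers get the variant) could look like:

```python
# Hypothetical sketch: apply a per-layer feature (here, the PR's "XSA"
# attention variant) only to the last `last_n` layers, per {"last_n": 4}.
# The layer count of 12 below is illustrative, not from the PR.
def xsa_layer_mask(n_layers: int, last_n: int) -> list:
    """Return one bool per layer; True means the layer uses XSA."""
    return [i >= n_layers - last_n for i in range(n_layers)]

mask = xsa_layer_mask(12, 4)  # only layers 8-11 enabled
```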
BigramHash
Adds a bigram hashing component to the model.
parameters: {"vocab_size":2048,"dim":128}
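A hedged sketch of what a bigram-hash component with `{"vocab_size": 2048, "dim": 128}` might do: hash each (previous token, current token) pair into a small embedding table and add the looked-up vector to the token features. The hash function and table initialization here are assumptions, not taken from the PR.

```python
import numpy as np

VOCAB_HASH, DIM = 2048, 128  # from the PR's parameters
table = np.random.default_rng(0).normal(0, 0.02, (VOCAB_HASH, DIM))

def bigram_hash(prev_ids, cur_ids):
    # Simple multiplicative mix; the actual hash is unspecified in the PR.
    h = (prev_ids * 1000003 + cur_ids) % VOCAB_HASH
    return table[h]  # shape: (seq_len, DIM)

ids = np.array([5, 17, 42, 42])
prev = np.concatenate([[0], ids[:-1]])  # shifted ids; 0 as a BOS placeholder
feat = bigram_hash(prev, ids)           # bigram features to add to embeddings
```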
MLP3x
Widens the MLP hidden layer to 3× the model width.
parameters: {"mlp_mult":3}
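As a quick sketch of what `mlp_mult=3` implies for parameter count (assuming a standard two-projection, bias-free MLP; the exact block layout is not given here):

```python
# Parameter count of a d_model -> mlp_mult*d_model -> d_model MLP block.
def mlp_param_count(d_model: int, mlp_mult: int) -> int:
    hidden = mlp_mult * d_model
    return d_model * hidden + hidden * d_model  # up- and down-projection, no bias

# Illustrative width only; the PR's d_model is not stated in this summary.
params_3x = mlp_param_count(768, 3)
```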
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
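With 8 query heads and 4 KV heads, each KV head is shared by `heads // kv_heads = 2` query heads (grouped-query attention). A minimal NumPy sketch of the sharing, with illustrative shapes:

```python
import numpy as np

heads, kv_heads, d_head, T = 8, 4, 16, 5  # d_head and T are illustrative
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, T, d_head))
k = rng.normal(size=(kv_heads, T, d_head))

# Repeat each KV head for its group of query heads before attention.
k_full = np.repeat(k, heads // kv_heads, axis=0)        # (8, T, d_head)
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)  # (8, T, T)
```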
weight tying
Ties the input embedding and output projection weights.
parameters: null
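Weight tying means a single matrix serves as both the token embedding and the output head. A minimal sketch (vocabulary and width below are illustrative):

```python
import numpy as np

V, D = 100, 32  # illustrative vocab size and model width
W = np.random.default_rng(0).normal(0, 0.02, (V, D))  # one shared matrix

def embed(ids):
    return W[ids]        # input embedding: row lookup

def logits(h):
    return h @ W.T       # output head: project onto the same rows
```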
Quantization
QAT
bits: 6
scope: all
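A hedged sketch of the 6-bit quantization-aware training (QAT) round trip: weights are fake-quantized to int6 in the forward pass (with straight-through gradients during training, omitted here). The symmetric per-tensor scheme below is an assumption; the PR's exact scheme is not stated.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization: quantize to int{bits}, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)

w = np.random.default_rng(0).normal(0, 0.1, 64)
w_dq, q = fake_quant(w)  # dequantized weights and their int6 codes
```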
Weight Averaging
EMA
parameters: {"decay":0.997}
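EMA with `decay=0.997` keeps a shadow copy of the weights that moves a fraction `1 - decay` toward the live weights after each step; evaluation then uses the shadow copy. A minimal sketch:

```python
# EMA weight averaging: shadow <- decay * shadow + (1 - decay) * live.
def ema_update(shadow, live, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, live)]

shadow = [0.0]
for _ in range(1000):          # with live weight fixed at 1.0,
    shadow = ema_update(shadow, [1.0])  # shadow approaches 1 - 0.997**1000
```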
Evaluation
stride-based eval
parameters: {"stride":64}
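Stride-based evaluation typically slides a full-length context window forward in steps of `stride` and scores only the tokens not covered by the previous window, so most tokens are scored with near-full context. A sketch of the window bookkeeping under that assumption (the PR's exact scheme is not spelled out):

```python
def eval_spans(n_tokens, ctx_len=2048, stride=64):
    """Yield (window_start, window_end, n_scored) per evaluation window."""
    spans, prev_end = [], 0
    for end in range(min(ctx_len, n_tokens), n_tokens + 1, stride):
        start = max(0, end - ctx_len)          # window never exceeds ctx_len
        spans.append((start, end, end - prev_end))
        prev_end = end
    return spans

spans = eval_spans(2176)  # first window scores everything, later ones 64 each
```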
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
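A hedged sketch of this schedule shape: linear warmup over 20 steps, flat at the base LR, then a linear "warmdown" to zero over the final 3000 iterations. Total step count and base LR below are illustrative, not from the PR.

```python
def lr_at(step, total_steps, base_lr=1.0, warmup=20, warmdown=3000):
    """Trapezoidal schedule: linear warmup, constant, linear warmdown."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown
    return base_lr
```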
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Uses FlashAttention 3 for training.
parameters: null
other
Exports weights as int6 + zstd to fit the 16,000,000-byte artifact size limit.
parameters: null
Novel Contributions
- 3-seed rerun of the PR #287 family under a 585s wallclock cap
- Use of FlashAttention 3 on 8×H100 SXM
- Combination of XSA, EMA, BigramHash, and QAT
- int6 + zstd export to keep all seeds under the 16MB artifact limit
- Achieved a mean val_bpb of 1.1346, beating the merged SOTA of 1.1428