PR #1066 (open)

Add competitive 8xH100 run package (1.1259 bpb)

by adityakm24
val_bpb: 1.1259
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,943,528 bytes

Training Techniques

Architecture
  • GQA — grouped-query attention; parameters: {"num_heads": 8, "num_kv_heads": 4}
  • XSA — applied to the last 4 layers; parameters: {"layers": 4}
  • Partial RoPE — rotary position embeddings applied to a subset of head dimensions; parameters: {"dimensions": 16}
  • SmearGate
  • BigramHash
  • ValueEmbedding
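The GQA and Partial RoPE entries above can be sketched concretely. This is an illustrative reading of the card's parameters (8 query heads sharing 4 KV heads; rotary applied to only the first 16 dimensions), not the submission's code; the base frequency of 10000 is an assumption.

```python
import math

NUM_HEADS = 8      # query heads (from the card)
NUM_KV_HEADS = 4   # shared key/value heads (from the card)
ROPE_DIMS = 16     # leading dims that receive rotary embeddings (from the card)

def kv_head_for(query_head: int) -> int:
    """GQA: each group of NUM_HEADS // NUM_KV_HEADS query heads shares one KV head."""
    return query_head // (NUM_HEADS // NUM_KV_HEADS)

def partial_rope(vec: list[float], pos: int) -> list[float]:
    """Partial RoPE: rotate only the first ROPE_DIMS dims in pairs;
    the remaining dims pass through unchanged. Base frequency 10000
    is an assumption, not from the card."""
    out = list(vec)
    for i in range(0, ROPE_DIMS, 2):
        theta = pos / (10000 ** (i / ROPE_DIMS))
        a, b = vec[i], vec[i + 1]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

With these parameters, query heads 0–1 read KV head 0, heads 2–3 read KV head 1, and so on, and only the leading 16 of each head's dimensions are position-rotated.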
Optimizer
  • Muon — weight_decay and momentum unspecified; other_params: {"adamw": true}
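The {"adamw": true} flag suggests the common Muon/AdamW split, where 2-D weight matrices are updated with Muon and everything else (embeddings, norms, biases) falls back to AdamW. A minimal sketch of that routing rule, assuming this interpretation of the flag; the name filter is illustrative:

```python
def route_param(name: str, ndim: int) -> str:
    """Route a parameter to an optimizer, assuming the usual Muon convention:
    2-D hidden-layer matrices go to Muon; embeddings and non-matrix
    parameters go to AdamW. The "embed" name check is a hypothetical filter."""
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adamw"
```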
Weight Averaging
  • EMA
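The EMA weight averaging listed above follows the standard update; a minimal sketch, noting that the card gives no decay value, so the default here is an assumption:

```python
def ema_update(ema: list[float], params: list[float], decay: float = 0.999) -> list[float]:
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*params.
    decay=0.999 is an assumed default; the card does not specify one."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```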
Quantization
  • Late QAT — scope: mixed int6/int8
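Mixed int6/int8 quantization can be illustrated with symmetric per-tensor rounding; this is a generic sketch of the width difference, not the submission's scheme, and which layers get 6 vs. 8 bits is not stated in the card:

```python
def quantize(weights: list[float], bits: int):
    """Symmetric quantization to a signed `bits`-wide integer grid.
    qmax is 31 for int6 and 127 for int8."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # guard all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]
```

In late QAT, this rounding is simulated during the final stretch of training so the weights settle onto the quantization grid before the artifact is exported.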
Evaluation
  • Sliding-window eval — parameters: {"stride": 64}
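Sliding-window evaluation with a short stride scores the sequence in chunks of `stride` tokens, each conditioned on a longer left context, so most tokens are evaluated with near-full context at the cost of re-reading overlap. A sketch of the scoring loop, with `nll_fn` as a stand-in for the model's per-token negative log-likelihood:

```python
def sliding_window_eval(tokens, window: int, stride: int, nll_fn) -> float:
    """Mean per-token NLL under sliding-window evaluation.
    Assumes len(tokens) is a multiple of `stride` (sketch simplification)."""
    total, count = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        ctx = tokens[max(0, end - window):end]
        # only the last `stride` positions of ctx are newly scored;
        # the earlier positions serve as context
        for pos in range(len(ctx) - stride, len(ctx)):
            total += nll_fn(ctx, pos)
            count += 1
    return total / count
```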
Test-Time Training
  • Score-first TTT — parameters: {"enabled": 1, "learning_rate": 0.002, "epochs": 3, "chunk_tokens": 32768, "freeze_blocks": 0}
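"Score-first" TTT means each chunk is scored with the current weights before the model adapts on it, so the reported metric never covers a chunk the model has already trained on. A minimal sketch of that ordering, with `score` and `train_step` as stand-ins for the submission's model calls; defaults mirror the card's learning_rate and epochs:

```python
def score_first_ttt(chunks, score, train_step, epochs: int = 3, lr: float = 0.002) -> float:
    """Mean loss under score-first test-time training: evaluate each chunk
    first (the legal ordering), then adapt on it for `epochs` passes."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # scored before any adaptation on this chunk
        for _ in range(epochs):
            train_step(chunk, lr)     # model may now improve for later chunks
    return sum(losses) / len(losses)
```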
Compression
  • lzma — level unspecified
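The lzma step packages the quantized weights to stay under the 16 MB artifact cap. A sketch using Python's stdlib `lzma` module; the preset is an assumption, since the card does not specify a level:

```python
import lzma

def pack(raw: bytes) -> bytes:
    """Compress the serialized artifact; preset 9 | PRESET_EXTREME is an
    assumed setting (maximum compression), not taken from the card."""
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```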
Sequence Length
  • train_length: 32768; eval_length unspecified
Other
  • 8xH100 SXM training on Modal with a 600 s wallclock cap; parameters: {"gpus": 8, "wallclock_seconds": 600}

Novel Contributions

  • Competitive 8xH100 run package with full submission artifacts
  • Best legal score-first TTT run, with an exact metric of 1.12587738 bpb
  • Mixed int6/int8 quantized artifact kept under the 16MB submission cap
  • Use of GQA, XSA, Partial RoPE, SmearGate, BigramHash, and ValueEmbedding
  • EMA plus Muon/AdamW training stack with late QAT
  • Sliding-window evaluation combined with score-first test-time training