PR #1066 (open)

Add competitive 8xH100 run package (1.1259 bpb)

by adityakm24
val_bpb: 1.1259
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,943,528 bytes

Training Techniques

Architecture
  • GQA — grouped-query attention; parameters: {"num_heads": 8, "num_kv_heads": 4}
  • XSA — applied to the last 4 layers; parameters: {"layers": 4}
  • Partial RoPE — rotary position embeddings applied to a subset of head dimensions; parameters: {"dimensions": 16}
  • SmearGate
  • BigramHash
  • ValueEmbedding
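The GQA and Partial RoPE entries above can be sketched concretely. This is an illustrative reading of the card's parameters (8 query heads sharing 4 KV heads; rotary applied to only the first 16 dimensions), not the submission's code; the base frequency of 10000 is an assumption.

```python
import math

NUM_HEADS = 8      # query heads (from the card)
NUM_KV_HEADS = 4   # shared key/value heads (from the card)
ROPE_DIMS = 16     # leading dims that receive rotary embeddings (from the card)

def kv_head_for(query_head: int) -> int:
    """GQA: each group of NUM_HEADS // NUM_KV_HEADS query heads shares one KV head."""
    return query_head // (NUM_HEADS // NUM_KV_HEADS)

def partial_rope(vec: list[float], pos: int) -> list[float]:
    """Partial RoPE: rotate only the first ROPE_DIMS dims in pairs;
    the remaining dims pass through unchanged. Base frequency 10000
    is an assumption, not from the card."""
    out = list(vec)
    for i in range(0, ROPE_DIMS, 2):
        theta = pos / (10000 ** (i / ROPE_DIMS))
        a, b = vec[i], vec[i + 1]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

With these parameters, query heads 0–1 read KV head 0, heads 2–3 read KV head 1, and so on, and only the leading 16 of each head's dimensions are position-rotated.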
Optimizer
  • Muon — weight_decay and momentum unspecified; other_params: {"adamw": true}
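The {"adamw": true} flag suggests the common Muon/AdamW split, where 2-D weight matrices are updated with Muon and everything else (embeddings, norms, biases) falls back to AdamW. A minimal sketch of that routing rule, assuming this interpretation of the flag; the name filter is illustrative:

```python
def route_param(name: str, ndim: int) -> str:
    """Route a parameter to an optimizer, assuming the usual Muon convention:
    2-D hidden-layer matrices go to Muon; embeddings and non-matrix
    parameters go to AdamW. The "embed" name check is a hypothetical filter."""
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adamw"
```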
Weight Averaging
  • EMA
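The EMA weight averaging listed above follows the standard update; a minimal sketch, noting that the card gives no decay value, so the default here is an assumption:

```python
def ema_update(ema: list[float], params: list[float], decay: float = 0.999) -> list[float]:
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*params.
    decay=0.999 is an assumed default; the card does not specify one."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```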
Quantization
  • Late QAT — scope: mixed int6/int8
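Mixed int6/int8 quantization can be illustrated with symmetric per-tensor rounding; this is a generic sketch of the width difference, not the submission's scheme, and which layers get 6 vs. 8 bits is not stated in the card:

```python
def quantize(weights: list[float], bits: int):
    """Symmetric quantization to a signed `bits`-wide integer grid.
    qmax is 31 for int6 and 127 for int8."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # guard all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]
```

In late QAT, this rounding is simulated during the final stretch of training so the weights settle onto the quantization grid before the artifact is exported.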
Evaluation
  • Sliding-window eval — parameters: {"stride": 64}
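Sliding-window evaluation with a short stride scores the sequence in chunks of `stride` tokens, each conditioned on a longer left context, so most tokens are evaluated with near-full context at the cost of re-reading overlap. A sketch of the scoring loop, with `nll_fn` as a stand-in for the model's per-token negative log-likelihood:

```python
def sliding_window_eval(tokens, window: int, stride: int, nll_fn) -> float:
    """Mean per-token NLL under sliding-window evaluation.
    Assumes len(tokens) is a multiple of `stride` (sketch simplification)."""
    total, count = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        ctx = tokens[max(0, end - window):end]
        # only the last `stride` positions of ctx are newly scored;
        # the earlier positions serve as context
        for pos in range(len(ctx) - stride, len(ctx)):
            total += nll_fn(ctx, pos)
            count += 1
    return total / count
```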
Test-Time Training
  • Score-first TTT — parameters: {"enabled": 1, "learning_rate": 0.002, "epochs": 3, "chunk_tokens": 32768, "freeze_blocks": 0}
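"Score-first" TTT means each chunk is scored with the current weights before the model adapts on it, so the reported metric never covers a chunk the model has already trained on. A minimal sketch of that ordering, with `score` and `train_step` as stand-ins for the submission's model calls; defaults mirror the card's learning_rate and epochs:

```python
def score_first_ttt(chunks, score, train_step, epochs: int = 3, lr: float = 0.002) -> float:
    """Mean loss under score-first test-time training: evaluate each chunk
    first (the legal ordering), then adapt on it for `epochs` passes."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # scored before any adaptation on this chunk
        for _ in range(epochs):
            train_step(chunk, lr)     # model may now improve for later chunks
    return sum(losses) / len(losses)
```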
Compression
  • lzma — level unspecified
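The lzma step packages the quantized weights to stay under the 16 MB artifact cap. A sketch using Python's stdlib `lzma` module; the preset is an assumption, since the card does not specify a level:

```python
import lzma

def pack(raw: bytes) -> bytes:
    """Compress the serialized artifact; preset 9 | PRESET_EXTREME is an
    assumed setting (maximum compression), not taken from the card."""
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```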
Sequence Length
  • train_length: 32768; eval_length unspecified
Other
  • 8xH100 SXM training on Modal with a 600 s wallclock cap; parameters: {"gpus": 8, "wallclock_seconds": 600}

Novel Contributions

  • Competitive 8xH100 run package with full submission artifacts
  • Best legal score-first TTT run, with an exact metric of 1.12587738 bpb
  • Mixed int6/int8 quantized artifact kept under the 16MB submission cap
  • Use of GQA, XSA, Partial RoPE, SmearGate, BigramHash, and ValueEmbedding
  • EMA plus Muon/AdamW training stack with late QAT
  • Sliding-window evaluation combined with score-first test-time training