PR #1617

open

[Non-Record] Single H100, 16 MB, < 1.21 bpb

by adityasasidhar
val_bpb
1.2192
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.92 MB

Training Techniques

Architecture
GQA
Grouped query attention with fewer KV heads for parameter efficiency.
parameters: {"query_heads":8,"kv_heads":4}
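A minimal numpy sketch of the GQA idea with the PR's head counts (8 query heads, 4 KV heads); this is an illustration, not the PR's code. Each KV head serves 2 query heads, so the KV projections are half the size of standard multi-head attention:

```python
import numpy as np

# Hypothetical GQA sketch: 8 query heads share 4 KV heads, so each KV head
# serves a group of 2 query heads and KV projections shrink by half.
def gqa_attention(q, k, v, query_heads=8, kv_heads=4):
    # q: (T, query_heads, d); k, v: (T, kv_heads, d)
    group = query_heads // kv_heads
    k = np.repeat(k, group, axis=1)          # expand KV heads to match queries
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)         # softmax over source positions
    return np.einsum("hts,shd->thd", w, v)   # (T, query_heads, d)

rng = np.random.default_rng(0)
T, d = 16, 8
out = gqa_attention(rng.normal(size=(T, 8, d)),
                    rng.normal(size=(T, 4, d)),
                    rng.normal(size=(T, 4, d)))
print(out.shape)  # (16, 8, 8)
```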
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":32}
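A hedged sketch of partial RoPE: rotate only the first `rot_dims` components of each head (32 here, per the PR's parameters) and pass the remaining dimensions through unrotated. The rotation-pair layout is an assumption; implementations differ on how dimensions are paired.

```python
import numpy as np

# Partial RoPE sketch (assumed pairing scheme): only the first `rot_dims`
# of each head dimension receive the rotary position embedding.
def partial_rope(x, pos, rot_dims=32, base=10000.0):
    # x: (T, d_head); pos: (T,) integer positions
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    ang = pos[:, None] * freqs[None, :]              # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((4, 64))
out = partial_rope(x, np.arange(4))
print(out.shape)                      # (4, 64)
print(np.allclose(out[:, 32:], 1.0))  # tail dims untouched -> True
```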
XSA
Cross-sequence attention enabled in the deepest layers.
parameters: {"xsa_last_n":2}
MLP3x
Uses a 3x MLP width expansion.
parameters: {"multiplier":3}
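A minimal illustration (shapes and activation assumed, not from the PR's code) of the 3x expansion: the MLP hidden width is 3*d_model rather than the conventional 4*d_model, trading capacity for a smaller artifact.

```python
import numpy as np

# Sketch of an MLP block with a 3x width expansion (assumed ReLU nonlinearity).
def mlp3x(x, W_in, W_out):
    h = np.maximum(x @ W_in, 0.0)   # (T, 3*d) hidden activations
    return h @ W_out                # project back to (T, d)

d = 64
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d, 3 * d))
W_out = rng.normal(size=(3 * d, d))
y = mlp3x(rng.normal(size=(8, d)), W_in, W_out)
print(y.shape)  # (8, 64)
```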
weight tying
Shares token embedding and LM head weights.
parameters: null
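A minimal illustration (assumed, not the PR's code) of weight tying: the LM head reuses the embedding matrix transposed, so the vocab-by-width matrix is stored only once in the artifact.

```python
import numpy as np

# Weight tying sketch: one matrix serves both token embedding and LM head.
vocab, d = 1000, 64
W_embed = np.random.default_rng(0).normal(size=(vocab, d))

def embed(token_ids):
    return W_embed[token_ids]       # tokens -> (T, d) vectors

def lm_head(h):
    return h @ W_embed.T            # hidden states -> (T, vocab) logits

h = embed(np.array([1, 2, 3]))
logits = lm_head(h)
print(logits.shape)  # (3, 1000)
```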
KV head count
Reduced KV head count relative to query heads.
parameters: {"query_heads":8,"kv_heads":4}
Initialization
OrthoInit
Orthogonal variance scaling for large linear weights.
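A sketch of orthogonal variance-scaled initialization (the QR construction is a standard approach and an assumption here, not taken from the PR's code): take Q from a QR decomposition of a Gaussian matrix, then scale by a gain factor.

```python
import numpy as np

# Orthogonal init sketch: QR of a Gaussian matrix gives orthonormal columns,
# then a gain factor sets the variance scale.
def ortho_init(rows, cols, gain=1.0, seed=0):
    a = np.random.default_rng(seed).normal(size=(rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))   # sign-correct columns for an unbiased draw
    return gain * q

W = ortho_init(64, 64)
print(np.allclose(W.T @ W, np.eye(64)))  # columns orthonormal -> True
```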
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"direct_weight_decay":true}
AdamW
weight_decay: null
momentum: null
other_params: {"fused":true,"used_for":"embeddings"}
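A hedged sketch of Muon's core step, with coefficients from public Muon implementations rather than this PR's code: the momentum matrix is approximately orthogonalized with a quintic Newton-Schulz iteration before being applied as the update direction.

```python
import numpy as np

# Newton-Schulz orthogonalization sketch (coefficients from public Muon
# implementations; not taken from this PR). Pushes all singular values of
# the input toward 1 without an explicit SVD.
def newton_schulz_orthogonalize(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).normal(size=(64, 64))
O = newton_schulz_orthogonalize(G)
s = np.linalg.svd(O, compute_uv=False)
print(s.min(), s.max())  # singular values should cluster near 1
```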
LR Schedule
warmdown
parameters: {"final_fraction":0.2,"budget_seconds":600}
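A sketch of the wallclock-aware warmdown schedule using the PR's parameters (`final_fraction: 0.2`, `budget_seconds: 600`); the exact decay shape is an assumption (linear here): hold the LR flat, then decay to zero over the final 20% of the time budget.

```python
# Warmdown LR schedule sketch: flat LR, then a linear decay to zero over the
# final `final_fraction` of a fixed wallclock budget (assumed linear shape).
def warmdown_lr(elapsed_s, base_lr=1.0, budget_seconds=600, final_fraction=0.2):
    warmdown_start = budget_seconds * (1 - final_fraction)   # 480s here
    if elapsed_s <= warmdown_start:
        return base_lr
    frac_left = (budget_seconds - elapsed_s) / (budget_seconds - warmdown_start)
    return base_lr * max(frac_left, 0.0)

print(warmdown_lr(100))  # 1.0 (flat phase)
print(warmdown_lr(540))  # 0.5 (halfway through warmdown)
print(warmdown_lr(600))  # 0.0 (budget exhausted)
```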
Quantization
late QAT
bits: null
scope: final 15% of training
mixed int6/int8
bits: 6
scope: attention/MLP layers
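An assumed fake-quantization sketch of the QAT mechanism (not the PR's code): in the forward pass, weights are rounded to an integer grid (int6 gives levels in [-32, 31]) but kept in float so gradients still flow; the integer codes are what gets stored in the artifact.

```python
import numpy as np

# Fake-quantization sketch for QAT: symmetric per-tensor scale, round to an
# int6/int8 grid, then dequantize so training continues in float.
def fake_quant(w, bits=6):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)        # dequantized float, int codes

w = np.linspace(-1, 1, 7)
wq, codes = fake_quant(w, bits=6)
print(codes)                        # integer codes in [-32, 31]
print(np.abs(w - wq).max() < 1/31)  # error below one quantization step -> True
```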
Compression
zlib
level: 9
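The final artifact is exported zlib-compressed at level 9 (maximum compression). A minimal round-trip sketch with the standard-library `zlib` module:

```python
import zlib

# Round trip at level 9: compression is lossless, so the artifact is
# reconstructed exactly on load.
raw = b"model weights " * 1000
packed = zlib.compress(raw, level=9)
restored = zlib.decompress(packed)
print(len(raw), len(packed))   # highly repetitive input compresses well
print(restored == raw)         # True
```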
Evaluation
sliding window eval
parameters: {"stride":96}
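A hedged sketch of sliding-window evaluation with stride 96 (window size is an assumption; the PR only reports the stride): the context window advances 96 tokens at a time, and each window scores only the tokens not already covered by the previous one, so every token is scored exactly once.

```python
# Sliding-window eval span sketch: returns (ctx_start, ctx_end, score_start),
# where tokens [score_start, ctx_end) are newly scored by this window and
# earlier tokens serve as context only. Window size 256 is assumed.
def sliding_window_spans(n_tokens, window=256, stride=96):
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(500)
print(spans[0])                                   # (0, 256, 0)
print(spans[-1])                                  # (288, 500, 448)
print(sum(e - s0 for _, e, s0 in spans) == 500)   # every token scored once
```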

Novel Contributions

  • Single-H100 pure neural architecture optimization under a 16MB artifact budget
  • 8-layer Transformer with GQA, partial RoPE, XSA, and 3x MLP expansion
  • Late QAT with mixed int6/int8 storage to reduce artifact size
  • Wallclock-aware warmdown learning-rate schedule for a fixed compute budget
  • Sliding-window validation evaluation with stride 96
  • Zlib-compressed export of the final artifact