PR #1617

open

[Non-Record] Single H100, 16 MB, < 1.21 bpb

by adityasasidhar
val_bpb
1.2192
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.92 MB

Training Techniques

Architecture
GQA
Grouped query attention with fewer KV heads for parameter efficiency.
parameters: {"query_heads":8,"kv_heads":4}
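A minimal numpy sketch of the GQA idea with the PR's head counts (8 query heads, 4 KV heads); this is an illustration, not the PR's code. Each KV head serves 2 query heads, so the KV projections are half the size of standard multi-head attention:

```python
import numpy as np

# Hypothetical GQA sketch: 8 query heads share 4 KV heads, so each KV head
# serves a group of 2 query heads and KV projections shrink by half.
def gqa_attention(q, k, v, query_heads=8, kv_heads=4):
    # q: (T, query_heads, d); k, v: (T, kv_heads, d)
    group = query_heads // kv_heads
    k = np.repeat(k, group, axis=1)          # expand KV heads to match queries
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)         # softmax over source positions
    return np.einsum("hts,shd->thd", w, v)   # (T, query_heads, d)

rng = np.random.default_rng(0)
T, d = 16, 8
out = gqa_attention(rng.normal(size=(T, 8, d)),
                    rng.normal(size=(T, 4, d)),
                    rng.normal(size=(T, 4, d)))
print(out.shape)  # (16, 8, 8)
```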
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":32}
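A hedged sketch of partial RoPE: rotate only the first `rot_dims` components of each head (32 here, per the PR's parameters) and pass the remaining dimensions through unrotated. The rotation-pair layout is an assumption; implementations differ on how dimensions are paired.

```python
import numpy as np

# Partial RoPE sketch (assumed pairing scheme): only the first `rot_dims`
# of each head dimension receive the rotary position embedding.
def partial_rope(x, pos, rot_dims=32, base=10000.0):
    # x: (T, d_head); pos: (T,) integer positions
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    ang = pos[:, None] * freqs[None, :]              # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((4, 64))
out = partial_rope(x, np.arange(4))
print(out.shape)                      # (4, 64)
print(np.allclose(out[:, 32:], 1.0))  # tail dims untouched -> True
```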
XSA
Cross-sequence attention enabled in the deepest layers.
parameters: {"xsa_last_n":2}
MLP3x
Uses a 3x MLP width expansion.
parameters: {"multiplier":3}
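A minimal illustration (shapes and activation assumed, not from the PR's code) of the 3x expansion: the MLP hidden width is 3*d_model rather than the conventional 4*d_model, trading capacity for a smaller artifact.

```python
import numpy as np

# Sketch of an MLP block with a 3x width expansion (assumed ReLU nonlinearity).
def mlp3x(x, W_in, W_out):
    h = np.maximum(x @ W_in, 0.0)   # (T, 3*d) hidden activations
    return h @ W_out                # project back to (T, d)

d = 64
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d, 3 * d))
W_out = rng.normal(size=(3 * d, d))
y = mlp3x(rng.normal(size=(8, d)), W_in, W_out)
print(y.shape)  # (8, 64)
```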
weight tying
Shares token embedding and LM head weights.
parameters: null
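A minimal illustration (assumed, not the PR's code) of weight tying: the LM head reuses the embedding matrix transposed, so the vocab-by-width matrix is stored only once in the artifact.

```python
import numpy as np

# Weight tying sketch: one matrix serves both token embedding and LM head.
vocab, d = 1000, 64
W_embed = np.random.default_rng(0).normal(size=(vocab, d))

def embed(token_ids):
    return W_embed[token_ids]       # tokens -> (T, d) vectors

def lm_head(h):
    return h @ W_embed.T            # hidden states -> (T, vocab) logits

h = embed(np.array([1, 2, 3]))
logits = lm_head(h)
print(logits.shape)  # (3, 1000)
```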
KV head count
Reduced KV head count relative to query heads.
parameters: {"query_heads":8,"kv_heads":4}
Initialization
OrthoInit
Orthogonal variance scaling for large linear weights.
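A sketch of orthogonal variance-scaled initialization (the QR construction is a standard approach and an assumption here, not taken from the PR's code): take Q from a QR decomposition of a Gaussian matrix, then scale by a gain factor.

```python
import numpy as np

# Orthogonal init sketch: QR of a Gaussian matrix gives orthonormal columns,
# then a gain factor sets the variance scale.
def ortho_init(rows, cols, gain=1.0, seed=0):
    a = np.random.default_rng(seed).normal(size=(rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))   # sign-correct columns for an unbiased draw
    return gain * q

W = ortho_init(64, 64)
print(np.allclose(W.T @ W, np.eye(64)))  # columns orthonormal -> True
```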
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"direct_weight_decay":true}
AdamW
weight_decay: null
momentum: null
other_params: {"fused":true,"used_for":"embeddings"}
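A hedged sketch of Muon's core step, with coefficients from public Muon implementations rather than this PR's code: the momentum matrix is approximately orthogonalized with a quintic Newton-Schulz iteration before being applied as the update direction.

```python
import numpy as np

# Newton-Schulz orthogonalization sketch (coefficients from public Muon
# implementations; not taken from this PR). Pushes all singular values of
# the input toward 1 without an explicit SVD.
def newton_schulz_orthogonalize(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).normal(size=(64, 64))
O = newton_schulz_orthogonalize(G)
s = np.linalg.svd(O, compute_uv=False)
print(s.min(), s.max())  # singular values should cluster near 1
```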
LR Schedule
warmdown
parameters: {"final_fraction":0.2,"budget_seconds":600}
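A sketch of the wallclock-aware warmdown schedule using the PR's parameters (`final_fraction: 0.2`, `budget_seconds: 600`); the exact decay shape is an assumption (linear here): hold the LR flat, then decay to zero over the final 20% of the time budget.

```python
# Warmdown LR schedule sketch: flat LR, then a linear decay to zero over the
# final `final_fraction` of a fixed wallclock budget (assumed linear shape).
def warmdown_lr(elapsed_s, base_lr=1.0, budget_seconds=600, final_fraction=0.2):
    warmdown_start = budget_seconds * (1 - final_fraction)   # 480s here
    if elapsed_s <= warmdown_start:
        return base_lr
    frac_left = (budget_seconds - elapsed_s) / (budget_seconds - warmdown_start)
    return base_lr * max(frac_left, 0.0)

print(warmdown_lr(100))  # 1.0 (flat phase)
print(warmdown_lr(540))  # 0.5 (halfway through warmdown)
print(warmdown_lr(600))  # 0.0 (budget exhausted)
```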
Quantization
late QAT
bits: null
scope: final 15% of training
mixed int6/int8
bits: 6
scope: attention/MLP layers
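An assumed fake-quantization sketch of the QAT mechanism (not the PR's code): in the forward pass, weights are rounded to an integer grid (int6 gives levels in [-32, 31]) but kept in float so gradients still flow; the integer codes are what gets stored in the artifact.

```python
import numpy as np

# Fake-quantization sketch for QAT: symmetric per-tensor scale, round to an
# int6/int8 grid, then dequantize so training continues in float.
def fake_quant(w, bits=6):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)        # dequantized float, int codes

w = np.linspace(-1, 1, 7)
wq, codes = fake_quant(w, bits=6)
print(codes)                        # integer codes in [-32, 31]
print(np.abs(w - wq).max() < 1/31)  # error below one quantization step -> True
```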
Compression
zlib
level: 9
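The final artifact is exported zlib-compressed at level 9 (maximum compression). A minimal round-trip sketch with the standard-library `zlib` module:

```python
import zlib

# Round trip at level 9: compression is lossless, so the artifact is
# reconstructed exactly on load.
raw = b"model weights " * 1000
packed = zlib.compress(raw, level=9)
restored = zlib.decompress(packed)
print(len(raw), len(packed))   # highly repetitive input compresses well
print(restored == raw)         # True
```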
Evaluation
sliding window eval
parameters: {"stride":96}
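A hedged sketch of sliding-window evaluation with stride 96 (window size is an assumption; the PR only reports the stride): the context window advances 96 tokens at a time, and each window scores only the tokens not already covered by the previous one, so every token is scored exactly once.

```python
# Sliding-window eval span sketch: returns (ctx_start, ctx_end, score_start),
# where tokens [score_start, ctx_end) are newly scored by this window and
# earlier tokens serve as context only. Window size 256 is assumed.
def sliding_window_spans(n_tokens, window=256, stride=96):
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(500)
print(spans[0])                                   # (0, 256, 0)
print(spans[-1])                                  # (288, 500, 448)
print(sum(e - s0 for _, e, s0 in spans) == 500)   # every token scored once
```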

Novel Contributions

  • Single-H100 pure neural architecture optimization under a 16MB artifact budget
  • 8-layer Transformer with GQA, partial RoPE, XSA, and 3x MLP expansion
  • Late QAT with mixed int6/int8 storage to reduce artifact size
  • Wallclock-aware warmdown learning-rate schedule for a fixed compute budget
  • Sliding-window validation evaluation with stride 96
  • Zlib-compressed export of the final artifact