PR #1559

open

Single H100, 10 min, 16 MB, < 1.24 bpb

by adityasasidhar
val_bpb
1.2498
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.2 MB

Training Techniques

Architecture
GQA
Uses grouped query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
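A minimal NumPy sketch of the grouped-query pattern described above. The head counts (8 query heads, 4 KV heads) come from the parameters; the projection shapes and everything else are illustrative, not the PR's actual implementation:

```python
import numpy as np

def gqa(x, wq, wk, wv, num_heads=8, num_kv_heads=4):
    """Grouped query attention sketch: each KV head is shared by
    num_heads // num_kv_heads query heads (here 8 // 4 = 2)."""
    T, D = x.shape
    hd = D // num_heads                    # per-head dimension
    group = num_heads // num_kv_heads      # query heads per KV head
    q = (x @ wq).reshape(T, num_heads, hd)
    k = (x @ wk).reshape(T, num_kv_heads, hd)
    v = (x @ wv).reshape(T, num_kv_heads, hd)
    # broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)
    out = np.einsum('hts,shd->thd', att, v)
    return out.reshape(T, D)
```

The KV projections are half the width of the query projection, which is where GQA's parameter and KV-cache savings come from.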
Partial RoPE
Applies rotary position embeddings only to the first part of each head.
parameters: {"dimensions":32}
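A sketch of the partial-RoPE idea: rotate only the first 32 channels of each head and pass the rest through untouched. The half/half pairing convention below is an assumption; the PR may pair channels differently:

```python
import numpy as np

def partial_rope(q, rope_dims=32, base=10000.0):
    """Apply rotary position embeddings to the first rope_dims
    channels of each head; remaining channels are unrotated."""
    T, H, hd = q.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = q[..., :half]
    x2 = q[..., half:rope_dims]
    rot1 = x1 * cos[:, None, :] - x2 * sin[:, None, :]
    rot2 = x1 * sin[:, None, :] + x2 * cos[:, None, :]
    return np.concatenate([rot1, rot2, q[..., rope_dims:]], axis=-1)
```

Position 0 gets a zero rotation angle, so it comes back unchanged, and channels past `rope_dims` are identical to the input by construction.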
XSA
Enables XSA on the final layers of the model.
parameters: {"layers":2}
MLP3x
Increases MLP expansion from 2x to 3x.
parameters: {"multiplier":3}
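For a standard two-matrix MLP (up-projection then down-projection, biases ignored), the 2x-to-3x change works out to a 50% increase in MLP parameters; a quick arithmetic check:

```python
def mlp_params(d_model, multiplier):
    """Parameters in a two-matrix MLP: up-projection d -> m*d
    followed by down-projection m*d -> d (biases ignored)."""
    hidden = multiplier * d_model
    return d_model * hidden + hidden * d_model

# moving from 2x to 3x expansion grows the MLP's parameters by 50%
assert mlp_params(512, 3) == 1.5 * mlp_params(512, 2)
```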
Initialization
OrthoInit
Orthogonally initializes large linear layers and scales projection weights by depth.
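A sketch of OrthoInit-style initialization via QR decomposition. The depth-scaling rule shown (dividing projection weights by sqrt(2 * depth)) is a common convention and an assumption here; the PR's exact scale factor is not stated:

```python
import numpy as np

def ortho_init(shape, depth, rng, proj_scale=True):
    """Orthogonal init via QR; projection weights additionally
    scaled by 1/sqrt(2*depth) (assumed depth-scaling rule)."""
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))    # fix QR sign ambiguity
    if proj_scale:
        q = q / np.sqrt(2 * depth)
    return q
```

Without the scale, columns are exactly orthonormal; with it, the Gram matrix shrinks uniformly so residual-branch contributions stay bounded as depth grows.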
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"muon_wd":0.02,"embed_lr":0.04,"matrix_lr":0.032,"scalar_lr":0.032}
LR Schedule
warmdown
parameters: {"warmdown_iters":1200,"warmdown_last_frac":0.2,"warmup_steps":20}
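A step-based sketch of the schedule implied by these parameters: linear warmup over 20 steps, a flat middle, then a linear warmdown over the last 1200 iterations that ends at 0.2 of peak rather than zero. Note the contributions list says the warmdown is actually driven by wallclock fraction, so the step-count trigger here is a simplification:

```python
def lr_mult(step, total_steps, warmup_steps=20,
            warmdown_iters=1200, warmdown_last_frac=0.2):
    """Trapezoidal LR multiplier: warmup -> flat -> warmdown,
    landing at warmdown_last_frac of peak on the final step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    frac = (step - start) / warmdown_iters   # 0 -> 1 across warmdown
    return 1.0 - (1.0 - warmdown_last_frac) * frac
```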
Quantization
mixed int6/int8
bits: null
scope: model weights
STE QAT
bits: 8
scope: selected CastedLinear weights
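A minimal sketch of straight-through-estimator quantization-aware training for the int8 case: the forward pass rounds weights to the int8 grid, while the backward pass treats the rounding as identity so gradients flow through unchanged. The symmetric per-tensor scale is an assumption; the PR's CastedLinear details are not shown here:

```python
import numpy as np

def fake_quant_int8(w, grad=None):
    """Symmetric int8 fake-quantization with an STE backward.
    Forward: round to the int8 grid. Backward: d(wq)/dw ~= 1,
    so the incoming gradient passes through unchanged."""
    scale = np.abs(w).max() / 127.0
    wq = np.clip(np.round(w / scale), -127, 127) * scale
    if grad is None:
        return wq
    return wq, grad   # STE: gradient of round() treated as identity
```

Applying this only late in training (as the PR does) lets the model first converge in full precision and then adapt to the quantization grid.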
Evaluation
sliding window eval
parameters: {"stride":128,"eval_batch_seqs":32}
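A sketch of the sliding-window scoring loop: each token is scored with up to a full context of left history, advancing 128 tokens per window and counting only the newly covered positions. `nll_fn` is a hypothetical stand-in for the PR's `forward_logits` path, and `eval_batch_seqs` (a batching detail) is omitted:

```python
import numpy as np

def sliding_window_nll(nll_fn, tokens, ctx_len=2048, stride=128):
    """Mean per-token NLL under sliding-window evaluation.
    nll_fn(window) -> per-token negative log-likelihoods (assumed).
    Only the last stride positions of each window are scored; the
    earlier positions serve purely as context."""
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        lo = max(0, start + stride - ctx_len)
        window = tokens[lo:start + stride]
        keep = min(stride, len(tokens) - start)  # tokens actually scored
        total += nll_fn(window)[-keep:].sum()
        count += keep
    return total / count
```

Compared with scoring disjoint 2048-token chunks, this gives every token (after the first window) real left context, which is why it improves measured bpb.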
Compression
zlib
level: null
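A toy illustration of the final packaging step: serialize the exported weights and zlib-compress them to shrink the artifact. The payload and compression level are placeholders (the PR does not state its level); only the round-trip pattern is the point:

```python
import pickle
import zlib

import numpy as np

# hypothetical payload standing in for the exported weight blob
weights = np.zeros(1000, dtype=np.int8)   # toy int8-exported tensor
blob = pickle.dumps(weights)
packed = zlib.compress(blob, level=9)     # level is an assumption
restored = pickle.loads(zlib.decompress(packed))

assert len(packed) < len(blob)            # artifact shrinks
assert np.array_equal(restored, weights)  # lossless round trip
```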
Regularization
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.02}

Novel Contributions

  • Single-H100 run with sliding-window validation to improve measured bpb
  • Mixed-precision int6/int8 export format to fit under the 16 MB limit
  • STE QAT applied late in training for selected weights
  • Partial RoPE with rope_dims=32
  • XSA enabled only on the final layers
  • OrthoInit-style initialization with depth-scaled projection weights
  • Warmdown driven by wallclock fraction
  • Sliding-window evaluation via reusable forward_logits path