PR #1559

open

Single H100, 10 min, 16 MB, < 1.24 bpb

by adityasasidhar
val_bpb
1.2498
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.2 MB

Training Techniques

Architecture
GQA
Uses grouped query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
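A minimal NumPy sketch of the grouped-query pattern described above. The head counts (8 query heads, 4 KV heads) come from the parameters; the projection shapes and everything else are illustrative, not the PR's actual implementation:

```python
import numpy as np

def gqa(x, wq, wk, wv, num_heads=8, num_kv_heads=4):
    """Grouped query attention sketch: each KV head is shared by
    num_heads // num_kv_heads query heads (here 8 // 4 = 2)."""
    T, D = x.shape
    hd = D // num_heads                    # per-head dimension
    group = num_heads // num_kv_heads      # query heads per KV head
    q = (x @ wq).reshape(T, num_heads, hd)
    k = (x @ wk).reshape(T, num_kv_heads, hd)
    v = (x @ wv).reshape(T, num_kv_heads, hd)
    # broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)
    out = np.einsum('hts,shd->thd', att, v)
    return out.reshape(T, D)
```

The KV projections are half the width of the query projection, which is where GQA's parameter and KV-cache savings come from.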
Partial RoPE
Applies rotary position embeddings only to the first part of each head.
parameters: {"dimensions":32}
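A sketch of the partial-RoPE idea: rotate only the first 32 channels of each head and pass the rest through untouched. The half/half pairing convention below is an assumption; the PR may pair channels differently:

```python
import numpy as np

def partial_rope(q, rope_dims=32, base=10000.0):
    """Apply rotary position embeddings to the first rope_dims
    channels of each head; remaining channels are unrotated."""
    T, H, hd = q.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = q[..., :half]
    x2 = q[..., half:rope_dims]
    rot1 = x1 * cos[:, None, :] - x2 * sin[:, None, :]
    rot2 = x1 * sin[:, None, :] + x2 * cos[:, None, :]
    return np.concatenate([rot1, rot2, q[..., rope_dims:]], axis=-1)
```

Position 0 gets a zero rotation angle, so it comes back unchanged, and channels past `rope_dims` are identical to the input by construction.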
XSA
Enables XSA on the final layers of the model.
parameters: {"layers":2}
MLP3x
Increases MLP expansion from 2x to 3x.
parameters: {"multiplier":3}
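For a standard two-matrix MLP (up-projection then down-projection, biases ignored), the 2x-to-3x change works out to a 50% increase in MLP parameters; a quick arithmetic check:

```python
def mlp_params(d_model, multiplier):
    """Parameters in a two-matrix MLP: up-projection d -> m*d
    followed by down-projection m*d -> d (biases ignored)."""
    hidden = multiplier * d_model
    return d_model * hidden + hidden * d_model

# moving from 2x to 3x expansion grows the MLP's parameters by 50%
assert mlp_params(512, 3) == 1.5 * mlp_params(512, 2)
```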
Initialization
OrthoInit
Orthogonally initializes large linear layers and scales projection weights by depth.
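A sketch of OrthoInit-style initialization via QR decomposition. The depth-scaling rule shown (dividing projection weights by sqrt(2 * depth)) is a common convention and an assumption here; the PR's exact scale factor is not stated:

```python
import numpy as np

def ortho_init(shape, depth, rng, proj_scale=True):
    """Orthogonal init via QR; projection weights additionally
    scaled by 1/sqrt(2*depth) (assumed depth-scaling rule)."""
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))    # fix QR sign ambiguity
    if proj_scale:
        q = q / np.sqrt(2 * depth)
    return q
```

Without the scale, columns are exactly orthonormal; with it, the Gram matrix shrinks uniformly so residual-branch contributions stay bounded as depth grows.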
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"muon_wd":0.02,"embed_lr":0.04,"matrix_lr":0.032,"scalar_lr":0.032}
LR Schedule
warmdown
parameters: {"warmdown_iters":1200,"warmdown_last_frac":0.2,"warmup_steps":20}
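A step-based sketch of the schedule implied by these parameters: linear warmup over 20 steps, a flat middle, then a linear warmdown over the last 1200 iterations that ends at 0.2 of peak rather than zero. Note the contributions list says the warmdown is actually driven by wallclock fraction, so the step-count trigger here is a simplification:

```python
def lr_mult(step, total_steps, warmup_steps=20,
            warmdown_iters=1200, warmdown_last_frac=0.2):
    """Trapezoidal LR multiplier: warmup -> flat -> warmdown,
    landing at warmdown_last_frac of peak on the final step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    frac = (step - start) / warmdown_iters   # 0 -> 1 across warmdown
    return 1.0 - (1.0 - warmdown_last_frac) * frac
```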
Quantization
mixed int6/int8
bits: null
scope: model weights
STE QAT
bits: 8
scope: selected CastedLinear weights
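A minimal sketch of straight-through-estimator quantization-aware training for the int8 case: the forward pass rounds weights to the int8 grid, while the backward pass treats the rounding as identity so gradients flow through unchanged. The symmetric per-tensor scale is an assumption; the PR's CastedLinear details are not shown here:

```python
import numpy as np

def fake_quant_int8(w, grad=None):
    """Symmetric int8 fake-quantization with an STE backward.
    Forward: round to the int8 grid. Backward: d(wq)/dw ~= 1,
    so the incoming gradient passes through unchanged."""
    scale = np.abs(w).max() / 127.0
    wq = np.clip(np.round(w / scale), -127, 127) * scale
    if grad is None:
        return wq
    return wq, grad   # STE: gradient of round() treated as identity
```

Applying this only late in training (as the PR does) lets the model first converge in full precision and then adapt to the quantization grid.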
Evaluation
sliding window eval
parameters: {"stride":128,"eval_batch_seqs":32}
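A sketch of the sliding-window scoring loop: each token is scored with up to a full context of left history, advancing 128 tokens per window and counting only the newly covered positions. `nll_fn` is a hypothetical stand-in for the PR's `forward_logits` path, and `eval_batch_seqs` (a batching detail) is omitted:

```python
import numpy as np

def sliding_window_nll(nll_fn, tokens, ctx_len=2048, stride=128):
    """Mean per-token NLL under sliding-window evaluation.
    nll_fn(window) -> per-token negative log-likelihoods (assumed).
    Only the last stride positions of each window are scored; the
    earlier positions serve purely as context."""
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        lo = max(0, start + stride - ctx_len)
        window = tokens[lo:start + stride]
        keep = min(stride, len(tokens) - start)  # tokens actually scored
        total += nll_fn(window)[-keep:].sum()
        count += keep
    return total / count
```

Compared with scoring disjoint 2048-token chunks, this gives every token (after the first window) real left context, which is why it improves measured bpb.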
Compression
zlib
level: null
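A toy illustration of the final packaging step: serialize the exported weights and zlib-compress them to shrink the artifact. The payload and compression level are placeholders (the PR does not state its level); only the round-trip pattern is the point:

```python
import pickle
import zlib

import numpy as np

# hypothetical payload standing in for the exported weight blob
weights = np.zeros(1000, dtype=np.int8)   # toy int8-exported tensor
blob = pickle.dumps(weights)
packed = zlib.compress(blob, level=9)     # level is an assumption
restored = pickle.loads(zlib.decompress(packed))

assert len(packed) < len(blob)            # artifact shrinks
assert np.array_equal(restored, weights)  # lossless round trip
```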
Regularization
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.02}

Novel Contributions

  • Single-H100 run with sliding-window validation to improve measured bpb
  • Mixed-precision int6/int8 export format to fit under the 16 MB limit
  • STE QAT applied late in training for selected weights
  • Partial RoPE with rope_dims=32
  • XSA enabled only on the final layers
  • OrthoInit-style initialization with depth-scaled projection weights
  • Warmdown driven by wallclock fraction
  • Sliding-window evaluation via reusable forward_logits path