PR #560

Status: open

Non-record: 1x RTX PRO 6000 Blackwell 10L Int5-MLP (1.1935 BPB)

by Rohan · 5 commits
val_bpb: 1.1935
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,691,796 bytes

Training Techniques

Quantization
  • mixed int5/int6
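The PR reports mixed int5/int6 but not the exact scheme (bits-per-tensor and scope are unreported). As a rough, hypothetical sketch, symmetric per-tensor quantization to a signed 5-bit range could look like the following; the function names and rounding policy are illustrative only:

```python
def quantize_symmetric(weights, bits=5):
    """Quantize a list of floats to signed `bits`-bit integers (sketch only).

    Signed range for int5 is [-16, 15]; the scale maps the largest
    absolute weight onto qmax.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Guard against an all-zero tensor, where the scale would be 0.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [x * scale for x in q]
```

Round-trip error is bounded by half a quantization step, which is what keeps a 5-bit code usable for MLP weights while the final zstd pass squeezes the packed bytes further.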
Architecture
  • SmearGate: incorporates a SmearGate component in the model architecture
  • BigramHash: uses BigramHash with 10,240 buckets (buckets: 10240)
  • MLP3x: uses 3x MLP layers (layers: 10)
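The only reported BigramHash parameter is the bucket count (10240). A hypothetical sketch of the idea — hashing each (previous, current) token pair into a fixed number of buckets, e.g. to index an auxiliary embedding table — could look like this; the mixing constants are invented:

```python
def bigram_bucket(prev_token: int, token: int, n_buckets: int = 10240) -> int:
    """Hash a (prev, current) token pair into one of n_buckets.

    The multipliers are arbitrary mixing constants (sketch only); any
    reasonable integer hash would serve the same purpose.
    """
    h = (prev_token * 1000003 + token) * 2654435761 % (2 ** 32)
    return h % n_buckets

def bigram_features(tokens, n_buckets=10240):
    """Bucket id per position, padding position 0 with token 0."""
    return [bigram_bucket(tokens[i - 1] if i else 0, tokens[i], n_buckets)
            for i in range(len(tokens))]
```

Identical bigrams always map to the same bucket, so the table learns a cheap per-bigram correction on top of the transformer's own predictions.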
Weight Averaging
  • SWA (type: late SWA)
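"Late SWA" means a running average of weight snapshots collected only near the end of training. A dependency-free sketch of the averaging itself (the snapshot schedule, i.e. what counts as "late", is not specified in the PR):

```python
class SWA:
    """Incremental running mean of weight snapshots (stochastic weight averaging).

    Weights are modelled as flat lists of floats so the logic runs without
    PyTorch; in practice each snapshot would be a model state_dict.
    """
    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, weights):
        # Incremental mean: avg += (w - avg) / n, applied elementwise.
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

The averaged weights are what gets quantized and shipped in the artifact, which typically lands at a flatter, better-generalizing point than the final raw checkpoint.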
Compression
  • zstd
Evaluation
  • sliding window eval (stride: 64)
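Sliding-window evaluation with a small stride scores each token with near-maximal left context: every window conditions on the full context length but only its last `stride` positions are scored. The PR reports stride=64 only; the window-planning sketch below (function name and edge handling are assumptions) shows the bookkeeping:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Return (start, end, score_from) spans for sliding-window evaluation.

    Each span covers tokens [start, end); only positions [score_from, end)
    contribute to the loss, so every token is scored exactly once.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        # First window scores everything; later windows score their tail.
        score_from = 0 if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start = end - (window - stride)
    return spans
```

A smaller stride gives a more favorable (and slower) BPB estimate, which is why the leaderboard records it as an evaluation parameter.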
Other
  • Portable AMP dtype selection: bf16 on newer CUDA GPUs, with fp16 fallback on older GPUs
  • SDPA backend probing with manual KV-expansion fallback when native enable_gqa=True support is unavailable
  • Optional LOAD_MODEL_PATH restore before torch.compile() to support eval-only reloads
  • Single-GPU runtime tuning via environment variables (train_batch_tokens: 131072, max_wallclock_seconds: 2700, eval_stride: 64, eval_batch_seqs: 64)
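The portable AMP dtype selection listed above can be factored into a small pure function. Here `is_bf16_supported` stands in for `torch.cuda.is_bf16_supported()` so the selection logic can run and be tested without a GPU; the probe-and-fallback shape is a sketch of the approach the PR describes:

```python
def select_amp_dtype(is_bf16_supported) -> str:
    """Pick the autocast dtype for mixed-precision training.

    Newer CUDA GPUs (Ampere and later) support bf16, which needs no loss
    scaling; older GPUs fall back to fp16. `is_bf16_supported` is a
    zero-argument callable standing in for torch.cuda.is_bf16_supported().
    """
    try:
        return "bfloat16" if is_bf16_supported() else "float16"
    except Exception:
        # Capability probe failed entirely: take the conservative fp16 path.
        return "float16"
```

In the training script the result would feed `torch.autocast(device_type="cuda", dtype=...)`, with a `GradScaler` enabled only on the fp16 path, since bf16 does not require loss scaling.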
Sequence Length
  • train_length: null, eval_length: null

Novel Contributions

  • Ported the merged 10L Int5MLP MuonWD04 SWA50 recipe to a single RTX PRO 6000 Blackwell GPU
  • Implemented portable AMP dtype selection with bf16 on newer GPUs and fp16 fallback on older GPUs
  • Added SDPA backend probing with a manual KV-expansion fallback for PyTorch builds without native enable_gqa=True support
  • Enabled optional model restore before torch.compile() for eval-only reloads
  • Tuned single-GPU runtime with smaller batch size, longer wallclock, and controllable sliding-window evaluation
  • Maintained artifact size under 16MB with mixed int5/int6 quantization and zstd compression
  • Preserved most of the original architecture including 10 layers, 3x MLP, SmearGate, and BigramHash(10240)
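The SDPA fallback described above can be sketched without PyTorch. `expand_kv_heads` mirrors what `Tensor.repeat_interleave` would do along the head dimension, and `sdpa` stands in for `torch.nn.functional.scaled_dot_product_attention`; heads are modelled as plain list entries so the probing logic itself can be exercised:

```python
def expand_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times so the KV head count matches the
    query head count (the manual GQA expansion)."""
    return [head for head in kv_heads for _ in range(n_rep)]

def sdpa_with_gqa(sdpa, q_heads, kv_heads):
    """Probe for native GQA support, falling back to manual KV expansion.

    Older PyTorch builds reject the enable_gqa keyword with a TypeError,
    which is the signal to expand K/V by hand and call SDPA normally.
    """
    try:
        return sdpa(q_heads, kv_heads, enable_gqa=True)
    except TypeError:
        n_rep = len(q_heads) // len(kv_heads)
        return sdpa(q_heads, expand_kv_heads(kv_heads, n_rep))
```

Probing once at startup and caching the result avoids paying the exception cost on every forward pass; the manual path costs extra memory for the duplicated K/V but keeps the run portable across PyTorch versions.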