PR #560

Status: open

Non-record: 1x RTX PRO 6000 Blackwell 10L Int5-MLP (1.1935 BPB)

by Rohan · 5 commits
val_bpb: 1.1935
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,691,796 bytes

Training Techniques

Quantization
  • mixed int5/int6
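The PR reports mixed int5/int6 but not the exact scheme (bits-per-tensor and scope are unreported). As a rough, hypothetical sketch, symmetric per-tensor quantization to a signed 5-bit range could look like the following; the function names and rounding policy are illustrative only:

```python
def quantize_symmetric(weights, bits=5):
    """Quantize a list of floats to signed `bits`-bit integers (sketch only).

    Signed range for int5 is [-16, 15]; the scale maps the largest
    absolute weight onto qmax.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Guard against an all-zero tensor, where the scale would be 0.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [x * scale for x in q]
```

Round-trip error is bounded by half a quantization step, which is what keeps a 5-bit code usable for MLP weights while the final zstd pass squeezes the packed bytes further.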
Architecture
  • SmearGate: incorporates a SmearGate component in the model architecture
  • BigramHash: uses BigramHash with 10,240 buckets (buckets: 10240)
  • MLP3x: uses 3x MLP layers (layers: 10)
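The only reported BigramHash parameter is the bucket count (10240). A hypothetical sketch of the idea — hashing each (previous, current) token pair into a fixed number of buckets, e.g. to index an auxiliary embedding table — could look like this; the mixing constants are invented:

```python
def bigram_bucket(prev_token: int, token: int, n_buckets: int = 10240) -> int:
    """Hash a (prev, current) token pair into one of n_buckets.

    The multipliers are arbitrary mixing constants (sketch only); any
    reasonable integer hash would serve the same purpose.
    """
    h = (prev_token * 1000003 + token) * 2654435761 % (2 ** 32)
    return h % n_buckets

def bigram_features(tokens, n_buckets=10240):
    """Bucket id per position, padding position 0 with token 0."""
    return [bigram_bucket(tokens[i - 1] if i else 0, tokens[i], n_buckets)
            for i in range(len(tokens))]
```

Identical bigrams always map to the same bucket, so the table learns a cheap per-bigram correction on top of the transformer's own predictions.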
Weight Averaging
  • SWA (type: late SWA)
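"Late SWA" means a running average of weight snapshots collected only near the end of training. A dependency-free sketch of the averaging itself (the snapshot schedule, i.e. what counts as "late", is not specified in the PR):

```python
class SWA:
    """Incremental running mean of weight snapshots (stochastic weight averaging).

    Weights are modelled as flat lists of floats so the logic runs without
    PyTorch; in practice each snapshot would be a model state_dict.
    """
    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, weights):
        # Incremental mean: avg += (w - avg) / n, applied elementwise.
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

The averaged weights are what gets quantized and shipped in the artifact, which typically lands at a flatter, better-generalizing point than the final raw checkpoint.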
Compression
  • zstd
Evaluation
  • sliding window eval (stride: 64)
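Sliding-window evaluation with a small stride scores each token with near-maximal left context: every window conditions on the full context length but only its last `stride` positions are scored. The PR reports stride=64 only; the window-planning sketch below (function name and edge handling are assumptions) shows the bookkeeping:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Return (start, end, score_from) spans for sliding-window evaluation.

    Each span covers tokens [start, end); only positions [score_from, end)
    contribute to the loss, so every token is scored exactly once.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        # First window scores everything; later windows score their tail.
        score_from = 0 if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start = end - (window - stride)
    return spans
```

A smaller stride gives a more favorable (and slower) BPB estimate, which is why the leaderboard records it as an evaluation parameter.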
Other
  • Portable AMP dtype selection: bf16 on newer CUDA GPUs, with fp16 fallback on older GPUs
  • SDPA backend probing with manual KV-expansion fallback when native enable_gqa=True support is unavailable
  • Optional LOAD_MODEL_PATH restore before torch.compile() to support eval-only reloads
  • Single-GPU runtime tuning via environment variables (train_batch_tokens: 131072, max_wallclock_seconds: 2700, eval_stride: 64, eval_batch_seqs: 64)
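The portable AMP dtype selection listed above can be factored into a small pure function. Here `is_bf16_supported` stands in for `torch.cuda.is_bf16_supported()` so the selection logic can run and be tested without a GPU; the probe-and-fallback shape is a sketch of the approach the PR describes:

```python
def select_amp_dtype(is_bf16_supported) -> str:
    """Pick the autocast dtype for mixed-precision training.

    Newer CUDA GPUs (Ampere and later) support bf16, which needs no loss
    scaling; older GPUs fall back to fp16. `is_bf16_supported` is a
    zero-argument callable standing in for torch.cuda.is_bf16_supported().
    """
    try:
        return "bfloat16" if is_bf16_supported() else "float16"
    except Exception:
        # Capability probe failed entirely: take the conservative fp16 path.
        return "float16"
```

In the training script the result would feed `torch.autocast(device_type="cuda", dtype=...)`, with a `GradScaler` enabled only on the fp16 path, since bf16 does not require loss scaling.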
Sequence Length
  • train_length: null, eval_length: null

Novel Contributions

  • Ported the merged 10L Int5MLP MuonWD04 SWA50 recipe to a single RTX PRO 6000 Blackwell GPU
  • Implemented portable AMP dtype selection with bf16 on newer GPUs and fp16 fallback on older GPUs
  • Added SDPA backend probing with a manual KV-expansion fallback for PyTorch builds without native enable_gqa=True support
  • Enabled optional model restore before torch.compile() for eval-only reloads
  • Tuned single-GPU runtime with smaller batch size, longer wallclock, and controllable sliding-window evaluation
  • Maintained artifact size under 16MB with mixed int5/int6 quantization and zstd compression
  • Preserved most of the original architecture including 10 layers, 3x MLP, SmearGate, and BigramHash(10240)
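The SDPA fallback described above can be sketched without PyTorch. `expand_kv_heads` mirrors what `Tensor.repeat_interleave` would do along the head dimension, and `sdpa` stands in for `torch.nn.functional.scaled_dot_product_attention`; heads are modelled as plain list entries so the probing logic itself can be exercised:

```python
def expand_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times so the KV head count matches the
    query head count (the manual GQA expansion)."""
    return [head for head in kv_heads for _ in range(n_rep)]

def sdpa_with_gqa(sdpa, q_heads, kv_heads):
    """Probe for native GQA support, falling back to manual KV expansion.

    Older PyTorch builds reject the enable_gqa keyword with a TypeError,
    which is the signal to expand K/V by hand and call SDPA normally.
    """
    try:
        return sdpa(q_heads, kv_heads, enable_gqa=True)
    except TypeError:
        n_rep = len(q_heads) // len(kv_heads)
        return sdpa(q_heads, expand_kv_heads(kv_heads, n_rep))
```

Probing once at startup and caching the result avoids paying the exception cost on every forward pass; the manual path costs extra memory for the duplicated K/V but keeps the run portable across PyTorch versions.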