PR #793

Status: open

Blackwell local nonrecord

by pall23-mech
val_bpb: 1.2500
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,794,840 bytes

Training Techniques

Architecture
tied embeddings
Input embedding and output unembedding matrices share weights.
parameters: null
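A minimal sketch of weight tying, assuming the usual scheme in which the input embedding table doubles (transposed) as the output projection; the vocabulary and model dimensions below are placeholders, not values from this run:

```python
import numpy as np

# Hypothetical sizes for illustration; the PR does not state vocab/model dims.
vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)

# One shared matrix serves as both the embedding table and, transposed,
# the output projection -- this is what "tied embeddings" refers to.
W_emb = rng.standard_normal((vocab_size, d_model))

def embed(token_ids):
    return W_emb[token_ids]        # lookup: (seq,) -> (seq, d_model)

def logits(hidden):
    return hidden @ W_emb.T        # unembed with the same matrix

h = embed(np.array([1, 2, 3]))
out = logits(h)
assert out.shape == (3, vocab_size)
```

Tying removes the separate unembedding matrix, which matters for a size-capped artifact like this one.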
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
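The grouped-query layout above (8 query heads sharing 4 KV heads) can be sketched as follows; the head counts come from the submission's parameters, while head_dim and sequence length are illustrative:

```python
import numpy as np

# Head counts from the submission: 8 attention (query) heads, 4 KV heads.
n_q_heads, n_kv_heads, head_dim, seq = 8, 4, 16, 5
group = n_q_heads // n_kv_heads        # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads reads the same K/V.
k_rep = np.repeat(k, group, axis=0)    # (8, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention over the expanded K/V.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep                  # (8, seq, head_dim)
```

Halving the KV heads halves K/V projection weights and KV-cache size at inference, at a small quality cost.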
BigramHash
Uses a BigramHash feature path.
parameters: null
U-Net-style skip structure
Includes skip connections in a U-Net-like arrangement.
parameters: null
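One common way to wire U-Net-style skips over a stack of layers is to add each first-half output to the input of its mirrored second-half layer; the block function below is a toy stand-in for a transformer layer, and this wiring is an assumption about what the submission does:

```python
# Toy U-Net-style wiring over 10 blocks (depth from the PR).
n_layers = 10
half = n_layers // 2

def block(x, i):
    return x + i       # toy stand-in for a real transformer block

skips = []
x = 0
for i in range(half):                  # encoder half: save activations
    x = block(x, i)
    skips.append(x)
for i in range(half, n_layers):        # decoder half: consume in reverse
    x = x + skips.pop()                # U-Net skip: layer i pairs with its mirror
    x = block(x, i)
```

The last-in-first-out pairing is what makes the arrangement "U-Net-like": layer 4's output feeds layer 5, layer 0's output feeds layer 9.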
Transformer depth
Compact transformer with 10 layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"scope":"embedding and scalar parameters"}
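The Muon/Adam split above can be sketched as a routing rule: embedding and scalar parameters go to Adam, remaining 2-D matrices to Muon. The parameter names and shapes below are hypothetical, not read from this checkpoint:

```python
# Hypothetical parameter names/shapes; the PR only states that Adam covers
# "embedding and scalar parameters" while Muon handles the rest.
params = {
    "tok_emb.weight":    (50304, 768),  # embedding        -> Adam
    "blocks.0.attn.qkv": (2304, 768),   # 2-D matrix       -> Muon
    "blocks.0.mlp.fc":   (3072, 768),   # 2-D matrix       -> Muon
    "logit_scale":       (),            # scalar           -> Adam
}

def route(name, shape):
    """Assign each parameter to an optimizer group by name and rank."""
    if "emb" in name or len(shape) < 2:
        return "adam"
    return "muon"

groups = {"muon": [], "adam": []}
for name, shape in params.items():
    groups[route(name, shape)].append(name)
```

Muon's orthogonalized update is defined for matrices, which is why non-matrix parameters are typically handed to Adam.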
Evaluation
sliding window eval
parameters: null
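Since no parameters are given, here is a generic sliding-window evaluation sketch: each token is scored with a fixed left context, and total negative log-likelihood is converted to bits per byte. The window size and the uniform toy model are assumptions, and the NLL-to-BPB conversion below assumes one token per byte:

```python
import math

def sliding_window_nll(tokens, window, nll_fn):
    """Score each token given at most `window` tokens of left context."""
    total = 0.0
    for i in range(len(tokens)):
        ctx = tokens[max(0, i - window):i]
        total += nll_fn(ctx, tokens[i])
    return total

# Toy model: uniform over 256 symbols -> NLL = ln(256) nats per token.
n_tokens = 100
nll = sliding_window_nll(list(range(n_tokens)), window=8,
                         nll_fn=lambda ctx, t: math.log(256))
bpb = nll / math.log(2) / n_tokens     # nats -> bits, then per byte
```

A uniform byte model scores exactly 8.0 BPB, which gives a sanity anchor for the 1.25 figure reported above.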
Quantization
int6
bits: 6
scope: all
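A symmetric per-tensor scheme is one plausible reading of "int6, scope: all"; the actual scaling and packing used by the submission are not stated, so the code below is a sketch under that assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: 6 bits -> signed range [-32, 31]."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)

# Round-trip error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

In a real artifact the 6-bit codes would also be bit-packed (four values per three bytes) rather than stored as int8, which is where the size savings come from.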
Compression
zstd
level: null
Other
other
Light pruning and repacking of the checkpoint to fit under the 16,000,000 byte size cap.
parameters: {"cap_bytes":16000000}
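The compress-then-check flow behind the cap can be sketched as below; zstd is not in the Python standard library, so zlib stands in here purely for illustration, and the payload is a toy stand-in for the checkpoint:

```python
import zlib

# Byte cap from the submission's parameters.
CAP_BYTES = 16_000_000

def pack(raw: bytes) -> bytes:
    """Compress the serialized checkpoint (zlib standing in for zstd)."""
    return zlib.compress(raw, level=9)

def fits(raw: bytes) -> bool:
    """True if the packed artifact comes in under the size cap."""
    return len(pack(raw)) <= CAP_BYTES

payload = b"\x00" * 1_000_000          # toy checkpoint stand-in
assert fits(payload)
```

When the packed size lands just over the cap, as described above, the remaining lever is shrinking the raw payload (pruning) before repacking.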

Novel Contributions

  • Local constrained-hardware run on an 8 GB Blackwell-class GPU
  • Use of train_merged_gpt_flagged.py for a non-record submission
  • Initial packed artifact was slightly over the 16,000,000 byte cap
  • Light pruning and repacking to produce an under-cap artifact
  • Reported compressed-model quality around 1.21 BPB before pruning and about 1.25 BPB after pruning