PR #793

Status: open

Blackwell local nonrecord

by pall23-mech
val_bpb: 1.2500
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,794,840 bytes

Training Techniques

Architecture
tied embeddings
Input embedding and output unembedding matrices share weights.
parameters: null
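A minimal sketch of weight tying, assuming the usual scheme in which the input embedding table doubles (transposed) as the output projection; the vocabulary and model dimensions below are placeholders, not values from this run:

```python
import numpy as np

# Hypothetical sizes for illustration; the PR does not state vocab/model dims.
vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)

# One shared matrix serves as both the embedding table and, transposed,
# the output projection -- this is what "tied embeddings" refers to.
W_emb = rng.standard_normal((vocab_size, d_model))

def embed(token_ids):
    return W_emb[token_ids]        # lookup: (seq,) -> (seq, d_model)

def logits(hidden):
    return hidden @ W_emb.T        # unembed with the same matrix

h = embed(np.array([1, 2, 3]))
out = logits(h)
assert out.shape == (3, vocab_size)
```

Tying removes the separate unembedding matrix, which matters for a size-capped artifact like this one.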
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
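The grouped-query layout above (8 query heads sharing 4 KV heads) can be sketched as follows; the head counts come from the submission's parameters, while head_dim and sequence length are illustrative:

```python
import numpy as np

# Head counts from the submission: 8 attention (query) heads, 4 KV heads.
n_q_heads, n_kv_heads, head_dim, seq = 8, 4, 16, 5
group = n_q_heads // n_kv_heads        # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads reads the same K/V.
k_rep = np.repeat(k, group, axis=0)    # (8, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention over the expanded K/V.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep                  # (8, seq, head_dim)
```

Halving the KV heads halves K/V projection weights and KV-cache size at inference, at a small quality cost.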
BigramHash
Uses a BigramHash feature path.
parameters: null
U-Net-style skip structure
Includes skip connections in a U-Net-like arrangement.
parameters: null
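One common way to wire U-Net-style skips over a stack of layers is to add each first-half output to the input of its mirrored second-half layer; the block function below is a toy stand-in for a transformer layer, and this wiring is an assumption about what the submission does:

```python
# Toy U-Net-style wiring over 10 blocks (depth from the PR).
n_layers = 10
half = n_layers // 2

def block(x, i):
    return x + i       # toy stand-in for a real transformer block

skips = []
x = 0
for i in range(half):                  # encoder half: save activations
    x = block(x, i)
    skips.append(x)
for i in range(half, n_layers):        # decoder half: consume in reverse
    x = x + skips.pop()                # U-Net skip: layer i pairs with its mirror
    x = block(x, i)
```

The last-in-first-out pairing is what makes the arrangement "U-Net-like": layer 4's output feeds layer 5, layer 0's output feeds layer 9.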
Transformer depth
Compact transformer with 10 layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"scope":"embedding and scalar parameters"}
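The Muon/Adam split above can be sketched as a routing rule: embedding and scalar parameters go to Adam, remaining 2-D matrices to Muon. The parameter names and shapes below are hypothetical, not read from this checkpoint:

```python
# Hypothetical parameter names/shapes; the PR only states that Adam covers
# "embedding and scalar parameters" while Muon handles the rest.
params = {
    "tok_emb.weight":    (50304, 768),  # embedding        -> Adam
    "blocks.0.attn.qkv": (2304, 768),   # 2-D matrix       -> Muon
    "blocks.0.mlp.fc":   (3072, 768),   # 2-D matrix       -> Muon
    "logit_scale":       (),            # scalar           -> Adam
}

def route(name, shape):
    """Assign each parameter to an optimizer group by name and rank."""
    if "emb" in name or len(shape) < 2:
        return "adam"
    return "muon"

groups = {"muon": [], "adam": []}
for name, shape in params.items():
    groups[route(name, shape)].append(name)
```

Muon's orthogonalized update is defined for matrices, which is why non-matrix parameters are typically handed to Adam.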
Evaluation
sliding window eval
parameters: null
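Since no parameters are given, here is a generic sliding-window evaluation sketch: each token is scored with a fixed left context, and total negative log-likelihood is converted to bits per byte. The window size and the uniform toy model are assumptions, and the NLL-to-BPB conversion below assumes one token per byte:

```python
import math

def sliding_window_nll(tokens, window, nll_fn):
    """Score each token given at most `window` tokens of left context."""
    total = 0.0
    for i in range(len(tokens)):
        ctx = tokens[max(0, i - window):i]
        total += nll_fn(ctx, tokens[i])
    return total

# Toy model: uniform over 256 symbols -> NLL = ln(256) nats per token.
n_tokens = 100
nll = sliding_window_nll(list(range(n_tokens)), window=8,
                         nll_fn=lambda ctx, t: math.log(256))
bpb = nll / math.log(2) / n_tokens     # nats -> bits, then per byte
```

A uniform byte model scores exactly 8.0 BPB, which gives a sanity anchor for the 1.25 figure reported above.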
Quantization
int6
bits: 6
scope: all
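A symmetric per-tensor scheme is one plausible reading of "int6, scope: all"; the actual scaling and packing used by the submission are not stated, so the code below is a sketch under that assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: 6 bits -> signed range [-32, 31]."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)

# Round-trip error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

In a real artifact the 6-bit codes would also be bit-packed (four values per three bytes) rather than stored as int8, which is where the size savings come from.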
Compression
zstd
level: null
Other
other
Light pruning and repacking of the checkpoint to fit under the 16,000,000 byte size cap.
parameters: {"cap_bytes":16000000}
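The compress-then-check flow behind the cap can be sketched as below; zstd is not in the Python standard library, so zlib stands in here purely for illustration, and the payload is a toy stand-in for the checkpoint:

```python
import zlib

# Byte cap from the submission's parameters.
CAP_BYTES = 16_000_000

def pack(raw: bytes) -> bytes:
    """Compress the serialized checkpoint (zlib standing in for zstd)."""
    return zlib.compress(raw, level=9)

def fits(raw: bytes) -> bool:
    """True if the packed artifact comes in under the size cap."""
    return len(pack(raw)) <= CAP_BYTES

payload = b"\x00" * 1_000_000          # toy checkpoint stand-in
assert fits(payload)
```

When the packed size lands just over the cap, as described above, the remaining lever is shrinking the raw payload (pruning) before repacking.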

Novel Contributions

  • Local constrained-hardware run on an 8 GB Blackwell-class GPU
  • Use of train_merged_gpt_flagged.py for a non-record submission
  • Initial packed artifact was slightly over the 16,000,000 byte cap
  • Light pruning and repacking to produce an under-cap artifact
  • Reported compressed-model quality around 1.21 BPB before pruning and about 1.25 BPB after pruning