PR #2048

open

[Non-record submission] NativeBonsaiBinary 1 bit non-record submission (1.3551 BPB)

by kineticforge
val_bpb: 1.3551
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,087,895 bytes

Training Techniques

Architecture
  • weight tying — tied embeddings were enabled (parameters: null)
  • GQA — grouped-query attention with fewer KV heads than attention heads (parameters: {"heads":16,"kv_heads":4})
  • RoPE — applied with a reduced rotary dimension (parameters: {"rope_dim":16})
  • MLP3x — SwiGLU MLP with 4x expansion (parameters: {"mlp_mult":4}); see the block sketch below
Quantization
  • STE QAT — bits: 1, scope: grouped linear weights (see the sketch below)
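
A minimal sketch of what a 1-bit, grouped, straight-through-estimator QAT linear could look like; only bits=1 and the "grouped linear weights" scope come from the listing, while the group size, scale parameterization, and class name are assumptions.

```python
# Hypothetical 1-bit grouped STE QAT linear with learned per-group scales.
# group_size and the sign/scale layout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGroupedLinear(nn.Module):
    def __init__(self, in_features, out_features, group_size=64):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One learned scale per (output row, input group).
        self.scale = nn.Parameter(
            torch.full((out_features, in_features // group_size), 0.02)
        )

    def quantized_weight(self):
        w = self.weight.view(self.weight.shape[0], -1, self.group_size)
        # Forward pass sees +/-1 signs; backward passes gradients straight through (STE).
        w_bin = torch.where(w >= 0, 1.0, -1.0)
        w_bin = w + (w_bin - w).detach()
        w_q = w_bin * self.scale.unsqueeze(-1).abs()
        return w_q.view_as(self.weight)

    def forward(self, x):
        return F.linear(x, self.quantized_weight())
```
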
Compression
  • lzma — level: null (see the sketch below)
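
The exported weight packet is LZMA-compressed; a minimal sketch with Python's standard lzma module. File names are placeholders, and since the level is unspecified (null) the library default preset is used here.

```python
# Hypothetical sketch: LZMA-compress an exported weight packet with the stdlib.
import lzma

with open("packet.bin", "rb") as f:       # placeholder file name
    raw = f.read()

compressed = lzma.compress(raw)           # default preset, xz container
with open("packet.bin.xz", "wb") as f:
    f.write(compressed)

print(f"{len(raw)} -> {len(compressed)} bytes")
```
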
Optimizer
  • Muon — weight_decay: null, momentum: 0.99; see the momentum warmup sketch below
    other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":500,"backend_steps":5,"matrix_lr":0.006,"scalar_lr":0.006,"tied_embed_lr":0.009}
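
The other_params imply the Muon momentum is ramped from 0.92 to 0.99 over the first 500 steps, with separate learning rates for matrix, scalar, and tied-embedding parameter groups. A minimal sketch of that warmup schedule; the param-group layout and the "use_muon" flag are assumptions, only the numeric values come from the listing.

```python
# Hypothetical momentum warmup implied by other_params: linearly ramp
# momentum from 0.92 to 0.99 over the first 500 steps.
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=500):
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def set_momentum(optimizer, step):
    # Apply the warmed-up momentum to the Muon-handled param groups
    # (the "use_muon" group flag is an assumption for illustration).
    m = muon_momentum(step)
    for group in optimizer.param_groups:
        if group.get("use_muon", False):
            group["momentum"] = m
```
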
Sequence Length
  • sequence_length — train_length: 1024, eval_length: null
LR Schedule
  • warmdown — parameters: {"warmup_steps":5,"warmdown_iters":8000}; see the sketch below
Regularization
  • logit softcap — parameters: {"softcap":30,"mode":"tanh"}; see the sketch below
Other
  • other — Trained with PyTorch CUDA DDP across 4x NVIDIA H100 80GB GPUs with synchronized stopping and a per-rank data shard offset (parameters: {"world_size":4,"hardware":"4x NVIDIA H100 80GB"}); see the sketch below

Novel Contributions

  • Native 1-bit grouped-binary transformer submission inspired by Bonsai/Qwen3 1-bit structure
  • Grouped binary STE linears with learned scales
  • DDP rank data offset to avoid duplicated minibatches across ranks
  • Synchronized wallclock stop across DDP ranks to prevent training hangs
  • Native packet accounting: the exported LZMA-compressed packet plus the counted code size (see the sketch below)
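
A rough sketch of that accounting: pack the binary weight signs into bits, LZMA-compress the packet together with the scales, and add the byte size of the submitted code. The packing layout, scale dtype, and file handling are assumptions, not the submission's actual format.

```python
# Hypothetical packet accounting: 1-bit sign packing + LZMA + counted code size.
import lzma
import numpy as np

def pack_signs(weight):                       # weight: numpy float matrix of +/- values
    bits = (weight >= 0).astype(np.uint8)     # 1 bit per weight
    return np.packbits(bits.ravel()).tobytes()

def packet_bytes(sign_packets, scales):
    # Concatenate packed signs and (assumed float16) scales, then LZMA-compress.
    payload = b"".join(sign_packets) + np.concatenate(scales).astype(np.float16).tobytes()
    return lzma.compress(payload)

def counted_size(packet, code_paths):
    code = sum(len(open(p, "rb").read()) for p in code_paths)
    return len(packet) + code
```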