PR #2048

open

[Non-record submission] NativeBonsaiBinary 1 bit non-record submission (1.3551 BPB)

by kineticforge
val_bpb: 1.3551
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,087,895 bytes

Training Techniques

Architecture
  • weight tying — tied embeddings were enabled (parameters: null)
  • GQA — grouped-query attention with fewer KV heads than attention heads (parameters: {"heads":16,"kv_heads":4})
  • RoPE — applied with a reduced rotary dimension (parameters: {"rope_dim":16})
  • MLP3x — SwiGLU MLP with 4x expansion (parameters: {"mlp_mult":4}); see the block sketch below
Quantization
  • STE QAT — bits: 1, scope: grouped linear weights (see the sketch below)
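
A minimal sketch of what a 1-bit, grouped, straight-through-estimator QAT linear could look like; only bits=1 and the "grouped linear weights" scope come from the listing, while the group size, scale parameterization, and class name are assumptions.

```python
# Hypothetical 1-bit grouped STE QAT linear with learned per-group scales.
# group_size and the sign/scale layout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGroupedLinear(nn.Module):
    def __init__(self, in_features, out_features, group_size=64):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One learned scale per (output row, input group).
        self.scale = nn.Parameter(
            torch.full((out_features, in_features // group_size), 0.02)
        )

    def quantized_weight(self):
        w = self.weight.view(self.weight.shape[0], -1, self.group_size)
        # Forward pass sees +/-1 signs; backward passes gradients straight through (STE).
        w_bin = torch.where(w >= 0, 1.0, -1.0)
        w_bin = w + (w_bin - w).detach()
        w_q = w_bin * self.scale.unsqueeze(-1).abs()
        return w_q.view_as(self.weight)

    def forward(self, x):
        return F.linear(x, self.quantized_weight())
```
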
Compression
  • lzma — level: null (see the sketch below)
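
The exported weight packet is LZMA-compressed; a minimal sketch with Python's standard lzma module. File names are placeholders, and since the level is unspecified (null) the library default preset is used here.

```python
# Hypothetical sketch: LZMA-compress an exported weight packet with the stdlib.
import lzma

with open("packet.bin", "rb") as f:       # placeholder file name
    raw = f.read()

compressed = lzma.compress(raw)           # default preset, xz container
with open("packet.bin.xz", "wb") as f:
    f.write(compressed)

print(f"{len(raw)} -> {len(compressed)} bytes")
```
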
Optimizer
  • Muon — weight_decay: null, momentum: 0.99; see the momentum warmup sketch below
    other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":500,"backend_steps":5,"matrix_lr":0.006,"scalar_lr":0.006,"tied_embed_lr":0.009}
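
The other_params imply the Muon momentum is ramped from 0.92 to 0.99 over the first 500 steps, with separate learning rates for matrix, scalar, and tied-embedding parameter groups. A minimal sketch of that warmup schedule; the param-group layout and the "use_muon" flag are assumptions, only the numeric values come from the listing.

```python
# Hypothetical momentum warmup implied by other_params: linearly ramp
# momentum from 0.92 to 0.99 over the first 500 steps.
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=500):
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def set_momentum(optimizer, step):
    # Apply the warmed-up momentum to the Muon-handled param groups
    # (the "use_muon" group flag is an assumption for illustration).
    m = muon_momentum(step)
    for group in optimizer.param_groups:
        if group.get("use_muon", False):
            group["momentum"] = m
```
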
Sequence Length
  • sequence_length — train_length: 1024, eval_length: null
LR Schedule
  • warmdown — parameters: {"warmup_steps":5,"warmdown_iters":8000}; see the sketch below
Regularization
  • logit softcap — parameters: {"softcap":30,"mode":"tanh"}; see the sketch below
Other
  • other — Trained with PyTorch CUDA DDP across 4x NVIDIA H100 80GB GPUs with synchronized stopping and a per-rank data shard offset (parameters: {"world_size":4,"hardware":"4x NVIDIA H100 80GB"}); see the sketch below

Novel Contributions

  • Native 1-bit grouped-binary transformer submission inspired by Bonsai/Qwen3 1-bit structure
  • Grouped binary STE linears with learned scales
  • DDP rank data offset to avoid duplicated minibatches across ranks
  • Synchronized wallclock stop across DDP ranks to prevent training hangs
  • Native packet accounting: the exported LZMA-compressed packet plus the counted code size (see the sketch below)
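
A rough sketch of that accounting: pack the binary weight signs into bits, LZMA-compress the packet together with the scales, and add the byte size of the submitted code. The packing layout, scale dtype, and file handling are assumptions, not the submission's actual format.

```python
# Hypothetical packet accounting: 1-bit sign packing + LZMA + counted code size.
import lzma
import numpy as np

def pack_signs(weight):                       # weight: numpy float matrix of +/- values
    bits = (weight >= 0).astype(np.uint8)     # 1 bit per weight
    return np.packbits(bits.ravel()).tobytes()

def packet_bytes(sign_packets, scales):
    # Concatenate packed signs and (assumed float16) scales, then LZMA-compress.
    payload = b"".join(sign_packets) + np.concatenate(scales).astype(np.float16).tobytes()
    return lzma.compress(payload)

def counted_size(packet, code_paths):
    code = sum(len(open(p, "rb").read()) for p in code_paths)
    return len(packet) + code
```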