PR #2048
[Non-record submission] NativeBonsaiBinary 1-bit non-record submission (1.3551 BPB)
by kineticforge
val_bpb
1.3551
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,087,895 bytes
Training Techniques
Architecture
weight tying
Tied embeddings were enabled.
parameters: null
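A minimal sketch of weight tying: the token-embedding matrix doubles as the output-projection matrix, so logits are dot products against the same rows used for lookup. The tiny matrix below is illustrative only (the submission is PyTorch; this is plain Python for clarity).

```python
# Illustrative 3-token vocab, d_model=2; values are made up.
E = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def embed(token_id):
    # Input embedding: row lookup in E.
    return E[token_id]

def logits(hidden):
    # Tied output head: project against the same matrix E (i.e. E^T),
    # so no separate unembedding weights exist in the artifact.
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

print(logits(embed(2)))  # token 2 scores highest against its own row
```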
GQA
Used grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":16,"kv_heads":4}
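With 16 attention heads and 4 KV heads, each KV head serves a group of 16 // 4 = 4 consecutive query heads. A sketch of that head mapping, using the parameters above:

```python
# GQA head mapping for heads=16, kv_heads=4: each KV head is shared by
# a contiguous group of query heads, shrinking the KV cache 4x.
HEADS, KV_HEADS = 16, 4
GROUP = HEADS // KV_HEADS  # 4 query heads per KV head

def kv_head_for(q_head):
    return q_head // GROUP

mapping = [kv_head_for(q) for q in range(HEADS)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```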
RoPE
Applied RoPE with a reduced rotary dimension.
parameters: {"rope_dim":16}
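A sketch of partial RoPE with rope_dim=16: only the first 16 dimensions of each head are rotated; the rest pass through unchanged. The head dimension and base frequency below are assumptions, not taken from the submission.

```python
import math

ROPE_DIM = 16            # from the parameters above
HEAD_DIM, BASE = 32, 10000.0  # illustrative assumptions

def rope(x, pos):
    out = list(x)
    for i in range(0, ROPE_DIM, 2):
        # Standard RoPE pairwise rotation, applied only to the rotary slice.
        theta = pos / (BASE ** (i / ROPE_DIM))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0] * HEAD_DIM
y = rope(x, pos=3)
assert y[ROPE_DIM:] == x[ROPE_DIM:]  # dims beyond rope_dim untouched
```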
MLP3x
Used a SwiGLU MLP with 4x expansion.
parameters: {"mlp_mult":4}
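A sketch of the SwiGLU computation, shown on a single scalar "channel" for readability; the real layer uses weight matrices with a 4x hidden expansion (mlp_mult=4). The w1/w3 gate-and-value projections and w2 down-projection names are illustrative.

```python
import math

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w1, w3, w2):
    # hidden = silu(x @ W1) * (x @ W3), then down-project with W2;
    # scalars stand in for the matrix multiplies.
    return silu(x * w1) * (x * w3) * w2
```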
Quantization
STE QAT
bits: 1
scope: grouped linear weights
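A sketch of grouped 1-bit weight quantization: each group of weights is replaced by sign(w) times a per-group scale. Here the scale is the group's mean absolute value for illustration (the submission uses learned scales); in QAT the straight-through estimator passes gradients through this step as if it were the identity.

```python
GROUP_SIZE = 4  # illustrative group size

def binarize_grouped(weights):
    out = []
    for g in range(0, len(weights), GROUP_SIZE):
        group = weights[g:g + GROUP_SIZE]
        # Per-group scale: mean |w| here; the submission learns this scale.
        scale = sum(abs(w) for w in group) / len(group)
        # 1 bit per weight: only the sign survives, times the group scale.
        out.extend(scale if w >= 0 else -scale for w in group)
    return out

print(binarize_grouped([0.5, -0.5, 1.0, -2.0]))  # [1.0, -1.0, 1.0, -1.0]
```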
Compression
lzma
level: null
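A sketch of the packet export step using the stdlib `lzma` module; "level: null" above suggests library defaults, so no preset is passed here. The packet bytes are illustrative.

```python
import lzma

# The exported weight packet is LZMA-compressed, and the compressed
# byte count (plus counted code size) makes up the artifact size.
packet = b"binary weight packet" * 100  # illustrative payload
compressed = lzma.compress(packet)      # default preset / container

assert lzma.decompress(compressed) == packet  # lossless round trip
print(len(packet), "->", len(compressed))
```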
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":500,"backend_steps":5,"matrix_lr":0.006,"scalar_lr":0.006,"tied_embed_lr":0.009}
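A sketch of the momentum warmup implied by `other_params`: momentum ramps from 0.92 to 0.99 over the first 500 steps, then holds. Only the endpoints and step count come from the config; the linear ramp shape is an assumption.

```python
START, FINAL, WARMUP_STEPS = 0.92, 0.99, 500  # from other_params / momentum

def momentum_at(step):
    if step >= WARMUP_STEPS:
        return FINAL
    # Assumed linear interpolation between warmup start and final momentum.
    return START + (step / WARMUP_STEPS) * (FINAL - START)

assert momentum_at(0) == START
assert momentum_at(1000) == FINAL
```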
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":5,"warmdown_iters":8000}
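A sketch of a warmup + warmdown multiplier: 5 linear warmup steps, a constant phase, then a linear decay over the final 8,000 steps. The step counts come from the parameters above; the total step count, constant middle phase, and decay-to-zero endpoint are assumptions.

```python
WARMUP, WARMDOWN = 5, 8000   # from the parameters above
TOTAL = 20000                # illustrative total step count

def lr_mult(step):
    if step < WARMUP:
        return (step + 1) / WARMUP          # linear warmup
    if step < TOTAL - WARMDOWN:
        return 1.0                          # constant phase (assumed)
    return max(0.0, (TOTAL - step) / WARMDOWN)  # linear warmdown

assert lr_mult(0) == 0.2
assert lr_mult(100) == 1.0
assert lr_mult(TOTAL) == 0.0
```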
Regularization
logit softcap
parameters: {"softcap":30,"mode":"tanh"}
Other
other
Trained with PyTorch CUDA DDP across 4x NVIDIA H100 80GB GPUs, with synchronized stopping and a per-rank data shard offset.
parameters: {"world_size":4,"hardware":"4x NVIDIA H100 80GB"}
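A sketch of the per-rank data shard offset: each of the 4 DDP ranks starts reading the token stream at a distinct offset so no two ranks train on the same minibatch. The stride formula below is an assumption consistent with the description, not the submission's exact code.

```python
WORLD_SIZE = 4            # from the parameters above
TOKENS_PER_STEP = 1024    # illustrative per-rank chunk size

def rank_offset(rank):
    # Assumed scheme: rank r begins r chunks into the shard, so the
    # ranks cover disjoint slices instead of duplicating minibatches.
    return rank * TOKENS_PER_STEP

offsets = [rank_offset(r) for r in range(WORLD_SIZE)]
assert len(set(offsets)) == WORLD_SIZE  # every rank starts somewhere distinct
print(offsets)  # [0, 1024, 2048, 3072]
```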
Novel Contributions
- Native 1-bit grouped-binary transformer submission inspired by Bonsai/Qwen3 1-bit structure
- Grouped binary STE linears with learned scales
- DDP rank data offset to avoid duplicated minibatches across ranks
- Synchronized wallclock stop across DDP ranks to prevent training hangs
- Native packet accounting with exported LZMA-compressed packet plus counted code size