PR #1388

open

[Notable Non-Record Submission] Everything Everywhere All in One Bit: XNOR-mally I'd use floats - 118M XNOR-Net - 1.539 BPB - 10-Min and Unconstrained Runs

by CiprianFlorin-Ifrim
val_bpb
1.5390
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.91MB

Training Techniques

Architecture
U-Net skip connections
Transformer split into encoder/decoder halves with learnable skip connections between corresponding layers.
parameters: {"layers":10,"model_dim":1024,"heads":8,"kv_heads":4,"mlp_mult":4,"embed_dim":384}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
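A minimal sketch of the grouped query attention setting above (8 query heads sharing 4 KV heads) with causal masking; the tensor shapes and masking details are illustrative assumptions, not taken from the submission:

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: KV heads are repeated so each is shared
    by heads // kv_heads query heads (here 2 query heads per KV head)."""
    group = q.shape[0] // k.shape[0]           # e.g. 8 // 4 = 2
    k = np.repeat(k, group, axis=0)            # (kv_heads, T, d) -> (heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones(scores.shape[-2:], dtype=bool), k=1)  # causal mask
    scores = np.where(mask, -1e30, scores)     # block attention to the future
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # softmax over key positions
    return w @ v                               # (heads, T, d)
```

With 4 KV heads instead of 8, the KV cache and the K/V projection parameters are halved while the query resolution is unchanged.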
weight tying
Input embeddings and output head are tied.
parameters: null
RoPE
Uses YaRN-scaled rotary positional embeddings.
parameters: {"base":5000,"max_len":2048}
signsq
Activation function defined as x * abs(x) to preserve signed information under activation binarization.
parameters: null
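The signsq activation is simple enough to state directly; a sketch of the x * abs(x) definition described above:

```python
import numpy as np

def signsq(x):
    """signsq(x) = x * |x|: squares the magnitude while keeping the sign,
    so a downstream sign() binarizer still sees the original polarity
    (unlike ReLU- or square-style activations, which discard it)."""
    return x * np.abs(x)
```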
SmearGate
A Smear gating module was explored as an architectural variant but was not used in the final config.
parameters: null
Quantization
STE QAT
bits: null
scope: weights and activations
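A sketch of straight-through-estimator QAT in the XNOR-Net style (sign binarization with a mean-|w| scale, identity gradient clipped hard-tanh-style); written with explicit forward/backward functions for clarity rather than as an autograd op, and the clip range is an assumption:

```python
import numpy as np

def binarize_ste_forward(w):
    """XNOR-Net style binarization: sign(w) scaled by the mean |w|,
    so the binary weights {-alpha, +alpha} approximate w in L1."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

def binarize_ste_backward(grad_out, w, clip=1.0):
    """Straight-through estimator: pass the gradient through unchanged,
    zeroing it where |w| exceeds the clip range (hard-tanh STE)."""
    return grad_out * (np.abs(w) <= clip)
```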
FP8
bits: 8
scope: FP parameters and scales
BF16 scales
bits: 16
scope: group scales
activation binarization
bits: 1
scope: activations
Optimizer
Muon
weight_decay: 0.04
momentum: 0.8
other_params: {"backend_steps":3}
Weight Averaging
EMA
parameters: {"start_step":0}
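With start_step 0, the EMA tracks the weights from the very first update; a minimal sketch (the decay value here is an assumption, not from the config):

```python
class EMA:
    """Exponential moving average of parameters, updated every step
    starting from step 0 (start_step: 0 in the config above)."""
    def __init__(self, params, decay=0.999):  # decay is an assumed value
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            # shadow <- decay * shadow + (1 - decay) * current
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * float(v)
```

The shadow weights, not the raw training weights, are what get exported for evaluation.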
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":16}
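A sketch of how a stride-16 sliding-window evaluation can tile a document: each window sees a full context, but after the first window only the final stride tokens are scored, so nearly every token is predicted with maximal left context. The exact tiling here is an assumption about the submission's implementation:

```python
def sliding_window_spans(n_tokens, window=1024, stride=16):
    """Return (context_start, score_start, score_end) spans covering every
    token exactly once: the first window scores all its tokens; later
    windows slide by `stride` and score only their last `stride` tokens."""
    spans = [(0, 0, min(window, n_tokens))]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((end - window, pos, end))  # near-full left context
        pos = end
    return spans
```

The cost is roughly window/stride forward passes per window-length of text, which is why small strides like 16 are expensive but give the best BPB.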
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
cosine decay
parameters: null
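A sketch of the cosine decay schedule; base and minimum learning rates are placeholders, since the config does not list them:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    t = min(step / total_steps, 1.0)          # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```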
Regularization
logit softcap
parameters: {"value":10}
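Logit softcapping with value 10 bounds the logits smoothly via a scaled tanh; a minimal sketch:

```python
import numpy as np

def softcap(logits, cap=10.0):
    """Squash logits into (-cap, cap) while staying near-identity for
    small values: cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```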
gradient clipping
parameters: {"norm":1}

Novel Contributions

  • Full XNOR-Net language model with both weight and activation binarization
  • Mode 2 activation binarization that skips the MLP down projection to avoid the full-XNOR quality ceiling
  • signsq activation to preserve signed information under activation binarization
  • Scale QAT to eliminate long-run roundtrip degradation from FP8 scale quantization
  • Custom Triton XNOR+popcount kernel for true 1-bit matrix multiplication
  • Sequence length scheduling from short to long contexts during training
  • Low-momentum Muon optimization tuned for binary STE training
  • U-Net skip connections as an alternative to attention residuals for binary transformers
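The submission's kernel is written in Triton; the arithmetic it implements can be sketched in plain numpy. With ±1 operands packed into bit vectors, XNOR marks sign matches and popcount tallies them, giving dot = matches - mismatches = 2·matches - n. This is a toy illustration of that identity, not the actual kernel:

```python
import numpy as np

def pack_bits(x):
    """Pack a ±1 vector into a bit vector, mapping +1 -> 1 and -1 -> 0."""
    return np.packbits((x > 0).astype(np.uint8))

def xnor_dot(a_packed, b_packed, n):
    """1-bit dot product: XNOR marks positions where signs agree, and a
    popcount over the first n bits counts them; dot = 2 * matches - n."""
    xnor = ~(a_packed ^ b_packed)             # bitwise XNOR on packed uint8
    matches = int(np.unpackbits(xnor)[:n].sum())  # drop padding bits past n
    return 2 * matches - n
```

A real kernel does this per 32- or 64-bit word with a hardware popcount, replacing the multiply-accumulate inner loop of the matmul entirely.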