PR #1388

open

[Notable Non-Record Submission] Everything Everywhere All in One Bit: XNOR-mally I'd use floats - 118M XNOR-Net - 1.539 BPB - 10-Min and Unconstrained Runs

by CiprianFlorin-Ifrim
val_bpb
1.5390
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.91MB

Training Techniques

Architecture
U-Net skip connections
Transformer split into encoder/decoder halves with learnable skip connections between corresponding layers.
parameters: {"layers":10,"model_dim":1024,"heads":8,"kv_heads":4,"mlp_mult":4,"embed_dim":384}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
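A minimal sketch of the grouped query attention setting above (8 query heads sharing 4 KV heads) with causal masking; the tensor shapes and masking details are illustrative assumptions, not taken from the submission:

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: KV heads are repeated so each is shared
    by heads // kv_heads query heads (here 2 query heads per KV head)."""
    group = q.shape[0] // k.shape[0]           # e.g. 8 // 4 = 2
    k = np.repeat(k, group, axis=0)            # (kv_heads, T, d) -> (heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones(scores.shape[-2:], dtype=bool), k=1)  # causal mask
    scores = np.where(mask, -1e30, scores)     # block attention to the future
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # softmax over key positions
    return w @ v                               # (heads, T, d)
```

With 4 KV heads instead of 8, the KV cache and the K/V projection parameters are halved while the query resolution is unchanged.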
weight tying
Input embeddings and output head are tied.
parameters: null
RoPE
Uses YaRN-scaled rotary positional embeddings.
parameters: {"base":5000,"max_len":2048}
signsq
Activation function defined as x * abs(x) to preserve signed information under activation binarization.
parameters: null
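The signsq activation is simple enough to state directly; a sketch of the x * abs(x) definition described above:

```python
import numpy as np

def signsq(x):
    """signsq(x) = x * |x|: squares the magnitude while keeping the sign,
    so a downstream sign() binarizer still sees the original polarity
    (unlike ReLU- or square-style activations, which discard it)."""
    return x * np.abs(x)
```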
SmearGate
A Smear gating module was explored as an architectural variant but was not used in the final config.
parameters: null
Quantization
STE QAT
bits: null
scope: weights and activations
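A sketch of straight-through-estimator QAT in the XNOR-Net style (sign binarization with a mean-|w| scale, identity gradient clipped hard-tanh-style); written with explicit forward/backward functions for clarity rather than as an autograd op, and the clip range is an assumption:

```python
import numpy as np

def binarize_ste_forward(w):
    """XNOR-Net style binarization: sign(w) scaled by the mean |w|,
    so the binary weights {-alpha, +alpha} approximate w in L1."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

def binarize_ste_backward(grad_out, w, clip=1.0):
    """Straight-through estimator: pass the gradient through unchanged,
    zeroing it where |w| exceeds the clip range (hard-tanh STE)."""
    return grad_out * (np.abs(w) <= clip)
```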
FP8
bits: 8
scope: FP parameters and scales
BF16 scales
bits: 16
scope: group scales
activation binarization
bits: 1
scope: activations
Optimizer
Muon
weight_decay: 0.04
momentum: 0.8
other_params: {"backend_steps":3}
Weight Averaging
EMA
parameters: {"start_step":0}
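With start_step 0, the EMA tracks the weights from the very first update; a minimal sketch (the decay value here is an assumption, not from the config):

```python
class EMA:
    """Exponential moving average of parameters, updated every step
    starting from step 0 (start_step: 0 in the config above)."""
    def __init__(self, params, decay=0.999):  # decay is an assumed value
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            # shadow <- decay * shadow + (1 - decay) * current
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * float(v)
```

The shadow weights, not the raw training weights, are what get exported for evaluation.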
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":16}
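A sketch of how a stride-16 sliding-window evaluation can tile a document: each window sees a full context, but after the first window only the final stride tokens are scored, so nearly every token is predicted with maximal left context. The exact tiling here is an assumption about the submission's implementation:

```python
def sliding_window_spans(n_tokens, window=1024, stride=16):
    """Return (context_start, score_start, score_end) spans covering every
    token exactly once: the first window scores all its tokens; later
    windows slide by `stride` and score only their last `stride` tokens."""
    spans = [(0, 0, min(window, n_tokens))]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((end - window, pos, end))  # near-full left context
        pos = end
    return spans
```

The cost is roughly window/stride forward passes per window-length of text, which is why small strides like 16 are expensive but give the best BPB.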
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
cosine decay
parameters: null
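A sketch of the cosine decay schedule; base and minimum learning rates are placeholders, since the config does not list them:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    t = min(step / total_steps, 1.0)          # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```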
Regularization
logit softcap
parameters: {"value":10}
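Logit softcapping with value 10 bounds the logits smoothly via a scaled tanh; a minimal sketch:

```python
import numpy as np

def softcap(logits, cap=10.0):
    """Squash logits into (-cap, cap) while staying near-identity for
    small values: cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)
```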
gradient clipping
parameters: {"norm":1}

Novel Contributions

  • Full XNOR-Net language model with both weight and activation binarization
  • Mode 2 activation binarization that skips the MLP down projection to avoid the full-XNOR quality ceiling
  • signsq activation to preserve signed information under activation binarization
  • Scale QAT to eliminate long-run roundtrip degradation from FP8 scale quantization
  • Custom Triton XNOR+popcount kernel for true 1-bit matrix multiplication
  • Sequence length scheduling from short to long contexts during training
  • Low-momentum Muon optimization tuned for binary STE training
  • U-Net skip connections as an alternative to attention residuals for binary transformers
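The submission's kernel is written in Triton; the arithmetic it implements can be sketched in plain numpy. With ±1 operands packed into bit vectors, XNOR marks sign matches and popcount tallies them, giving dot = matches - mismatches = 2·matches - n. This is a toy illustration of that identity, not the actual kernel:

```python
import numpy as np

def pack_bits(x):
    """Pack a ±1 vector into a bit vector, mapping +1 -> 1 and -1 -> 0."""
    return np.packbits((x > 0).astype(np.uint8))

def xnor_dot(a_packed, b_packed, n):
    """1-bit dot product: XNOR marks positions where signs agree, and a
    popcount over the first n bits counts them; dot = 2 * matches - n."""
    xnor = ~(a_packed ^ b_packed)             # bitwise XNOR on packed uint8
    matches = int(np.unpackbits(xnor)[:n].sum())  # drop padding bits past n
    return 2 * matches - n
```

A real kernel does this per 32- or 64-bit word with a hardware popcount, replacing the multiply-accumulate inner loop of the matmul entirely.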