PR #641
Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary Asymmetric U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h
by CiprianFlorin-Ifrim
val_bpb
1.1239
Architecture
Asymmetric Binary U-Net Transformer
Optimizer
NeoMuon
Artifact Size
15.67MB
Training Techniques
Quantization
1-bit binary quantisation
bits: 1
scope: all weights
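A minimal sketch of 1-bit quantisation with a per-tensor scale. Only "bits: 1, scope: all weights" is stated; the mean-absolute-value scale and sign rule below are assumptions for illustration:

```python
def binarize(weights):
    # per-tensor scale kept in full precision (scaling scheme is an
    # assumption; the submission only specifies 1 bit over all weights)
    scale = sum(abs(w) for w in weights) / len(weights)
    bits = [1 if w >= 0 else 0 for w in weights]
    return bits, scale

def dequantize(bits, scale):
    # reconstruct each weight as +scale or -scale
    return [scale if b else -scale for b in bits]
```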
Architecture
SmearGate
causal cumulative mean blending with learned tanh gate, zero-init for safe residual start
parameters: null
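The SmearGate description can be sketched as below. The tanh gate and zero-init (identity at start) are stated; the convex blend between the token and its causal cumulative mean is an assumption:

```python
import math

def smear_gate(xs, gate_param):
    # learned scalar gate; zero-init gives tanh(0) = 0, i.e. a safe
    # identity/residual start
    g = math.tanh(gate_param)
    out, running_sum = [], 0.0
    for t, x in enumerate(xs, start=1):
        running_sum += x
        causal_mean = running_sum / t  # mean over positions 1..t only
        out.append((1 - g) * x + g * causal_mean)  # assumed blend form
    return out
```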
Factored tied embedding
8192×254 bottleneck with learned projections
parameters: {"vocab_size":8192,"embedding_dim":254}
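The parameter saving from the stated factorisation follows directly from the shapes (a 768 model dim is taken from the U-Net parameters below):

```python
# Embedding factored as E (vocab × bottleneck) @ P (bottleneck × model_dim),
# with the same factors tied at the output head.
vocab_size, bottleneck, model_dim = 8192, 254, 768

factored_params = vocab_size * bottleneck + bottleneck * model_dim
full_params = vocab_size * model_dim  # unfactored tied embedding
```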
YaRN positional encoding
max_len=2048, rope_base=5000
parameters: {"max_len":2048,"rope_base":5000}
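A sketch of the underlying RoPE frequency table with rope_base=5000 and the head_dim=96 stated below. YaRN's per-band frequency rescaling and attention-temperature adjustment are omitted here:

```python
def rope_freqs(head_dim=96, rope_base=5000):
    # standard RoPE inverse frequencies, one per dimension pair;
    # YaRN additionally rescales these per frequency band for
    # long-context use (rescaling not shown)
    return [rope_base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```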
U-Net encoder/decoder
15 transformer layers (7 encoder, 8 decoder) with learned skip weights (ones-init) and per-block residual mix from input embedding
parameters: {"layers":15,"dim":768,"heads":8,"kv_heads":4,"head_dim":96}
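The ones-initialised learned skip weights suggest a decoder-block input of roughly this form (a scalar weight per skip connection is an assumption):

```python
def decoder_block_input(x, skip, skip_weight=1.0):
    # learned scalar skip weight, initialised to 1.0 (ones-init);
    # the per-block residual mix from the input embedding is analogous
    return [xi + skip_weight * si for xi, si in zip(x, skip)]
```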
MLP
4x expansion with relu² activation, fused gate+up projection
parameters: {"expansion_factor":4,"hidden_dim":3072,"activation":"relu²"}
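A sketch of the MLP hidden computation, assuming "fused gate+up projection" means one matmul emitting both halves, split GLU-style with relu² applied to the gate half:

```python
def relu2(x):
    # relu-squared activation: max(x, 0)^2
    return max(x, 0.0) ** 2

def mlp_hidden(fused_out):
    # fused projection output holds [gate | up]; split, gate with relu²,
    # multiply elementwise (GLU-style combination is an assumption)
    half = len(fused_out) // 2
    gate, up = fused_out[:half], fused_out[half:]
    return [relu2(g) * u for g, u in zip(gate, up)]
```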
Optimizer
NeoMuon
weight_decay: 0
momentum: 0.95
other_params: {"muon_backend_steps":3,"muon_momentum_warmup_start":0.85,"muon_momentum_warmup_steps":500}
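The 3 "muon_backend_steps" presumably count Newton-Schulz orthogonalisation iterations, as in Muon. A pure-Python sketch using the quintic coefficients from the public Muon implementation (NeoMuon's exact iteration may differ):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz(G, steps=3):
    # quintic Newton-Schulz iteration driving singular values toward 1:
    # X <- a*X + b*(X X^T)X + c*(X X^T)^2 X, after Frobenius normalisation
    a, b, c = 3.4445, -4.7750, 2.0315  # public Muon coefficients
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        X = [[a * X[i][j]
              + sum((b * A[i][k] + c * A2[i][k]) * X[k][j]
                    for k in range(len(A)))
              for j in range(len(X[0]))] for i in range(len(X))]
    return X
```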
Evaluation
sliding window eval
parameters: {"stride":16,"temperature_scaling":0.9}
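Stride-based sliding-window evaluation can be sketched as follows: each token is scored exactly once, with up to a full window of left context; under the stated temperature scaling, logits would be divided by T=0.9 before the softmax. The window length here is illustrative:

```python
def stride_eval_spans(n_tokens, window=1024, stride=16):
    # yields (window_start, window_end, n_scored): only the final
    # n_scored tokens of each window contribute to the loss, the rest
    # serve as context
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```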
Compression
bit-packing + LZMA
level: 9
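The compression stage maps naturally onto packing 8 sign bits per byte and compressing with LZMA at preset 9; the MSB-first packing order is an assumption:

```python
import lzma

def pack_bits(bits):
    # pack 8 binary weights per byte, MSB first (order is an assumption)
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        byte <<= 8 - len(chunk)  # zero-pad a final partial byte
        out.append(byte)
    return bytes(out)

def compress_weights(bits):
    # stdlib LZMA, preset 9 as stated in the submission
    return lzma.compress(pack_bits(bits), preset=9)
```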
Regularization
Polynomial softcap with Z-loss regularisation
parameters: {"degree":5,"cap":10,"z_loss_weight":0.0001}
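One plausible reading of a degree-5 polynomial softcap is a clamped degree-5 odd polynomial standing in for cap·tanh(x/cap); the submission's exact polynomial is not given, so the Taylor form below is illustrative only. Z-loss penalises the squared log-partition with the stated weight:

```python
import math

def poly_softcap(logit, cap=10.0):
    # illustrative degree-5 softcap: clamp, then apply the degree-5
    # Taylor polynomial of tanh, scaled by cap (stand-in, not the
    # submission's exact polynomial)
    t = max(-1.0, min(1.0, logit / cap))
    p = t - t**3 / 3 + 2 * t**5 / 15
    return cap * p

def z_loss(logits, weight=1e-4):
    # Z-loss regularisation: weight * log(Z)^2 keeps logits small
    log_z = math.log(sum(math.exp(l) for l in logits))
    return weight * log_z ** 2
```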
Other
No EMA: maintaining an EMA of the weights hurts quality by 0.03 bpb in this setting, despite clean binary round-trip math.
parameters: null
Sequence Length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.2}
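A warmdown schedule with warmdown_fraction=0.2 conventionally means a constant learning rate followed by linear decay to zero over the final 20% of steps (a warmup phase, if any, is not stated):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_fraction=0.2):
    # constant LR, then linear decay to zero over the final fraction
    decay_start = total_steps * (1 - warmdown_fraction)
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)
```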
Novel Contributions
- Demonstration of 1-bit binary quantisation enabling 106.2M parameters in 15.67MB artifact, packing 60% more parameters per MB than ternary quantisation
- Use of SmearGate: causal cumulative mean blending with learned tanh gate to improve performance despite added compute overhead
- 4x relu² MLP expansion shown to strictly dominate relu and outperform 3x width MLPs at matched budget
- Factored tied embedding with bottleneck dimension 254 for 8192 vocab size
- Use of NeoMuon optimizer with 3 Newton-Schulz steps for training
- Sliding window evaluation with stride 16 and temperature scaling (T=0.90) for improved evaluation accuracy
- Bit-packing combined with LZMA compression to achieve artifact size under 16MB
- Demonstration that extended training (50k steps, ~2.15h) surpasses ternary quantisation quality despite slower convergence
- No EMA used as it degrades quality in this binary quantised setting