PR #126

open

Non-record: BitNet b1.58 + depth recurrence + NorMuon (1.7510 BPB, 3.78 MB)

by Athenox14
val_bpb: 1.7510
Architecture: Transformer
Optimizer: Muon
Artifact Size: 3.78 MB

Training Techniques

Quantization
QAT
bits: 2
scope: all weights
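The b1.58 recipe quantizes every weight to {-1, 0, +1} with an absmean scale and then packs four ternary values per byte (2 bits each). A minimal sketch; the function names `absmean_quantize` and `pack_ternary` and the -1/0/+1 → 0/1/2 bit mapping are illustrative, not taken from the submission:

```python
def absmean_quantize(weights):
    """Quantize floats to {-1, 0, +1} using the BitNet b1.58 absmean rule:
    w_q = clip(round(w / mean(|w|)), -1, 1)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    return [max(-1, min(1, round(w / scale))) for w in weights], scale

def pack_ternary(tern):
    """Pack ternary values 2 bits each (4 per byte), mapping -1/0/+1 -> 0/1/2."""
    out = bytearray()
    for i in range(0, len(tern), 4):
        byte = 0
        for j, t in enumerate(tern[i:i + 4]):
            byte |= (t + 1) << (2 * j)
        out.append(byte)
    return bytes(out)

q, s = absmean_quantize([0.8, -0.05, -1.2, 0.3])  # -> [1, 0, -1, 1]
packed = pack_ternary(q)                          # 4 values in 1 byte
```

During QAT the forward pass uses the quantized weights while gradients flow to the latent full-precision copies (straight-through estimator); only the packed artifact is shipped.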
Architecture
depth recurrence
4 unique transformer blocks are reused 3 times each for 12 effective layers, with U-Net style skip connections between encoder and decoder halves.
parameters: {"unique_layers":4,"recurrence_factor":3,"effective_layers":12}
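The layer schedule implied by these parameters can be sketched as follows. The mirror pairing of encoder and decoder passes follows the U-Net description above, but the additive-skip form is an assumption, since the PR does not spell out the skip arithmetic:

```python
# 4 unique blocks, each reused 3 times -> 12 effective layers.
UNIQUE_LAYERS, RECURRENCE = 4, 3
schedule = [k % UNIQUE_LAYERS for k in range(UNIQUE_LAYERS * RECURRENCE)]
half = len(schedule) // 2
# U-Net pairing: decoder pass d receives a skip from the mirrored
# encoder pass (2 * half - 1 - d).
skips = {d: 2 * half - 1 - d for d in range(half, 2 * half)}

def forward(x, blocks):
    """Run the recurrent schedule with additive skips (illustrative)."""
    cache = {}
    for k, b in enumerate(schedule):
        x = blocks[b](x)
        if k < half:
            cache[k] = x                 # encoder half: stash activations
        else:
            x = x + cache[skips[k]]      # decoder half: add mirrored skip
    return x
```

Weight reuse means the artifact stores only the 4 unique blocks, which is what keeps the compressed size small despite 12 effective layers.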
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
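With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. The head-to-KV-head mapping:

```python
# Grouped-query attention: each KV head is shared by
# HEADS // KV_HEADS consecutive query heads.
HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS                      # 2 query heads per KV head
kv_for_head = [h // GROUP for h in range(HEADS)]  # [0,0,1,1,2,2,3,3]
```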
QK-norm
Applies RMSNorm to Q and K before RoPE.
parameters: null
logit softcapping
Uses tanh-based softcapping on logits.
parameters: {"cap":30}
RoPE
Uses NTK-aware RoPE base scaling with YaRN-style sequence length warmup.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_wise_rms_normalization":true,"newton_schulz_orthogonalization":true}
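A sketch of the NorMuon update, assuming the standard quintic Newton-Schulz coefficients from the public Muon reference implementation; the PR only names the two components (orthogonalization, then per-neuron row-wise RMS normalization):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration approximating the orthogonal
    factor of G (coefficients as in the Muon reference; assumed here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def normuon_update(G):
    """Orthogonalize the (momentum-averaged) gradient, then RMS-normalize
    each row so every neuron receives an update of equal RMS magnitude."""
    O = newton_schulz(G)
    rms = np.sqrt((O * O).mean(axis=1, keepdims=True)) + 1e-7
    return O / rms
```

The row-wise normalization is the "Nor" part: after orthogonalization, per-neuron update magnitudes are equalized before the learning rate is applied.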
Compression
zlib
level: 9
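The 2-bit-packed ternary buffer is low-entropy, so zlib at level 9 shrinks it further for the reported artifact size. A sketch with a synthetic buffer standing in for real packed weights:

```python
import zlib

# Illustrative packed buffer: 4096 bytes of 2-bit-packed ternary values.
packed = bytes([0b10_01_00_10] * 4096)
compressed = zlib.compress(packed, level=9)  # level 9, as in the PR
restored = zlib.decompress(compressed)       # lossless round-trip
```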
Evaluation
sliding window eval
parameters: {"stride":"seq_len // 2","skip_cold_start_tokens":true}
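With stride `seq_len // 2`, each window re-reads the previous half as warm context and scores only the fresh tokens, so every token is evaluated with at least half a window of context. A sketch of the window schedule; exact boundary handling is an assumption:

```python
def sliding_windows(n_tokens, seq_len=1024):
    """Yield (start, end, first_scored) windows. Overlapping halves are
    cold-start context only; each token is scored exactly once."""
    stride = seq_len // 2
    start = 0
    while start < n_tokens:
        end = min(start + seq_len, n_tokens)
        first_scored = start if start == 0 else start + stride
        yield start, end, first_scored
        if end == n_tokens:
            break
        start += stride
```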
Initialization
proj zero-init
Output projections of attention and MLP are zero-initialized so each block starts as the identity.
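With the output projection zero-initialized, the residual branch contributes `0 * f(x)` at step 0, so the block is exactly the identity; training then grows the projection away from zero. A toy sketch with a scalar stand-in for the branch:

```python
def block(x, w_out=0.0):
    """Residual block with zero-init output projection (illustrative)."""
    hidden = 2.0 * x + 1.0       # stand-in for the attention/MLP branch
    return x + w_out * hidden    # w_out = 0 at init -> identity
```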
resid_mix
Learnable per-block mixing of current hidden state with original embedding, initialized to [1, 0].
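The [1, 0] initialization makes the mix a pass-through of the hidden state, while training can learn to re-inject the original embedding. A minimal sketch (the scalar form is illustrative; per-block parameters would be learned tensors):

```python
def resid_mix(hidden, embedding, alpha=1.0, beta=0.0):
    """Learnable per-block blend of the current hidden state with the
    original token embedding; [alpha, beta] initialized to [1, 0]."""
    return alpha * hidden + beta * embedding
```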
LR Schedule
linear warmup + constant + cosine cooldown
parameters: {"warmup_steps":100,"cooldown_steps":2000,"final_lr_multiplier":0.1}
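The three-phase multiplier implied by these parameters can be sketched as below; the total step count is illustrative, and the cooldown decays the peak LR to the 0.1x floor:

```python
import math

def lr_multiplier(step, total_steps, warmup=100, cooldown=2000, final=0.1):
    """Linear warmup -> constant -> cosine cooldown to final * peak LR."""
    if step < warmup:
        return step / warmup                       # linear warmup
    if step < total_steps - cooldown:
        return 1.0                                 # constant plateau
    t = (step - (total_steps - cooldown)) / cooldown  # 0 -> 1 over cooldown
    return final + (1.0 - final) * 0.5 * (1.0 + math.cos(math.pi * t))
```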
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Other
other
Sequence length warmup from 128 to 1024 over 2000 steps with NTK-aware RoPE base scaling (YaRN-style).
parameters: {"start_length":128,"end_length":1024,"warmup_steps":2000}
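A sketch of the two coupled schedules. The warmup is a straightforward linear ramp; for the RoPE side I use the standard NTK-aware rule base' = base * s^(d/(d-2)), where the reference length, head dimension, and applying the rule during warmup (s < 1) are assumptions not stated in the PR:

```python
def seq_len_at(step, start=128, end=1024, warmup_steps=2000):
    """Linear sequence-length warmup: 128 -> 1024 over 2000 steps."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step // warmup_steps

def ntk_rope_base(seq_len, base=10000.0, ref_len=1024, head_dim=64):
    """NTK-aware RoPE base scaling with s = seq_len / ref_len.
    ref_len and head_dim are illustrative assumptions."""
    s = seq_len / ref_len
    return base * s ** (head_dim / (head_dim - 2))
```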

Novel Contributions

  • BitNet b1.58 ternary quantization with packed 2-bit weights and zlib compression
  • Depth recurrence with 4 unique transformer blocks reused 3 times for 12 effective layers
  • U-Net style skip connections across recurrent block passes
  • Learnable resid_mix parameter to blend recurrent hidden state with original embedding
  • NorMuon optimizer with per-neuron row-wise RMS normalization after Newton-Schulz orthogonalization
  • Sequence length warmup combined with YaRN / NTK-aware RoPE scaling
  • Sliding-window evaluation with cold-start token skipping
  • QK-norm and logit softcapping