PR #997
Non-record: 24.7M params · int6 · Binary U-Net/SmearGate/BigramHash · 1.5hr · RTX 5060 Ti 16GB
by randy06122001-boop · View on GitHub
val_bpb
1.4182
Architecture
Transformer
Optimizer
Muon
Artifact Size
11.63MB
Training Techniques
Quantization
int6
bits: 6
scope: block weights
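The card records only the bit width (6) and scope (block weights), not the quantization method. A minimal sketch, assuming symmetric round-to-nearest quantization with a max-abs per-tensor scale (the exact scheme used in the PR is unknown):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric round-to-nearest int6 quantization (values in [-31, 31])."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With this scheme the worst-case reconstruction error per weight is half a quantization step (scale / 2).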
Architecture
U-Net skip connections
10-layer U-Net-style transformer with 5 encoder and 5 decoder blocks
parameters: {"layers":10,"encoder_blocks":5,"decoder_blocks":5}
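The encoder/decoder split above can be sketched as a stack of saved activations: each encoder block's output is pushed, and each decoder block pops the matching one and adds it back (last-in, first-out). The block internals here are a stand-in (one tanh layer with a residual), not the PR's actual transformer block:

```python
import numpy as np

def block(x, w):
    # stand-in for one transformer block: residual + pointwise nonlinearity
    return x + np.tanh(x @ w)

def unet_forward(x, enc_weights, dec_weights):
    """5 encoder blocks save their outputs; 5 decoder blocks pop and add
    them back, pairing encoder block i with decoder block (n - 1 - i)."""
    skips = []
    for w in enc_weights:
        x = block(x, w)
        skips.append(x)
    for w in dec_weights:
        x = x + skips.pop()  # U-Net skip connection
        x = block(x, w)
    return x
```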
SmearGate
Causal blending of token embeddings with previous context
parameters: null
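The PR lists no parameters for SmearGate, so its exact form is unknown. One plausible reading of "causal blending with previous context" is a learned per-channel gate that mixes in the previous token's embedding; everything below is that hypothetical form:

```python
import numpy as np

def smear_gate(emb: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the one before it (causal smear).
    emb: (T, D) embeddings; gate_logits: (D,) learned gate (hypothetical form).
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid gate in (0, 1)
    prev = np.roll(emb, 1, axis=0)
    prev[0] = 0.0                           # no context before the first token
    return emb + g * prev
```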
BigramHash
4096-bucket hash embedding for consecutive token pairs
parameters: {"buckets":4096}
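A hash embedding maps each consecutive token pair to one of 4096 buckets and looks up a learned vector, giving cheap bigram features without a full 1024 × 1024 pair table. The hash function below is an illustrative choice; the PR does not specify one:

```python
import numpy as np

BUCKETS = 4096

def bigram_bucket(prev_tok: int, tok: int, buckets: int = BUCKETS) -> int:
    # simple multiplicative hash of the (previous, current) token pair
    return (prev_tok * 1_000_003 + tok) % buckets

def bigram_features(tokens, table: np.ndarray) -> np.ndarray:
    """table: (4096, D) learned hash-embedding table -> (T, D) features."""
    out = np.zeros((len(tokens), table.shape[1]), dtype=table.dtype)
    for t in range(1, len(tokens)):
        out[t] = table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out
```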
MLP3x
3x MLP expansion with ReLU² activation
parameters: {"hidden":1536}
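With the 512-dimensional model, the 3x expansion gives the 1536 hidden units listed above; ReLU² (squared ReLU) is the stated activation. A minimal sketch in the standard two-matrix MLP shape:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0) ** 2  # ReLU-squared activation

def mlp3x(x, w_in, w_out):
    """x: (T, 512) -> hidden (T, 1536) -> (T, 512)."""
    return relu2(x @ w_in) @ w_out
```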
GQA
Grouped query attention with 4 KV heads
parameters: {"heads":8,"kv_heads":4,"dimension":512}
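With 8 query heads and 4 KV heads at dimension 512, the head dimension is 64 and each KV head is shared by 2 query heads, halving the KV cache. A minimal causal GQA forward pass (single sequence, no batching; a sketch, not the PR's implementation):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1)."""
    T, D = x.shape
    hd = D // n_heads                       # 512 / 8 = 64 per head
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv, hd)       # wk: (512, 256)
    v = (x @ wv).reshape(T, n_kv, hd)
    group = n_heads // n_kv
    mask = np.tril(np.ones((T, T), dtype=bool))
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                     # map query head -> shared KV head
        s = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        s = np.where(mask, s, -np.inf)      # causal mask
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        out[:, h] = a @ v[:, kv]
    return out.reshape(T, D)
```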
weight tying
Tied input/output embedding over a 1024-token vocabulary
parameters: {"vocab_size":1024}
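Weight tying reuses the 1024 × 512 token-embedding matrix as the output projection, saving about 0.5M parameters at this scale:

```python
import numpy as np

VOCAB, DIM = 1024, 512
rng = np.random.default_rng(0)
emb = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float32)

def embed(tokens):
    return emb[tokens]        # input embedding lookup

def logits(h):
    return h @ emb.T          # same matrix reused as the output head
```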
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_steps":5}
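Muon orthogonalizes each momentum/update matrix with a few Newton-Schulz iterations; the 5 steps listed above match the reference implementation's default. A minimal NumPy sketch using the reference quintic coefficients (not the PR's code):

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    driving its singular values toward 1 (coefficients from the Muon write-up)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```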
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"scalar parameters and embeddings"}
Weight Averaging
SWA
parameters: {"checkpoints_averaged":20,"phase":"warmdown"}
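SWA here averages the weights of 20 checkpoints taken during the warmdown phase, and the averaged model is what gets evaluated. The averaging itself is just a uniform mean over each parameter tensor:

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average over a list of parameter dicts (name -> array)."""
    n = len(checkpoints)
    return {name: sum(ck[name] for ck in checkpoints) / n
            for name in checkpoints[0]}
```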
Initialization
OrthoInit
Orthogonal initialization for all matrix weights
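Orthogonal initialization typically draws a Gaussian matrix and keeps the Q factor of its QR decomposition, so the rows (or columns) start exactly orthonormal. A sketch of the standard recipe (the PR does not show its exact routine):

```python
import numpy as np

def orthogonal_init(shape, rng):
    """Orthogonal init via QR of a Gaussian matrix (semi-orthogonal
    when the matrix is not square)."""
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity for uniformity
    return q if rows >= cols else q.T
```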
Compression
zstd
level: 22
Novel Contributions
- Int6 quantization for block weights
- Binary U-Net style transformer with 10 layers
- SmearGate causal embedding blending
- BigramHash token-pair hash embeddings
- Muon optimization with SWA
- ReLU² MLP expansion
- Tied embeddings with GQA