PR #170

open

Record: Int6 QAT + SmearGate + Muon WD (val_bpb=1.1669)

by baudrillardsgh0st
val_bpb: 1.1669
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.77 MB

Training Techniques

Quantization
STE QAT (bits: 6, scope: all weights)
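A minimal sketch of the fake-quantization forward pass this describes: int6 quantization-aware training with per-row symmetric scaling. The rounding details are assumptions; in QAT, the backward pass treats this op as the identity (the straight-through estimator).

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Fake-quantize a weight matrix to int6 with per-row symmetric scaling.

    Symmetric int6 range: [-31, 31] (2**(6-1) - 1 = 31). During training,
    gradients pass straight through (STE), so the model learns weights
    that survive rounding to 63 levels per row.
    """
    qmax = 2 ** (6 - 1) - 1                        # 31
    # Per-row scale: the largest |w| in each row maps to qmax.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard empty/zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer int6 levels
    return q * scale                               # dequantized fp weights

w = np.random.randn(4, 8).astype(np.float32)
w_q = fake_quant_int6(w)
```

Per-row (rather than per-tensor) scaling keeps the quantization error of each row bounded by half a level of that row's own range.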
Compression
zstd (level: 22)
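The storage trick, sketched below: int6 levels sit in int8 containers, wasting 2 bits per value, but the low entropy of the payload lets a strong entropy coder reclaim most of the slack. The PR uses zstd at level 22; stdlib zlib stands in here so the sketch is self-contained.

```python
import zlib
import numpy as np

# Quantized int6 levels occupy [-31, 31]; int8 containers keep decode
# trivial, and compression recovers the 2 unused bits per value.
levels = np.clip(np.round(np.random.randn(1024, 256) * 10), -31, 31).astype(np.int8)

raw = levels.tobytes()
# The PR compresses with zstd at level 22; zlib is a stdlib stand-in.
compressed = zlib.compress(raw, level=9)

restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(levels.shape)
```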
Architecture
SmearGate (513 parameters)
Learned gate blending current and previous token embeddings to add cheap bigram context.
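A sketch of what such a gate could look like. The 513-parameter count is consistent with a per-token scalar gate computed as a linear map from a d=512 embedding plus a bias; that layout is an assumption, as is the convex-blend form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w, b):
    """Blend each token embedding with its predecessor via a learned gate.

    x: (seq, d) token embeddings; w: (d,) and b: scalar give 513
    parameters at d=512 (an assumed layout, matching the listed count).
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                    # first token has no predecessor
    g = sigmoid(x @ w + b)[:, None]  # per-token scalar gate in (0, 1)
    return (1.0 - g) * x + g * prev  # convex blend adds bigram context

d = 512
x = np.random.randn(16, d)
w = np.zeros(d); b = 0.0             # gate = 0.5 everywhere at this init
y = smear_gate(x, w, b)
```

"Smearing" the previous token into the current position gives every layer above cheap bigram context for 513 parameters, versus the cost of widening attention.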
Tied embeddings
Input/output embeddings are tied, with fp16 passthrough to avoid compounding quantization errors.
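A sketch of the tied-embedding passthrough as described: one matrix serves as both input lookup and output projection, kept in fp16 and excluded from int6 quantization so its error is not incurred at both ends of the network. Shapes and the compute dtype are illustrative assumptions.

```python
import numpy as np

# One shared matrix: input embedding lookup and output projection.
# Kept in fp16 and skipped by the int6 fake-quant pass (passthrough),
# so quantization error does not compound at both ends of the model.
vocab, d = 1000, 64
emb = np.random.randn(vocab, d).astype(np.float16)  # fp16 passthrough

tokens = np.array([3, 17, 42])
h = emb[tokens].astype(np.float32)       # input side: embedding lookup
logits = h @ emb.T.astype(np.float32)    # output side: same weights, tied
```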
Optimizer
Muon (weight_decay: 0.01, decoupled_weight_decay: true)
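Decoupled weight decay, AdamW-style, shrinks weights directly rather than folding the decay into the gradient. A minimal sketch with a stand-in for Muon's orthogonalized momentum update (the actual Muon step is not reproduced here):

```python
import numpy as np

def decoupled_wd_step(p, update, lr=0.02, wd=0.01):
    """One optimizer step with decoupled weight decay.

    `update` stands in for Muon's orthogonalized momentum step. The decay
    term lr * wd * p is applied separately from it, so decay strength is
    independent of gradient scale -- which also keeps weight magnitudes
    small and friendly to per-row int6 scaling. lr is illustrative;
    wd=0.01 matches the PR.
    """
    p = p - lr * update   # the optimizer's own step
    p = p - lr * wd * p   # decoupled decay, applied separately
    return p

p = np.ones((4, 4))
p_new = decoupled_wd_step(p, update=np.zeros((4, 4)))
```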
Evaluation
Sliding window eval (stride: 64, batch_seqs: 32)
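A sketch of the window bookkeeping for sliding-window evaluation: windows advance by the stride and only the final stride's worth of tokens is scored in each (the first window scores everything), so every token is predicted with close to full context. The scoring convention is an assumption; stride=64 matches the PR.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers `window` tokens; only positions from `score_from`
    to `end` contribute to the loss, so tokens are evaluated with nearly
    the full training context rather than a cold chunk boundary.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        score_from = start if start == 0 else start + window - stride
        spans.append((start, start + window, score_from))
        start += stride
    return spans

spans = sliding_windows(4096, window=2048, stride=64)
```

The scored regions tile the sequence exactly once, so the bits-per-byte figure covers every token without double counting.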
Sequence Length
train_length: 2048
LR Schedule
Warmdown (warmdown_steps: 3000)
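One common warmdown shape, sketched below: hold the learning rate constant, then decay linearly to zero over the final warmdown steps. Only warmdown_steps=3000 comes from the PR; the base rate, total steps, and linear shape are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to zero over the final steps.

    warmdown_steps=3000 matches the PR; base_lr and the linear shape
    are illustrative.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lrs = [warmdown_lr(s, total_steps=10000) for s in range(10000)]
```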
Regularization
Weight decay (value: 0.01, decoupled: true)

Novel Contributions

  • Int6 QAT with STE fake quantization and per-row symmetric scaling
  • Int6 values stored in int8 containers with zstd-22 compression
  • SmearGate learned embedding-level bigram context
  • Decoupled Muon weight decay for improved generalization and quantization robustness
  • Sliding-window full-context evaluation
  • FP16 tied embedding passthrough