PR #65

RECORD · closed

Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556

by aquariouseworkman

val_bpb

1.1556

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.1MB

Training Techniques

Quantization

mixed int6/int8 STE QAT

bits: 6

scope: all 2D block weights int6; token embeddings int8/fp16 passthrough
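
The int6 path can be illustrated with a small fake-quantization sketch. It assumes symmetric per-row scaling onto the levels [-31, 31]; the PR summary does not say how scales are chosen, so that choice and the function name `fake_quant_int6` are illustrative only. In quantization-aware training with a straight-through estimator (STE), the forward pass would use the dequantized values while the backward pass treats the rounding step as the identity.

```python
def fake_quant_int6(row):
    """Quantize one weight row to 6-bit integer codes, then dequantize.

    Assumption: symmetric per-row scale; the PR does not specify this.
    """
    qmax = 31  # signed 6-bit range is [-32, 31]; only [-31, 31] is used here
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp into the 6-bit range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    # Values the forward pass would see during QAT.
    dequant = [c * scale for c in codes]
    return codes, dequant, scale


row = [0.5, -0.25, 0.1, 0.0]
codes, dequant, scale = fake_quant_int6(row)
```

Each dequantized weight differs from the original by at most half a quantization step (`scale / 2`), which is the error the model learns to tolerate during training.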

Architecture

SmearGate

Learned per-dimension gate blends current token embedding with previous token embedding before transformer layers.

parameters: {"dim":512}
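
A minimal sketch of the gating described above, in plain Python. It assumes a sigmoid gate per dimension and that the first token, which has no predecessor, is passed through unchanged; neither detail is stated in the PR summary, and `smear_gate` is a hypothetical name.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with the previous token's embedding.

    embeddings: list of T vectors, each of length D.
    gate_logits: D learned values (one gate per dimension).
    """
    gate = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = []
    for t, cur in enumerate(embeddings):
        # Assumption: position 0 blends with itself, i.e. stays unchanged.
        prev = embeddings[t - 1] if t > 0 else cur
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gate, cur, prev)])
    return out


blended = smear_gate([[1.0, 2.0], [3.0, 4.0]], gate_logits=[0.0, 0.0])
```

With all gate logits at zero the gate is 0.5 in every dimension, so the second token becomes the average of itself and the first.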

BigramHash

Hash-based bigram embedding over consecutive token pairs to inject token-pair context.

parameters: {"buckets":4096,"dim":128}
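
The bucket lookup can be sketched as follows. The multiplier-based hash is a placeholder, since the PR summary does not give the hash function, and the choice of bucket 0 for the first position is likewise an assumption. Each bucket id would index a row of a 4096 x 128 embedding table; how that 128-dimensional vector is merged into the model is not specified here.

```python
NUM_BUCKETS = 4096


def bigram_bucket(prev_token, cur_token, num_buckets=NUM_BUCKETS):
    """Map a (previous, current) token pair to a bucket id.

    Assumption: a simple multiplicative hash; the actual hash is not given.
    """
    return (prev_token * 1000003 + cur_token) % num_buckets


def bigram_ids(tokens, num_buckets=NUM_BUCKETS):
    """Return one bucket id per position in the token sequence."""
    if not tokens:
        return []
    # Assumption: position 0 has no predecessor and uses bucket 0.
    ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, num_buckets))
    return ids


ids = bigram_ids([17, 42, 42, 9001])
```

Distinct pairs may collide in the same bucket; with 4096 buckets that is expected and is the trade-off for a small table.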

MLP3x

Expanded MLP hidden size to 3x model dimension for greater capacity.

parameters: {"multiplier":3,"hidden_dim":1536}
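
As a rough size check, the arithmetic behind the listed hidden size can be written out. This assumes a plain two-matrix MLP (up and down projection) with no biases or gating, which the PR summary does not confirm; the byte figure is the raw 6-bit payload before any packing overhead or zstd compression.

```python
def mlp_param_count(dim=512, multiplier=3):
    """Return (hidden_dim, weight count) for an up/down projection pair."""
    hidden = dim * multiplier
    # Assumption: two weight matrices, dim x hidden and hidden x dim, no biases.
    return hidden, 2 * dim * hidden


hidden, params = mlp_param_count()
raw_int6_bytes = params * 6 // 8  # 6 bits per weight

print(hidden, params, raw_int6_bytes)  # 1536 1572864 1179648
```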

tied embeddings

Input and output embeddings are tied.

parameters: null

KV head count

Grouped-query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
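
With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. The sketch below assumes the conventional contiguous grouping; the PR summary does not state which query heads share a KV head, and the function name is illustrative.

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Return the index of the KV head used by a given query head."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    # Assumption: consecutive query heads share one KV head.
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV head count halves the size of the key and value projections relative to full multi-head attention.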

U-Net skip connections

Encoder-decoder style skip connections between corresponding transformer layers.

parameters: {"layers":9}
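
One way to pair layers in a 9-layer stack is sketched below. It assumes the first `num_layers // 2` layers act as the encoder, the rest as the decoder, and that skips are matched last-in, first-out; the PR summary only says that corresponding layers are connected, so the exact split is an assumption.

```python
def unet_skip_pairs(num_layers=9):
    """Return (encoder_layer, decoder_layer) index pairs for skip connections."""
    num_encoder = num_layers // 2
    num_decoder = num_layers - num_encoder
    # Encoder outputs are pushed in order and popped in reverse by the decoder.
    stack = list(range(num_encoder))
    pairs = []
    for i in range(num_decoder):
        decoder_layer = num_encoder + i
        if stack:
            pairs.append((stack.pop(), decoder_layer))
    return pairs


pairs = unet_skip_pairs(9)
print(pairs)  # [(3, 4), (2, 5), (1, 6), (0, 7)]
```

Under this split the final layer (index 8) receives no skip input, because there are four encoder layers and five decoder layers.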

Optimizer

Muon

weight_decay: 0.01

momentum: 0.99

other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
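
The momentum warmup parameters suggest a ramp from 0.92 to 0.99 over the first 1500 steps. The sketch below assumes linear interpolation, which the PR summary does not state; `backend_steps` is not modeled here.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Return the momentum value to use at a given training step."""
    # Assumption: linear ramp, then held constant at `end`.
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    return (1.0 - frac) * start + frac * end


m_start = muon_momentum(0)
m_mid = muon_momentum(750)
m_late = muon_momentum(5000)
```

Starting with lower momentum makes early updates respond faster to fresh gradients, while the later value of 0.99 averages over a longer history.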

Initialization

OrthoInit

Orthogonal initialization for non-zero-init linear weights.
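
To show what an orthogonal initial weight looks like, here is a dependency-free sketch that orthonormalizes random Gaussian rows with modified Gram-Schmidt. This is a stand-in for illustration: the PR summary does not say how the matrices are generated, and frameworks typically use a QR decomposition (for example `torch.nn.init.orthogonal_` in PyTorch).

```python
import random


def orthogonal_rows(rows, cols, seed=0):
    """Return `rows` orthonormal vectors of length `cols` (needs rows <= cols)."""
    if rows > cols:
        raise ValueError("this sketch requires rows <= cols")
    rng = random.Random(seed)
    basis = []
    while len(basis) < rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        # Remove the components along every vector accepted so far.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the rare degenerate draw
            basis.append([x / norm for x in v])
    return basis


q = orthogonal_rows(4, 8)
```

Rows of such a matrix have unit norm and zero pairwise dot products, so the layer initially preserves the length of vectors lying in the span of its rows.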

Evaluation

sliding window eval

parameters: {"stride":64,"context_length":1024}
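
The window layout implied by a stride of 64 and a context length of 1024 can be sketched as below. It assumes the first window scores all of its tokens and every later window scores only the tokens not yet covered; the PR summary does not spell out this bookkeeping, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_tokens, context_length=1024, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions score_from..end-1
    contribute to the loss, so each token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_length, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


windows = sliding_windows(2048)
```

After the first window, each scored token has at least 960 tokens of preceding context, at the cost of roughly 16 times more forward passes than non-overlapping evaluation.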

Sequence Length

sequence_length

train_length: 1024

eval_length: 1024

LR Schedule

linear warmup + warmdown

parameters: {"warmup_steps":20,"warmdown_iters":3000}
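
A sketch of the schedule shape, expressed as a multiplier on the base learning rate. The total step count is not given in the PR summary, so `total_steps` is a hypothetical argument, and the linear decay to zero at the end is an assumption.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Return the learning-rate multiplier at a given step."""
    if step < warmup_steps:
        # Linear warmup, reaching 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    warmdown_start = max(total_steps - warmdown_iters, warmup_steps)
    if step >= warmdown_start:
        # Assumption: linear decay to zero over the final warmdown_iters steps.
        return max(total_steps - step, 0) / max(total_steps - warmdown_start, 1)
    return 1.0


scales = [lr_scale(s, total_steps=10000) for s in (0, 19, 5000, 8500, 10000)]
print(scales)  # [0.05, 1.0, 1.0, 0.5, 0.0]
```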

Regularization

weight decay

parameters: {"weight_decay":0.01}

Compression

zstd

level: 22

Novel Contributions

SmearGate embedding that blends current and previous token embeddings
Bigram hash embedding for direct token-pair features
Orthogonal weight initialization combined with Muon optimization
Mixed int6/int8 quantization-aware training with STE
Wider 3x MLP expansion enabled by quantization savings
U-Net style skip connections in a transformer
Sliding window evaluation with stride 64