PR #289

open

SmearGate + BigramHash + Int6 + SWA + U-Net Skips (1.1518 BPB)

by integrate-your-mind
val_bpb
1.1518
Architecture
GPT
Optimizer
Muon
Artifact Size
15.2MB

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights
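The contributions list below describes this as per-row int6 quantization. A minimal sketch of what that could look like, assuming a symmetric absmax scale per weight row (the exact scale convention is an assumption):

```python
# Per-row symmetric int6 quantization sketch (signed range [-32, 31]).
# Assumes a simple absmax scale per row; illustrative, not the PR's exact code.
def quantize_row_int6(row):
    """Quantize one weight row to int6 with a per-row absmax scale."""
    absmax = max(abs(v) for v in row) or 1.0
    scale = absmax / 31.0  # map the largest magnitude to the int6 max
    q = [max(-32, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize_row_int6(q, scale):
    """Recover approximate float weights from int6 codes and the row scale."""
    return [v * scale for v in q]
```

After quantization the int6 codes would be bit-packed and zstd-compressed (level 22 per the config below), which is where most of the 15.2MB artifact size comes from.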
Compression
zstd
level: 22
Architecture
MLP3x
Expanded MLP hidden size to 3x the model dimension using relu² activation.
parameters: {"hidden":1536,"multiplier":3}
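Given hidden=1536 at multiplier 3, the model dimension is 512. A toy sketch of the forward pass with the relu² activation (weight layout here is illustrative):

```python
# Sketch of a 3x-expanded MLP with relu^2 activation. The listed config
# implies model dim 512 -> hidden 1536; tiny toy shapes are used below.
def relu_sq(x):
    """relu^2: zero for negative inputs, squared value otherwise."""
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """x: input vector; w_in: list of 3d input columns; w_out: list of d output columns."""
    hidden = [relu_sq(sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w_out]
```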
SmearGate
Learned token-predecessor blending at the input to inject lightweight bigram context.
parameters: null
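No parameters are listed for SmearGate, so the gate parameterization below is an assumption: each token embedding is blended with its predecessor through a learned sigmoid gate computed from the current token.

```python
import math

# Hedged sketch of "smear" gating: blend each token embedding with its
# predecessor via a learned gate. Producing the gate logit from a dot
# product with a learned vector is an assumption, not the PR's exact form.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(embs, gate_w):
    """embs: list of embedding vectors; gate_w: learned vector for the gate logit."""
    out = [list(embs[0])]  # first token has no predecessor to smear in
    for t in range(1, len(embs)):
        g = sigmoid(sum(a * b for a, b in zip(embs[t], gate_w)))
        out.append([(1 - g) * c + g * p for c, p in zip(embs[t], embs[t - 1])])
    return out
```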
BigramHash
Hashed adjacent token-pair embedding table for bigram context.
parameters: {"buckets":2048,"dimension":128}
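A sketch of the hashed bigram lookup with the listed 2048 buckets and 128-dim rows; the hash mixing constants are illustrative, not the PR's actual hash:

```python
# Hashed bigram embedding lookup: the (prev, cur) token pair is hashed into
# one of 2048 buckets, each holding a 128-dim embedding row (per the config).
BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_tok, cur_tok, buckets=BUCKETS):
    """Map an adjacent token pair to a bucket index; hash choice is illustrative."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF  # simple multiplicative mix
    return h % buckets

def bigram_embedding(table, prev_tok, cur_tok):
    """table: list of BUCKETS embedding rows; returns the row for this pair."""
    return table[bigram_bucket(prev_tok, cur_tok)]
```

Collisions across the 2048 buckets are expected; the table only needs to capture the most useful bigram statistics, not all pairs distinctly.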
U-Net skip connections
Encoder-to-decoder skip connections with learned per-dimension weights.
parameters: {"layers":11}
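A sketch of the skip combination, assuming the additive form dec + w * enc with one learned weight per dimension (the exact combination rule is an assumption):

```python
# U-Net style skip with learned per-dimension weights: each decoder layer
# adds the matching encoder activation, scaled elementwise by a learned vector.
def unet_skip(dec_x, enc_x, skip_w):
    """Combine decoder input with an encoder skip, weighted per dimension."""
    return [d + w * e for d, w, e in zip(dec_x, skip_w, enc_x)]
```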
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
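The momentum warmup parameters above imply a ramp from 0.92 to the final 0.99 over the first 1500 steps; a linear schedule is assumed:

```python
# Sketch of the listed momentum warmup: linearly ramp Muon's momentum from
# 0.92 to 0.99 over the first 1500 steps, then hold it constant.
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + frac * (end - start)
```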
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embedding and scalar parameters"}
Weight Averaging
SWA
parameters: {"snapshots":7,"every_steps":200}
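With 7 snapshots taken every 200 steps, SWA reduces to a running mean over parameter snapshots; a minimal sketch:

```python
# Stochastic weight averaging sketch: maintain a running mean of parameter
# vectors across snapshots (7 snapshots, every 200 steps per the config).
def swa_update(mean, new_params, n_snapshots):
    """Fold snapshot number n_snapshots (1-based) into the running mean."""
    if mean is None:
        return list(new_params)
    return [m + (p - m) / n_snapshots for m, p in zip(mean, new_params)]
```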
Evaluation
sliding window eval
parameters: {"stride":64}
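Sliding-window eval with stride 64 advances a fixed context window 64 tokens at a time and scores only the newly exposed tokens, so each token is predicted with near-maximal left context. A sketch of the window bookkeeping, assuming a 1024-token window matching the training length:

```python
# Sliding-window evaluation sketch: slide a fixed context window across the
# sequence and score only the last `stride` tokens of each window.
def eval_windows(seq_len, window=1024, stride=64):
    """Return (start, end, score_from) triples covering positions [0, seq_len)."""
    spans = []
    pos = 0
    while pos < seq_len:
        start = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))  # score tokens in [pos, end)
        pos = end
    return spans
```

This costs roughly window/stride (here 16x) more forward passes than non-overlapping chunks, which is the usual trade for a tighter BPB estimate.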
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
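A sketch of applying a rank-8 LoRA delta at test time: the frozen weight W is adapted as W plus a scaled low-rank product, with only the two small factors updated during TTT. The shapes and scaling convention are assumptions:

```python
# LoRA test-time-training sketch: adapt a frozen weight W with a low-rank
# delta, W + (alpha / r) * A @ B, updating only A and B at eval time.
def lora_apply(W, A, B, alpha=1.0, rank=8):
    """W: d_out x d_in; A: d_out x r; B: r x d_in (plain nested lists)."""
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    out = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            out[i][j] += scale * sum(A[i][k] * B[k][j] for k in range(len(B)))
    return out
```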
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
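The warmdown schedule above can be sketched as a 20-step warmup to the peak LR, a flat phase, then a linear decay to zero over the final 3000 iterations (the linear shape of both ramps is an assumption):

```python
# LR schedule sketch matching the listed config: 20 warmup steps to peak,
# constant in the middle, linear "warmdown" to zero over the last 3000 iters.
def lr_at(step, total_steps, peak_lr, warmup_steps=20, warmdown_iters=3000):
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return peak_lr * remaining / warmdown_iters
    return peak_lr
```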
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}

Novel Contributions

  • SmearGate learned token-predecessor blending at the input
  • BigramHash embedding with 2048 buckets for token-pair context
  • Per-row int6 quantization of MLP and attention weights
  • U-Net style skip connections with learned per-dimension weights
  • 3x MLP expansion with relu² activation
  • SWA over 7 snapshots, taken every 200 steps during warmdown
  • Sliding-window evaluation with stride 64 as the primary score
  • TTT LoRA evaluation as an alternative inference-time adaptation method