PR #289

open

SmearGate + BigramHash + Int6 + SWA + U-Net Skips (1.1518 BPB)

by integrate-your-mind
val_bpb
1.1518
Architecture
GPT
Optimizer
Muon
Artifact Size
15.2MB

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights
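The contributions list below describes this as per-row int6 quantization. A minimal sketch of what that could look like, assuming a symmetric absmax scale per weight row (the exact scale convention is an assumption):

```python
# Per-row symmetric int6 quantization sketch (signed range [-32, 31]).
# Assumes a simple absmax scale per row; illustrative, not the PR's exact code.
def quantize_row_int6(row):
    """Quantize one weight row to int6 with a per-row absmax scale."""
    absmax = max(abs(v) for v in row) or 1.0
    scale = absmax / 31.0  # map the largest magnitude to the int6 max
    q = [max(-32, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize_row_int6(q, scale):
    """Recover approximate float weights from int6 codes and the row scale."""
    return [v * scale for v in q]
```

After quantization the int6 codes would be bit-packed and zstd-compressed (level 22 per the config below), which is where most of the 15.2MB artifact size comes from.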
Compression
zstd
level: 22
Architecture
MLP3x
Expanded MLP hidden size to 3x the model dimension using relu² activation.
parameters: {"hidden":1536,"multiplier":3}
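Given hidden=1536 at multiplier 3, the model dimension is 512. A toy sketch of the forward pass with the relu² activation (weight layout here is illustrative):

```python
# Sketch of a 3x-expanded MLP with relu^2 activation. The listed config
# implies model dim 512 -> hidden 1536; tiny toy shapes are used below.
def relu_sq(x):
    """relu^2: zero for negative inputs, squared value otherwise."""
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """x: input vector; w_in: list of 3d input columns; w_out: list of d output columns."""
    hidden = [relu_sq(sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w_out]
```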
SmearGate
Learned token-predecessor blending at the input to inject lightweight bigram context.
parameters: null
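No parameters are listed for SmearGate, so the gate parameterization below is an assumption: each token embedding is blended with its predecessor through a learned sigmoid gate computed from the current token.

```python
import math

# Hedged sketch of "smear" gating: blend each token embedding with its
# predecessor via a learned gate. Producing the gate logit from a dot
# product with a learned vector is an assumption, not the PR's exact form.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(embs, gate_w):
    """embs: list of embedding vectors; gate_w: learned vector for the gate logit."""
    out = [list(embs[0])]  # first token has no predecessor to smear in
    for t in range(1, len(embs)):
        g = sigmoid(sum(a * b for a, b in zip(embs[t], gate_w)))
        out.append([(1 - g) * c + g * p for c, p in zip(embs[t], embs[t - 1])])
    return out
```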
BigramHash
Hashed adjacent token-pair embedding table for bigram context.
parameters: {"buckets":2048,"dimension":128}
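A sketch of the hashed bigram lookup with the listed 2048 buckets and 128-dim rows; the hash mixing constants are illustrative, not the PR's actual hash:

```python
# Hashed bigram embedding lookup: the (prev, cur) token pair is hashed into
# one of 2048 buckets, each holding a 128-dim embedding row (per the config).
BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_tok, cur_tok, buckets=BUCKETS):
    """Map an adjacent token pair to a bucket index; hash choice is illustrative."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF  # simple multiplicative mix
    return h % buckets

def bigram_embedding(table, prev_tok, cur_tok):
    """table: list of BUCKETS embedding rows; returns the row for this pair."""
    return table[bigram_bucket(prev_tok, cur_tok)]
```

Collisions across the 2048 buckets are expected; the table only needs to capture the most useful bigram statistics, not all pairs distinctly.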
U-Net skip connections
Encoder-to-decoder skip connections with learned per-dimension weights.
parameters: {"layers":11}
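A sketch of the skip combination, assuming the additive form dec + w * enc with one learned weight per dimension (the exact combination rule is an assumption):

```python
# U-Net style skip with learned per-dimension weights: each decoder layer
# adds the matching encoder activation, scaled elementwise by a learned vector.
def unet_skip(dec_x, enc_x, skip_w):
    """Combine decoder input with an encoder skip, weighted per dimension."""
    return [d + w * e for d, w, e in zip(dec_x, skip_w, enc_x)]
```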
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
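The momentum warmup parameters above imply a ramp from 0.92 to the final 0.99 over the first 1500 steps; a linear schedule is assumed:

```python
# Sketch of the listed momentum warmup: linearly ramp Muon's momentum from
# 0.92 to 0.99 over the first 1500 steps, then hold it constant.
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + frac * (end - start)
```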
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embedding and scalar parameters"}
Weight Averaging
SWA
parameters: {"snapshots":7,"every_steps":200}
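With 7 snapshots taken every 200 steps, SWA reduces to a running mean over parameter snapshots; a minimal sketch:

```python
# Stochastic weight averaging sketch: maintain a running mean of parameter
# vectors across snapshots (7 snapshots, every 200 steps per the config).
def swa_update(mean, new_params, n_snapshots):
    """Fold snapshot number n_snapshots (1-based) into the running mean."""
    if mean is None:
        return list(new_params)
    return [m + (p - m) / n_snapshots for m, p in zip(mean, new_params)]
```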
Evaluation
sliding window eval
parameters: {"stride":64}
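Sliding-window eval with stride 64 advances a fixed context window 64 tokens at a time and scores only the newly exposed tokens, so each token is predicted with near-maximal left context. A sketch of the window bookkeeping, assuming a 1024-token window matching the training length:

```python
# Sliding-window evaluation sketch: slide a fixed context window across the
# sequence and score only the last `stride` tokens of each window.
def eval_windows(seq_len, window=1024, stride=64):
    """Return (start, end, score_from) triples covering positions [0, seq_len)."""
    spans = []
    pos = 0
    while pos < seq_len:
        start = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))  # score tokens in [pos, end)
        pos = end
    return spans
```

This costs roughly window/stride (here 16x) more forward passes than non-overlapping chunks, which is the usual trade for a tighter BPB estimate.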
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
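A sketch of applying a rank-8 LoRA delta at test time: the frozen weight W is adapted as W plus a scaled low-rank product, with only the two small factors updated during TTT. The shapes and scaling convention are assumptions:

```python
# LoRA test-time-training sketch: adapt a frozen weight W with a low-rank
# delta, W + (alpha / r) * A @ B, updating only A and B at eval time.
def lora_apply(W, A, B, alpha=1.0, rank=8):
    """W: d_out x d_in; A: d_out x r; B: r x d_in (plain nested lists)."""
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    out = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            out[i][j] += scale * sum(A[i][k] * B[k][j] for k in range(len(B)))
    return out
```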
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
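The warmdown schedule above can be sketched as a 20-step warmup to the peak LR, a flat phase, then a linear decay to zero over the final 3000 iterations (the linear shape of both ramps is an assumption):

```python
# LR schedule sketch matching the listed config: 20 warmup steps to peak,
# constant in the middle, linear "warmdown" to zero over the last 3000 iters.
def lr_at(step, total_steps, peak_lr, warmup_steps=20, warmdown_iters=3000):
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return peak_lr * remaining / warmdown_iters
    return peak_lr
```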
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}

Novel Contributions

  • SmearGate learned token-predecessor blending at the input
  • BigramHash embedding with 2048 buckets for token-pair context
  • Per-row int6 quantization of MLP and attention weights
  • U-Net style skip connections with learned per-dimension weights
  • 3x MLP expansion with relu² activation
  • SWA over 7 snapshots, taken every 200 steps during warmdown
  • Sliding-window evaluation with stride 64 as the primary score
  • TTT LoRA evaluation as an alternative inference-time adaptation method