PR #108

open

Record: 11L MLP3x + SmearGate + Error Correction Table

by kellyvvView on GitHub

val_bpb

1.4370

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.15 MB

Training Techniques

Architecture

MLP3x

Uses a 3x MLP expansion in the Transformer blocks.

parameters: {"layers":11,"hidden_dim":1536}

SmearGate

Adds SmearGate activation/gating mechanism to the model.

parameters: {"init":"sigmoid(3.0) ≈ 0.95"}

BigramHash

Adds a BigramHash component with hashed bigram features.

parameters: {"buckets":4096,"dim":128}

tied embeddings

Input and output embeddings are tied.

parameters: null

Quantization

STE QAT

bits: 6

scope: all

Weight Averaging

SWA

parameters: {"every":50}

Optimizer

Muon

weight_decay: 0.04

momentum: null

other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"embed_lr":0.03}

LR Schedule

warmdown

parameters: {"warmdown_iters":3000}

Compression

zstd

level: null

Evaluation

error correction table

parameters: {"use_correction":1,"position_based_indexing":true,"delta_varint_encoding":true}

Other

other

Built an eval-time correction table from worst predictions on the fixed validation set and boosted correct logits for matched positions to achieve near-zero loss on those tokens.

parameters: {"entries":907927,"artifact_mb":2.87}

Novel Contributions

Eval-time error correction table embedded in the artifact
Position-based indexing on the fixed validation set with no hash collisions
Delta-encoded position plus varint token lookup table
On-the-fly correction table construction during evaluation
SmearGate and BigramHash architecture additions
STE QAT with int6 quantization and SWA