PR #187
openRecord: Pre-Enrichment + Encoder Recurrence + XSA + SmearGate + BigramHash (val_bpb=1.1629)
by Idan3011
val_bpb
1.1629
Architecture
U-Net Transformer
Optimizer
Muon + AdamW
Artifact Size
15.05 MB
Training Techniques
Architecture
BigramHash
Hash-table embedding for token bigrams, projected to the model dimension and added to the token embeddings before the residual stream.
parameters: {"table_size":"4096x64"}
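A minimal NumPy sketch of the BigramHash idea, using the stated 4096x64 table and assuming a 512-dim model; the actual hash function and framework code in the PR are unknown, so the multiplicative hash here is an illustrative choice:

```python
import numpy as np

TABLE_SIZE, EMB_DIM, MODEL_DIM = 4096, 64, 512  # "4096x64" table; model dim assumed 512

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, (TABLE_SIZE, EMB_DIM))  # learnable hash-table embedding
proj = rng.normal(0, 0.02, (EMB_DIM, MODEL_DIM))           # projection to model dimension

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Hash the (prev, cur) token pair into the table; exact hash is an assumption.
    return (prev_tok * 1000003 + cur_tok) % TABLE_SIZE

def bigram_embed(tokens):
    # One bigram embedding per position (first position has no predecessor),
    # projected to model dim; this is added to the token embeddings.
    out = np.zeros((len(tokens), MODEL_DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_hash(tokens[i - 1], tokens[i])] @ proj
    return out
```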
SmearGate
Per-dimension learnable gate blending each token with the previous token's embedding.
parameters: {"parameters":512}
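The 512 parameters suggest one gate value per model dimension. A minimal sketch, assuming a sigmoid-activated gate (the PR's exact parameterization is unknown):

```python
import numpy as np

MODEL_DIM = 512
gate = np.zeros(MODEL_DIM)  # 512 learnable logits, one per dimension

def smear_gate(x):
    # x: (seq, dim) token embeddings. Blend each token with its predecessor,
    # per dimension; the first token has no predecessor and blends with zeros.
    g = 1.0 / (1.0 + np.exp(-gate))  # per-dimension blend weight in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return (1 - g) * x + g * prev
```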
MLP3x
Uses a 3x MLP width configuration.
parameters: {"multiplier":3}
depth recurrence
Applies encoder recurrence by running the encoder blocks twice with RMS norm stabilization between passes.
parameters: {"passes":2,"encoder_layers":5,"decoder_layers":5}
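The recurrence can be sketched as running the same 5 encoder blocks for 2 passes, with an RMS norm between passes to keep activations at unit scale (framework details assumed):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def run_encoder(x, encoder_blocks, passes=2):
    # Reuse the same encoder blocks across passes (weight-tied depth recurrence),
    # re-normalizing between passes for stability.
    for p in range(passes):
        for block in encoder_blocks:
            x = block(x)
        if p < passes - 1:
            x = rms_norm(x)
    return x
```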
XSA
Exclusive Self Attention (XSA) removes the self-value bias from the attention output via an orthogonal projection, applied to the last 4 layers.
parameters: {"last_n_layers":4}
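The PR's exact XSA formulation is not given; a minimal sketch of the orthogonal-projection idea is to remove, from each token's attention output, the component along that token's own value vector:

```python
import numpy as np

def remove_self_value(attn_out, v, eps=1e-8):
    # attn_out, v: (seq, dim). Project each position's attention output onto
    # the subspace orthogonal to its own value vector, removing the
    # self-value component.
    coef = np.sum(attn_out * v, axis=-1, keepdims=True) / (
        np.sum(v * v, axis=-1, keepdims=True) + eps)
    return attn_out - coef * v
```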
pre-enrichment
Wider nonlinear embedding transformation before the residual stream: 512→768→512 with GELU and RMS norm.
parameters: {"input_dim":512,"hidden_dim":768,"output_dim":512}
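A minimal sketch of the 512→768→512 pre-enrichment block, using the tanh GELU approximation (the PR's exact GELU variant and weight init are assumptions):

```python
import numpy as np

D_IN, D_HID, D_OUT = 512, 768, 512

rng = np.random.default_rng(0)
w1 = rng.normal(0, 0.02, (D_IN, D_HID))
w2 = rng.normal(0, 0.02, (D_HID, D_OUT))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_enrich(x):
    # Wider nonlinear bottleneck applied to embeddings before the first block.
    return rms_norm(gelu(x @ w1) @ w2)
```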
Quantization
int6 QAT
bits: 6
scope: all
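Int6 QAT typically means fake-quantizing weights in the forward pass while gradients flow through unchanged (straight-through estimator). A sketch of symmetric per-tensor int6 fake quantization; the PR's actual scheme (per-channel vs. per-tensor, rounding mode) is unknown:

```python
import numpy as np

def fake_quant_int6(w):
    # 6 bits -> integer levels in [-32, 31]. Returns both the dequantized
    # weights used in the forward pass and the raw int codes.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale, q.astype(np.int8)
```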
Weight Averaging
EMA
parameters: {"decay":0.997}
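A minimal sketch of the EMA update with the stated decay of 0.997; the EMA copy of the weights, rather than the raw training weights, would be what gets quantized and evaluated:

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of model weights, updated once per step.
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]
    return ema_params
```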
Compression
lzma
level: null
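Packing the artifact can be sketched with the standard-library `lzma` module; since `level` is null in the record, the default preset is used here (serialization format is an assumption):

```python
import lzma
import numpy as np

def pack_artifact(int_weights: np.ndarray) -> bytes:
    # Serialize the quantized integer weights and lzma-compress them to fit
    # the artifact size limit.
    return lzma.compress(int_weights.tobytes())
```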
Evaluation
sliding window eval
parameters: {"stride":64}
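A sketch of sliding-window evaluation with stride 64: each window advances by the stride and only its last 64 tokens are scored, so every token is evaluated exactly once with up to a full window of left context (the scoring interface is an assumption):

```python
def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    # score_fn(context) -> one loss per token in context.
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx = tokens[max(0, end - window): end]  # up to `window` tokens of context
        losses = score_fn(ctx)
        total += sum(losses[-(end - start):])    # score only the new tokens
        count += end - start
    return total / count
```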
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3300}
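A sketch of the warmdown schedule: constant LR, then linear decay to zero over the final 3,300 iterations. The total step count for this run is not stated, so it is a parameter here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3300):
    # Constant learning rate until the warmdown window, then linear decay to 0.
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```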
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Initialization
overtone init
Non-standard initialization adapted from prior work.
Other
other
GELU pre-enrichment block before transformer layers.
parameters: {"bottleneck":"512->768->512"}
Novel Contributions
- GELU pre-enrichment with a wider 512→768→512 bottleneck before the transformer blocks
- 2x encoder recurrence applied only to the encoder half of a U-Net transformer architecture
- Exclusive Self Attention (XSA) on the last 4 layers to remove self-value bias
- SmearGate for token-to-previous-token embedding blending
- BigramHash token bigram embedding added to the input representation
- EMA replacing SWA to reduce quantization gap
- Int6 QAT with lzma compression to fit within the artifact limit