PR #187
openRecord: Pre-Enrichment + Encoder Recurrence + XSA + SmearGate + BigramHash (val_bpb=1.1629)
by Idan3011
val_bpb
1.1629
Architecture
U-Net Transformer
Optimizer
Muon + AdamW
Artifact Size
15.05 MB
Training Techniques
Architecture
BigramHash
Hash-table embedding for token bigrams, projected to the model dimension and added to the token embeddings before the residual stream.
parameters: {"table_size":"4096x64"}
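A minimal NumPy sketch of the BigramHash idea, using the stated 4096x64 table and assuming a 512-dim model; the actual hash function and framework code in the PR are unknown, so the multiplicative hash here is an illustrative choice:

```python
import numpy as np

TABLE_SIZE, EMB_DIM, MODEL_DIM = 4096, 64, 512  # "4096x64" table; model dim assumed 512

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, (TABLE_SIZE, EMB_DIM))  # learnable hash-table embedding
proj = rng.normal(0, 0.02, (EMB_DIM, MODEL_DIM))           # projection to model dimension

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Hash the (prev, cur) token pair into the table; exact hash is an assumption.
    return (prev_tok * 1000003 + cur_tok) % TABLE_SIZE

def bigram_embed(tokens):
    # One bigram embedding per position (first position has no predecessor),
    # projected to model dim; this is added to the token embeddings.
    out = np.zeros((len(tokens), MODEL_DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_hash(tokens[i - 1], tokens[i])] @ proj
    return out
```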
SmearGate
Per-dimension learnable gate blending each token with the previous token's embedding.
parameters: {"parameters":512}
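The 512 parameters suggest one gate value per model dimension. A minimal sketch, assuming a sigmoid-activated gate (the PR's exact parameterization is unknown):

```python
import numpy as np

MODEL_DIM = 512
gate = np.zeros(MODEL_DIM)  # 512 learnable logits, one per dimension

def smear_gate(x):
    # x: (seq, dim) token embeddings. Blend each token with its predecessor,
    # per dimension; the first token has no predecessor and blends with zeros.
    g = 1.0 / (1.0 + np.exp(-gate))  # per-dimension blend weight in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return (1 - g) * x + g * prev
```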
MLP3x
Uses a 3x MLP width configuration.
parameters: {"multiplier":3}
depth recurrence
Applies encoder recurrence by running the encoder blocks twice with RMS norm stabilization between passes.
parameters: {"passes":2,"encoder_layers":5,"decoder_layers":5}
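The recurrence can be sketched as running the same 5 encoder blocks for 2 passes, with an RMS norm between passes to keep activations at unit scale (framework details assumed):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def run_encoder(x, encoder_blocks, passes=2):
    # Reuse the same encoder blocks across passes (weight-tied depth recurrence),
    # re-normalizing between passes for stability.
    for p in range(passes):
        for block in encoder_blocks:
            x = block(x)
        if p < passes - 1:
            x = rms_norm(x)
    return x
```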
XSA
Exclusive Self Attention (XSA) removes the self-value bias from the attention output via an orthogonal projection, applied to the last 4 layers.
parameters: {"last_n_layers":4}
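The PR's exact XSA formulation is not given; a minimal sketch of the orthogonal-projection idea is to remove, from each token's attention output, the component along that token's own value vector:

```python
import numpy as np

def remove_self_value(attn_out, v, eps=1e-8):
    # attn_out, v: (seq, dim). Project each position's attention output onto
    # the subspace orthogonal to its own value vector, removing the
    # self-value component.
    coef = np.sum(attn_out * v, axis=-1, keepdims=True) / (
        np.sum(v * v, axis=-1, keepdims=True) + eps)
    return attn_out - coef * v
```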
pre-enrichment
Wider nonlinear embedding transformation before the residual stream: 512→768→512 with GELU and RMS norm.
parameters: {"input_dim":512,"hidden_dim":768,"output_dim":512}
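A minimal sketch of the 512→768→512 pre-enrichment block, using the tanh GELU approximation (the PR's exact GELU variant and weight init are assumptions):

```python
import numpy as np

D_IN, D_HID, D_OUT = 512, 768, 512

rng = np.random.default_rng(0)
w1 = rng.normal(0, 0.02, (D_IN, D_HID))
w2 = rng.normal(0, 0.02, (D_HID, D_OUT))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_enrich(x):
    # Wider nonlinear bottleneck applied to embeddings before the first block.
    return rms_norm(gelu(x @ w1) @ w2)
```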
Quantization
int6 QAT
bits: 6
scope: all
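Int6 QAT typically means fake-quantizing weights in the forward pass while gradients flow through unchanged (straight-through estimator). A sketch of symmetric per-tensor int6 fake quantization; the PR's actual scheme (per-channel vs. per-tensor, rounding mode) is unknown:

```python
import numpy as np

def fake_quant_int6(w):
    # 6 bits -> integer levels in [-32, 31]. Returns both the dequantized
    # weights used in the forward pass and the raw int codes.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale, q.astype(np.int8)
```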
Weight Averaging
EMA
parameters: {"decay":0.997}
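A minimal sketch of the EMA update with the stated decay of 0.997; the EMA copy of the weights, rather than the raw training weights, would be what gets quantized and evaluated:

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of model weights, updated once per step.
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]
    return ema_params
```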
Compression
lzma
level: null
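Packing the artifact can be sketched with the standard-library `lzma` module; since `level` is null in the record, the default preset is used here (serialization format is an assumption):

```python
import lzma
import numpy as np

def pack_artifact(int_weights: np.ndarray) -> bytes:
    # Serialize the quantized integer weights and lzma-compress them to fit
    # the artifact size limit.
    return lzma.compress(int_weights.tobytes())
```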
Evaluation
sliding window eval
parameters: {"stride":64}
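A sketch of sliding-window evaluation with stride 64: each window advances by the stride and only its last 64 tokens are scored, so every token is evaluated exactly once with up to a full window of left context (the scoring interface is an assumption):

```python
def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    # score_fn(context) -> one loss per token in context.
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx = tokens[max(0, end - window): end]  # up to `window` tokens of context
        losses = score_fn(ctx)
        total += sum(losses[-(end - start):])    # score only the new tokens
        count += end - start
    return total / count
```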
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3300}
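A sketch of the warmdown schedule: constant LR, then linear decay to zero over the final 3,300 iterations. The total step count for this run is not stated, so it is a parameter here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3300):
    # Constant learning rate until the warmdown window, then linear decay to 0.
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```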
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Initialization
overtone init
Non-standard initialization adapted from prior work.
Other
other
GELU pre-enrichment block before transformer layers.
parameters: {"bottleneck":"512->768->512"}
Novel Contributions
- GELU pre-enrichment with a wider 512→768→512 bottleneck before the transformer blocks
- 2x encoder recurrence applied only to the encoder half of a U-Net transformer architecture
- Exclusive Self Attention (XSA) on the last 4 layers to remove self-value bias
- SmearGate for token-to-previous-token embedding blending
- BigramHash token bigram embedding added to the input representation
- EMA replacing SWA to reduce quantization gap
- Int6 QAT with lzma compression to fit within the artifact limit