PR #1122
openRecord: EngramLite + Gated Skips + Full GPTQ + FA3 — val_bpb 1.1146 (1-seed, 2 pending)
by icryo
val_bpb
1.1146
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.71 MB
Training Techniques
Architecture
EngramLite
Multi-head bigram+trigram hash embeddings with a learned sigmoid gate, replacing BigramHash.
parameters: {"buckets":8192,"heads":2,"orders":2,"dim_per_head":32}
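A minimal sketch of the EngramLite idea, using the listed hyperparameters (buckets=8192, heads=2, orders=2, dim_per_head=32). The multiplicative hash constants, initialization, and scalar gate parameterization are illustrative assumptions — the PR does not spell them out:

```python
import numpy as np

class EngramLite:
    """Multi-head bigram+trigram hash embeddings with a sigmoid gate (sketch)."""
    def __init__(self, buckets=8192, heads=2, dim_per_head=32, seed=0):
        rng = np.random.default_rng(seed)
        self.buckets = buckets
        # one table per (head, order) pair; the two orders are bigram and trigram
        self.tables = [rng.normal(0.0, 0.02, (buckets, dim_per_head))
                       for _ in range(heads * 2)]
        self.gate_logit = 0.0  # learned scalar gate; sigmoid(0) = 0.5 at init

    def _hash(self, ids, salt):
        # cheap multiplicative hash into [0, buckets) -- an assumption
        return (ids * (2654435761 + salt)) % self.buckets

    def __call__(self, tokens):  # tokens: (T,) integer array
        prev1 = np.roll(tokens, 1)
        prev2 = np.roll(tokens, 2)
        bigram = tokens * 31 + prev1
        trigram = bigram * 31 + prev2
        feats = []
        for i, table in enumerate(self.tables):
            ngram = bigram if i % 2 == 0 else trigram
            feats.append(table[self._hash(ngram, i)])
        gate = 1.0 / (1.0 + np.exp(-self.gate_logit))
        return gate * np.concatenate(feats, axis=-1)  # (T, heads*2*dim_per_head)
```

The gated output would be added to the regular token embedding, letting the model learn how much n-gram signal to mix in.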
U-Net skip connections
U-Net skip connections modulated by learned sigmoid gates.
parameters: null
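The gated skip is a one-line change to a plain additive U-Net skip; a sketch, assuming one gate logit per channel initialized at zero (so each gate starts at 0.5):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(decoder_x, encoder_skip, gate_logits):
    """Add the saved encoder activation scaled by a learned per-channel
    sigmoid gate, instead of adding it unmodulated."""
    return decoder_x + sigmoid(gate_logits) * encoder_skip
```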
XSA
XSA applied to all layers.
parameters: {"layers":11}
LeakyReLU
Squared LeakyReLU activation with negative slope 0.3.
parameters: {"negative_slope":0.3,"squared":true}
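Taking the entry at face value as f(x) = LeakyReLU(x)² with slope 0.3 (note the square makes the output nonnegative on both branches):

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.3):
    # LeakyReLU, then square: x^2 for x >= 0, (0.3 x)^2 for x < 0
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```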
Quantization
GPTQ
bits: 6
scope: all weights
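A simplified sketch of the GPTQ column-by-column recipe: quantize each weight column, then push its rounding error onto the not-yet-quantized columns via the inverse Hessian H⁻¹ with H = XᵀX from calibration activations. This follows the published algorithm in spirit only — the PR's full-Hessian/Cholesky error compensation, packing, and 6-bit format details are not reproduced here:

```python
import numpy as np

def quantize_col(w, bits=6):
    # symmetric uniform quantization of one column to 2^bits levels
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def gptq_quantize(W, X, bits=6, damp=0.01):
    """W: (rows, cols) weights; X: (n, cols) calibration activations."""
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampening
    Hinv = np.linalg.inv(H)
    Q = W.copy()
    for i in range(W.shape[1]):
        q = quantize_col(Q[:, i], bits)
        err = (Q[:, i] - q) / Hinv[i, i]
        Q[:, i] = q
        if i + 1 < W.shape[1]:
            # compensate: spread this column's error over remaining columns
            Q[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

Production implementations factor H⁻¹ with a Cholesky decomposition instead of inverting it, for speed and numerical stability.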
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"ns_steps":4}
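Muon orthogonalizes each update matrix with a few Newton-Schulz iterations (ns_steps=4 above). A sketch using the classic cubic iteration; the actual optimizer uses tuned polynomial coefficients and, in the "parallel" variant, shards this work across devices:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=4):
    """Approximately replace G with an orthogonal matrix of the same
    'direction' (U V^T from G's SVD) without computing an SVD."""
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    for _ in range(steps):
        # cubic Newton-Schulz step: pushes every singular value toward 1
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```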
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency_steps":50,"scale_threshold":0.2}
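The two averaging schemes above can be sketched as follows — an exponential moving average with decay 0.997 and an equal-weight average taken every 50 steps. How the PR combines the two averages, and what its scale_threshold=0.2 gates, is not stated here, so only the updates are shown:

```python
class AveragedWeights:
    """Maintain EMA and SWA copies of a parameter dict (floats, for brevity)."""
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = dict(params)
        self.swa = dict(params)
        self.swa_n = 1          # counts the initial weights as the first sample
        self.decay = ema_decay
        self.swa_every = swa_every

    def update(self, step, params):
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if step % self.swa_every == 0:
            for k, v in params.items():
                self.swa[k] = (self.swa[k] * self.swa_n + v) / (self.swa_n + 1)
            self.swa_n += 1
```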
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
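Sliding-window evaluation scores each token with a long left context while only counting each token once: the window advances by the stride (64 in the PR) and only the newly uncovered tokens contribute to the loss. A sketch — the window length and `loss_fn` (a stand-in for the model's per-token loss) are assumptions:

```python
def sliding_window_eval(tokens, loss_fn, window=1024, stride=64):
    """Average per-token loss, counting only the last `stride` tokens of
    each window so every token is scored exactly once."""
    total, count = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        ctx = tokens[max(0, end - window):end]
        losses = loss_fn(ctx)           # one loss per token in ctx
        new = min(stride, len(ctx))     # tokens not covered by prior windows
        total += sum(losses[-new:])
        count += new
    return total / count
```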
LR Schedule
warmdown
parameters: {"lr_floor":0.05}
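A sketch of a warmdown schedule with the listed floor: hold the base LR, then decay linearly over the final stretch of training down to 5% of base. Only lr_floor=0.05 comes from the PR; the length of the warmdown phase is an assumption:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=30, lr_floor=0.05):
    """Constant LR, then linear decay to base_lr * lr_floor over the last
    `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    return base_lr * (1.0 - progress * (1.0 - lr_floor))
```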
Regularization
LN scale
parameters: {"scale":"1/sqrt(l+1)"}
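The listed scale damps deeper layers' residual contributions by 1/sqrt(l+1) for layer index l; where exactly it is applied (block output vs. LN weight init) is not stated, so only the factor itself is shown:

```python
import math

def ln_scale(layer_index):
    # layer 0 -> 1.0, layer 3 -> 0.5, deeper layers contribute less
    return 1.0 / math.sqrt(layer_index + 1)
```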
Other
other
Coprime-stride multi-shard data loader for diverse batches across 80 shards.
parameters: {"shards":80}
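The coprime-stride trick: walk each shard with a stride coprime to its length, so index (i * stride) mod length visits every position exactly once while successive batches jump around the shard. The shard count (80) is from the PR; the stride-selection rule below is an illustrative assumption:

```python
import math

def coprime_stride(length, start=2):
    """Smallest stride >= start that is coprime to the shard length."""
    s = start
    while math.gcd(s, length) != 1:
        s += 1
    return s

def shard_order(length, stride):
    # a full permutation of [0, length) because gcd(stride, length) == 1
    return [(i * stride) % length for i in range(length)]
```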
other
FlashAttention 3 running natively on Hopper hardware.
parameters: null
Novel Contributions
- EngramLite multi-head bigram+trigram hash embeddings
- Sigmoid-gated skip connections on U-Net skips
- Full Hessian GPTQ with Cholesky error compensation
- Coprime-stride multi-shard loader across 80 shards
- XSA applied to all 11 layers
- FlashAttention 3 Hopper-native setup
- Combined stack from prior PR innovations