PR #788

Status: open

Record: 11L + order-adaptive 9-gram backoff (mean val_bpb=0.9059)

by hypery11
val_bpb: 0.9059
Architecture: Transformer
Optimizer:
Artifact Size: 13.99 MB

Training Techniques

Architecture
  • XSA: Exclusive Self-Attention applied to all layers; parameters: {"layers":11}
  • LeakyReLU(0.5)^2: squared LeakyReLU activation in the MLP
  • Value Residual: adds value residual connections
  • Gated Attention: uses a gated attention mechanism
  • BigramHash: hash-based bigram feature module; parameters: {"dimensions":128,"size":10240}
  • SmearGate: gating module used in the architecture
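The squared LeakyReLU listed above is simple to state; a minimal scalar sketch in plain Python (the model applies it elementwise inside the MLP, e.g. over tensors; the function name here is hypothetical):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU: LeakyReLU(x; slope) ** 2.

    The square makes the output non-negative, but the negative branch
    still carries gradient through the slope term.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

For example, `leaky_relu_squared(2.0)` gives 4.0 and `leaky_relu_squared(-2.0)` gives 1.0 (since 0.5 * -2 = -1, squared).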
Weight Averaging
  • EMA; parameters: {"decay":0.997}
  • SWA; parameters: {"type":"Tight SWA"}
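EMA with decay 0.997 maintains a shadow copy of the parameters, updated after every optimizer step; a minimal sketch over plain lists of floats (function name and representation are hypothetical):

```python
def ema_update(shadow, params, decay=0.997):
    """In-place EMA step: shadow = decay * shadow + (1 - decay) * params.

    With decay 0.997, the shadow weights average over roughly the last
    few hundred steps; evaluation uses the shadow copy, not the raw weights.
    """
    for i, p in enumerate(params):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * p
    return shadow
```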
Compression
  • zstd, level 22
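The contribution list below pairs this zstd-22 pass with GPTQ-lite int6 quantization. A minimal sketch of symmetric round-to-nearest int6 quantization (illustrative only; GPTQ-lite proper would also propagate rounding error to not-yet-quantized weights, and the packed codes would then be compressed with zstd at level 22):

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest int6 quantization (sketch).

    Maps floats to integer codes in [-31, 31] with a single scale.
    """
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    """Recover approximate float weights from int6 codes."""
    return [c * scale for c in codes]
```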
Evaluation
  • order-adaptive entropy-gated n-gram backoff cache; parameters: {"orders":[2,3,4,5,6,7,8,9],"deterministic":true,"score_first":true}
Other
  • Late QAT

Novel Contributions

  • 11-layer transformer with XSA-all
  • Order-adaptive entropy-gated n-gram backoff cache from 2-gram to 9-gram
  • Higher-order n-gram matches use lower entropy thresholds for mixing
  • Score-first, deterministic inference without TTT
  • GPTQ-lite int6 compression with zstd-22
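The backoff cache described above can be sketched as follows: try the longest matching context first (9-gram down to 2-gram), compute the entropy of its cached next-token distribution, and accept it for mixing only if that entropy clears the order's threshold, with higher orders gated by lower (stricter) thresholds per the notes above. This is a hypothetical, simplified reconstruction; class names, threshold values, and the mixing step (omitted here) are assumptions:

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a {token: prob} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

class NGramBackoffCache:
    """Order-adaptive entropy-gated n-gram backoff cache (sketch).

    Stores next-token counts per context for orders 2..9. At inference,
    the longest matching context whose cached distribution passes that
    order's entropy gate is returned for mixing with the model's
    distribution (mixing itself is omitted here).
    """
    def __init__(self, orders=range(2, 10)):
        self.orders = sorted(orders, reverse=True)  # try 9-gram first
        # Illustrative thresholds: stricter (lower) as order rises.
        self.thresholds = {n: 3.0 - 0.25 * n for n in self.orders}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}

    def observe(self, tokens):
        """Count every (context, next-token) pair for all orders."""
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    def lookup(self, context):
        """Return (order, distribution) for the longest entropy-passing
        match, or (None, None) if no order's gate is passed."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) != n - 1 or ctx not in self.counts[n]:
                continue
            c = self.counts[n][ctx]
            total = sum(c.values())
            dist = {t: v / total for t, v in c.items()}
            if entropy(dist) <= self.thresholds[n]:
                return n, dist
        return None, None
```

Because lookup is pure counting plus a deterministic gate, it fits the score-first, deterministic inference setup described above.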