PR #795

open

Record: 11L + order-adaptive 11-gram (mean val_bpb=0.8881)

by hypery11
val_bpb: 0.8881
Architecture: Transformer
Optimizer
Artifact Size: 13.99 MB

Training Techniques

Architecture
XSA-all
11-layer transformer variant using XSA-all attention.
parameters: {"layers":11,"dim":512,"gqa_heads":"8/4"}
BigramHash
Bigram hash module used as part of the architecture.
parameters: {"dimensions":128,"size":10240}
SmearGate
Gating component included in the model architecture.
parameters: null
MLP3x
Uses a 3x MLP block.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
Value Residual
Adds value residual connections.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
zstd
level: 22
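A minimal sketch of the quantize-then-compress pipeline the two entries above imply. All names here are illustrative: "GPTQ-lite" is this PR's own variant (GPTQ proper also compensates rounding error, which is omitted here), and the zstd-22 stage would need the third-party `zstandard` package, so stdlib `zlib` stands in below purely to keep the sketch self-contained.

```python
import zlib  # stand-in for zstd level 22, which requires the `zstandard` package


def quantize_int6(values):
    """Symmetric round-to-nearest int6: map floats onto integers in [-31, 31].

    A simplification of the PR's GPTQ-lite scheme, whose exact details
    are not given; this only shows the 6-bit range and scale handling.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 31.0 if amax > 0 else 1.0
    q = [max(-31, min(31, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Invert the quantization up to rounding error (<= scale / 2)."""
    return [x * scale for x in q]


def compress_int6(q):
    """Shift each int6 value into an unsigned byte (0..62), then compress."""
    return zlib.compress(bytes(x + 31 for x in q), level=9)
```

Packing one value per byte wastes 2 bits; a real artifact at 13.99 MB would presumably bit-pack and rely on the entropy coder to absorb the redundancy.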
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
Evaluation
order-adaptive n-gram backoff cache
parameters: {"orders":"2-11","highest_order_first":true,"entropy_gating":true,"score_first":true,"deterministic":true}
Test-Time Training
score-first TTT
parameters: {"enabled":false}
Regularization
layerwise LN scaling
parameters: null
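The EMA entry above (decay 0.997) amounts to a per-step blend of the current weights into a shadow copy. A minimal sketch, with weights modeled as a dict of floats for illustration; the function name and representation are assumptions, not the PR's code.

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    """Blend current model weights into the EMA shadow copy in place.

    Each EMA entry moves a fraction (1 - decay) toward the live weight,
    so with decay=0.997 the shadow tracks a slow average of training.
    """
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights
```

The EMA copy, not the raw weights, would then be what gets quantized and evaluated.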

Novel Contributions

  • 11-layer XSA-all transformer
  • Order-adaptive entropy-gated n-gram backoff from orders 2 to 11
  • Higher-order matches use lower entropy thresholds
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • Score-first, deterministic evaluation without TTT
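The order-adaptive backoff in the bullets above can be sketched as follows: try the longest matching context first, and accept its next-token distribution only if the distribution's entropy passes an order-dependent gate, with higher orders held to lower thresholds. The class name, threshold schedule, and count structure are all assumptions for illustration; only the highest-order-first lookup and the tightening entropy gates come from the PR description.

```python
import math
from collections import defaultdict


class NGramBackoffCache:
    """Order-adaptive n-gram backoff with entropy gating (sketch)."""

    def __init__(self, orders=range(2, 12), base_threshold=3.0, step=0.25):
        self.orders = sorted(orders, reverse=True)  # highest order first
        lo = min(self.orders)
        # Assumed linear schedule: higher orders get lower entropy caps.
        self.thresholds = {n: base_threshold - step * (n - lo) for n in self.orders}
        # counts[n][context_tuple][next_token] -> occurrence count
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        """Count every n-gram of every tracked order in the token stream."""
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i : i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    @staticmethod
    def _entropy(dist):
        total = sum(dist.values())
        return -sum((c / total) * math.log2(c / total) for c in dist.values())

    def predict(self, context):
        """Return (order, distribution) from the highest-order context whose
        next-token distribution passes its entropy gate, else None (fall back
        to the transformer)."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            dist = self.counts[n].get(ctx)
            if dist and self._entropy(dist) <= self.thresholds[n]:
                total = sum(dist.values())
                return n, {t: c / total for t, c in dist.items()}
        return None
```

Because lookup order and thresholds are fixed and counts are exact, the cache's output is deterministic, consistent with the "deterministic" and "highest_order_first" flags in the Evaluation parameters.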