PR #795

open

Record: 11L + order-adaptive 11-gram (mean val_bpb=0.8881)

by hypery11
val_bpb: 0.8881
Architecture: Transformer
Optimizer
Artifact Size: 13.99 MB

Training Techniques

Architecture
XSA-all
11-layer transformer variant using XSA-all attention.
parameters: {"layers":11,"dim":512,"gqa_heads":"8/4"}
BigramHash
Bigram hash module used as part of the architecture.
parameters: {"dimensions":128,"size":10240}
SmearGate
Gating component included in the model architecture.
parameters: null
MLP3x
Uses a 3x MLP block.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
Value Residual
Adds value residual connections.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
zstd
level: 22
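A minimal sketch of the quantize-then-compress pipeline the two entries above imply. All names here are illustrative: "GPTQ-lite" is this PR's own variant (GPTQ proper also compensates rounding error, which is omitted here), and the zstd-22 stage would need the third-party `zstandard` package, so stdlib `zlib` stands in below purely to keep the sketch self-contained.

```python
import zlib  # stand-in for zstd level 22, which requires the `zstandard` package


def quantize_int6(values):
    """Symmetric round-to-nearest int6: map floats onto integers in [-31, 31].

    A simplification of the PR's GPTQ-lite scheme, whose exact details
    are not given; this only shows the 6-bit range and scale handling.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 31.0 if amax > 0 else 1.0
    q = [max(-31, min(31, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Invert the quantization up to rounding error (<= scale / 2)."""
    return [x * scale for x in q]


def compress_int6(q):
    """Shift each int6 value into an unsigned byte (0..62), then compress."""
    return zlib.compress(bytes(x + 31 for x in q), level=9)
```

Packing one value per byte wastes 2 bits; a real artifact at 13.99 MB would presumably bit-pack and rely on the entropy coder to absorb the redundancy.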
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
Evaluation
order-adaptive n-gram backoff cache
parameters: {"orders":"2-11","highest_order_first":true,"entropy_gating":true,"score_first":true,"deterministic":true}
Test-Time Training
score-first TTT
parameters: {"enabled":false}
Regularization
layerwise LN scaling
parameters: null
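The EMA entry above (decay 0.997) amounts to a per-step blend of the current weights into a shadow copy. A minimal sketch, with weights modeled as a dict of floats for illustration; the function name and representation are assumptions, not the PR's code.

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    """Blend current model weights into the EMA shadow copy in place.

    Each EMA entry moves a fraction (1 - decay) toward the live weight,
    so with decay=0.997 the shadow tracks a slow average of training.
    """
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights
```

The EMA copy, not the raw weights, would then be what gets quantized and evaluated.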

Novel Contributions

  • 11-layer XSA-all transformer
  • Order-adaptive entropy-gated n-gram backoff from orders 2 to 11
  • Higher-order matches use lower entropy thresholds
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • Score-first, deterministic evaluation without TTT
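The order-adaptive backoff in the bullets above can be sketched as follows: try the longest matching context first, and accept its next-token distribution only if the distribution's entropy passes an order-dependent gate, with higher orders held to lower thresholds. The class name, threshold schedule, and count structure are all assumptions for illustration; only the highest-order-first lookup and the tightening entropy gates come from the PR description.

```python
import math
from collections import defaultdict


class NGramBackoffCache:
    """Order-adaptive n-gram backoff with entropy gating (sketch)."""

    def __init__(self, orders=range(2, 12), base_threshold=3.0, step=0.25):
        self.orders = sorted(orders, reverse=True)  # highest order first
        lo = min(self.orders)
        # Assumed linear schedule: higher orders get lower entropy caps.
        self.thresholds = {n: base_threshold - step * (n - lo) for n in self.orders}
        # counts[n][context_tuple][next_token] -> occurrence count
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        """Count every n-gram of every tracked order in the token stream."""
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i : i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    @staticmethod
    def _entropy(dist):
        total = sum(dist.values())
        return -sum((c / total) * math.log2(c / total) for c in dist.values())

    def predict(self, context):
        """Return (order, distribution) from the highest-order context whose
        next-token distribution passes its entropy gate, else None (fall back
        to the transformer)."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            dist = self.counts[n].get(ctx)
            if dist and self._entropy(dist) <= self.thresholds[n]:
                total = sum(dist.values())
                return n, {t: c / total for t, c in dist.items()}
        return None
```

Because lookup order and thresholds are fixed and counts are exact, the cache's output is deterministic, consistent with the "deterministic" and "highest_order_first" flags in the Evaluation parameters.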