PR #993

closed

Record: 11L XSA + Mixed INT6 + Adaptive N-gram Cache (2->7 backoff) - val_bpb=0.9631, 3-seed

by aerosta
val_bpb
0.9631
Architecture
Transformer
Optimizer
Artifact Size
15,882,569 bytes

Training Techniques

Architecture
XSA
XSA applied to all 11 layers of an 11-layer Transformer with a 512-dim hidden size, 8 query heads, and 4 key/value heads.
parameters: {"layers":11,"hidden_dim":512,"q_heads":8,"kv_heads":4}
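XSA itself is not defined in this record, but independent of its details, the 8-query / 4-key-value head split means each KV head serves two query heads, grouped-query style. A minimal shape-level sketch of that head sharing (all names here are assumptions, not the PR's code):

```python
import numpy as np

d, q_heads, kv_heads = 512, 8, 4
head_dim = d // q_heads              # 64
group = q_heads // kv_heads          # 2 query heads per KV head

rng = np.random.default_rng(0)
T = 10                               # sequence length for the sketch
q = rng.standard_normal((q_heads, T, head_dim))
k = rng.standard_normal((kv_heads, T, head_dim))
v = rng.standard_normal((kv_heads, T, head_dim))

# Expand each KV head across its group of query heads, then attend.
k_full = np.repeat(k, group, axis=0)         # (8, T, 64)
v_full = np.repeat(v, group, axis=0)
att = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
att = np.exp(att - att.max(-1, keepdims=True))
att /= att.sum(-1, keepdims=True)            # softmax over keys
out = att @ v_full                           # (8, T, 64)
print(out.shape)
```

Halving the KV heads shrinks the KV cache and attention parameters without reducing the number of query projections.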
MLP3x
MLP widened to three times the hidden size, with squared-ReLU (ReLU²) activation.
parameters: {"multiplier":3,"activation":"ReLU²"}
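The MLP3x block is simple to sketch: project 512 -> 1536, apply squared ReLU, project back. A minimal version with assumed weight init (not the PR's code):

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0) ** 2
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # 3x-wide MLP: d -> 3d, relu^2, 3d -> d.
    return relu2(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d = 512
w_in = rng.standard_normal((d, 3 * d)) * 0.02
w_out = rng.standard_normal((3 * d, d)) * 0.02
x = rng.standard_normal((4, d))
y = mlp3x(x, w_in, w_out)
print(y.shape)  # (4, 512)
```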
weight tying
Tied embeddings.
parameters: null
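Weight tying reuses one matrix for both the input embedding and the output projection, saving `vocab * d` parameters. A minimal sketch (sizes are illustrative, not from the record):

```python
import numpy as np

vocab, d = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d)) * 0.02   # the single shared matrix

def embed(token_ids):
    return E[token_ids]                      # token id -> vector

def logits(hidden):
    return hidden @ E.T                      # vector -> vocab logits, no separate head

h = embed(np.array([3, 7]))                  # (2, 16)
print(logits(h).shape)                       # (2, 100)
```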
Weight Averaging
EMA + SWA
parameters: null
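The two averaging schemes listed differ in weighting: EMA decays old parameters geometrically every step, while SWA takes an equal-weight mean over checkpoints. A minimal sketch of both updates (the decay value and schedule are assumptions; the record does not give them):

```python
def ema_update(ema, params, decay=0.999):
    # Exponential moving average, updated every optimizer step.
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

def swa_update(swa, params, n_averaged):
    # Equal-weight running mean over checkpoints collected late in training.
    return [(s * n_averaged + p) / (n_averaged + 1) for s, p in zip(swa, params)]

params = [1.0, 2.0]
ema = list(params)
for _ in range(3):
    params = [p + 0.1 for p in params]       # stand-in for an optimizer step
    ema = ema_update(ema, params, decay=0.9)

swa = swa_update([1.0, 2.0], [1.2, 2.2], 1)  # mean of two checkpoints
```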
Quantization
mixed int6
bits: 6
scope: post-training mixed
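"Mixed" here implies some tensors stay at higher precision while the rest get 6 bits; the record does not say which. The 6-bit path itself can be sketched as symmetric per-tensor quantization into the signed int6 range [-32, 31] (scaling scheme assumed, not taken from the PR):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor quantization into the int6 range [-32, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()     # bounded by scale / 2
```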
Compression
lzma
level: null
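The 15.9 MB artifact size reflects LZMA applied to the serialized int6 weight stream. The compression step itself is just Python's `lzma` module (the buffer below is stand-in bytes, not the real artifact):

```python
import lzma

# LZMA-compress a serialized weight buffer; repetitive stand-in data here.
raw = bytes(range(64)) * 256            # 16 KiB
packed = lzma.compress(raw, preset=9)
restored = lzma.decompress(packed)
print(len(raw), "->", len(packed))
```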
Evaluation
sliding window eval
parameters: {"stride":64}
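Sliding-window evaluation with stride 64 means each window after the first only counts loss on its final 64 positions, so every token is scored exactly once with up to a full window of left context. A sketch of the window/score bookkeeping (window size 512 is an assumption; only the stride is given):

```python
def sliding_windows(n_tokens, window=512, stride=64):
    # Each tuple is (start, end, score_from): loss is counted only on
    # positions [score_from, end), so every token is scored once.
    end = min(window, n_tokens)
    wins = [(0, end, 0)]                 # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        wins.append((max(0, new_end - window), new_end, end))
        end = new_end
    return wins

wins = sliding_windows(1000)
print(wins[:2])                          # [(0, 512, 0), (64, 576, 512)]
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes.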
Other
other
Adaptive score-first n-gram cache with backoff orders 2->7, applied only to later positions/windows after scoring earlier windows.
parameters: {"orders":"2->7","adaptive_mode":"sigmoid_raw_entropy","alpha_range":[0.05,0.6],"hash_buckets":4194304,"min_count":2}
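Putting the parameters together: hashed n-gram counts over orders 2 through 7, backing off from the longest context with at least `min_count` evidence, with the blend weight alpha driven by a sigmoid of the model's raw entropy into the [0.05, 0.6] range. The sketch below mirrors those parameters but all class/function names and the exact backoff rule are assumptions, not the PR's implementation:

```python
import math
from collections import defaultdict

ORDERS = range(2, 8)          # backoff orders 2..7
MIN_COUNT = 2                 # min_count from the record

class NgramCache:
    # Hashed n-gram counts ("score-first": populated from windows that
    # have already been scored, then blended into later windows only).
    def __init__(self, buckets=1 << 22):       # 4,194,304 hash buckets
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def update(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[self._key(ctx)][tokens[i + n - 1]] += 1

    def predict(self, context):
        # Back off 7 -> 2: use the longest context with enough evidence.
        for n in reversed(ORDERS):
            ctx = tuple(context[-(n - 1):])
            dist = self.counts.get(self._key(ctx))
            if dist and sum(dist.values()) >= MIN_COUNT:
                total = sum(dist.values())
                return {t: c / total for t, c in dist.items()}
        return None

def blend_alpha(entropy, lo=0.05, hi=0.6):
    # Adaptive mixing weight: sigmoid of raw model entropy, rescaled
    # into [lo, hi] (matching alpha_range in the record).
    s = 1.0 / (1.0 + math.exp(-entropy))
    return lo + (hi - lo) * s

cache = NgramCache()
cache.update(list("ababab"))
print(cache.predict(["a"]))
```

The final prediction would mix the cache distribution with the model's, weighted by `blend_alpha`, so the cache contributes most exactly where the model is uncertain.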

Novel Contributions

  • 11-layer XSA Transformer with tied embeddings and 3x MLP using ReLU²
  • Post-training mixed INT6 quantization with LZMA compression
  • Sliding-window evaluation with stride 64
  • Adaptive score-first n-gram cache with 2->7 backoff
  • EMA plus late SWA weight averaging