PR #761

open

Record: Score-First TTT + N-gram Backoff (3-seed mean val_bpb=0.9581)

by Asukabot0
val_bpb
0.9581
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.7 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied across all 11 layers: each token attends to other positions but not to itself, removing self-position bias.
parameters: {"layers":11}
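A minimal sketch of the attention mask XSA implies: causal attention with the diagonal excluded, so a token never attends to its own position. The fallback for position 0 (which would otherwise have nothing to attend to) is an assumption, not something the PR specifies.

```python
def xsa_mask(n):
    # Causal mask that also excludes the diagonal: position i may attend
    # to positions j < i but not to itself.
    # Assumption: position 0 falls back to attending itself so its row
    # is not empty; the PR does not state how this edge case is handled.
    mask = [[j < i for j in range(n)] for i in range(n)]
    mask[0][0] = True
    return mask
```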
LeakyReLU^2
Uses leaky_relu(x, 0.5).square() to preserve negative gradient flow.
parameters: {"negative_slope":0.5}
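The activation as described, `leaky_relu(x, 0.5).square()`, written out element-wise; squaring keeps the output non-negative while the leaky slope keeps gradients flowing for negative inputs.

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # leaky_relu(x, 0.5) followed by squaring, per the PR description.
    y = x if x >= 0 else negative_slope * x
    return y * y
```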
Value Residual
Layer 0 value output is mixed into subsequent layers via learned sigmoid gates.
parameters: null
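A sketch of the value-residual mixing, assuming a convex combination controlled by a learned sigmoid gate (the exact mixing form is an assumption; the PR only says layer 0's value output is mixed in via learned sigmoid gates).

```python
import math

def value_residual(v_l, v_0, gate_logit):
    # Mix the current layer's value output v_l with layer 0's value
    # output v_0: v = g * v_l + (1 - g) * v_0, where g = sigmoid of a
    # learned per-layer logit. The convex-combination form is an
    # assumption from the PR description.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * a + (1.0 - g) * b for a, b in zip(v_l, v_0)]
```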
Gated Attention
Per-head sigmoid gates on attention output.
parameters: null
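A sketch of per-head output gating: each head's attention output is scaled by a sigmoid of a learned per-head logit. Applying the gate before head concatenation is an assumption about placement.

```python
import math

def gated_attention_output(head_outputs, gate_logits):
    # Scale each head's output by sigmoid(gate_logit) for that head.
    # Assumption: the gate is applied per head before the output
    # projection; the PR only states "per-head sigmoid gates".
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return [[sigmoid(g) * x for x in h]
            for h, g in zip(head_outputs, gate_logits)]
```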
SmearGate
Learned gate that smears each token's representation with the preceding token's, carrying information forward by one position.
parameters: null
BigramHash
Bigram hashing feature with 4096 buckets.
parameters: {"buckets":4096}
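The bigram feature reduces to hashing (previous token, current token) pairs into 4096 buckets; the hash function below is a hypothetical stand-in (an odd multiplier and a modulus), with only the bucket count taken from the PR parameters.

```python
def bigram_bucket(prev_id, cur_id, buckets=4096):
    # Hash a (previous token, current token) pair into one of `buckets`
    # buckets. The multiplier 1000003 is an arbitrary odd constant
    # (assumption); 4096 buckets comes from the PR parameters.
    return (prev_id * 1000003 + cur_id) % buckets
```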
Partial RoPE
Applies rotary positional embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
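A sketch of partial RoPE for one head vector: rotary embeddings are applied to the first 16 of 64 head dimensions and the rest pass through unchanged. Which dimensions are rotated, and the frequency base, are assumptions; the 16/64 split comes from the PR parameters.

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dimensions of a head vector in pairs,
    # pass the remaining dims through unchanged.
    # Assumptions: first-dims-rotated layout and base=10000.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```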
MLP3x
Uses a 3x wider MLP.
parameters: {"multiplier":3}
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
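The grouping in GQA with 8 query heads and 4 KV heads: each consecutive pair of query heads shares one KV head.

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: each group of n_heads // n_kv_heads
    # query heads shares one KV head (8 heads / 4 KV heads = groups of 2).
    return q_head // (n_heads // n_kv_heads)
```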
U-Net skip connections
Skip connections inspired by U-Net are used in the transformer stack.
parameters: null
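A sketch of how U-Net-style pairing could look over the 11-layer stack: early layers are paired with late layers, with the early layer's output added back before its partner runs. The exact pairing scheme is an assumption; the PR does not specify it.

```python
def unet_skip_pairs(n_layers=11):
    # Pair layer i with layer n_layers - 1 - i, U-Net style.
    # Assumption: symmetric pairing; the middle layer of an odd stack
    # is left unpaired.
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]
```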
Initialization
OrthoInit
Orthogonal initialization used with SmearGate.
Weight Averaging
EMA
parameters: {"decay":0.997}
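The EMA weight-averaging update with the listed decay of 0.997:

```python
def ema_update(avg, new, decay=0.997):
    # Exponential moving average of weights:
    # avg <- decay * avg + (1 - decay) * new, with decay=0.997 per the PR.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```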
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
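A sketch of the warmdown schedule, assuming a constant LR followed by a linear decay to zero over the final 3000 steps (the decay shape is an assumption; only `warmdown_steps` comes from the PR).

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR, then linear warmdown to 0 over the last
    # `warmdown_steps` steps. Linear shape is an assumption.
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```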
Quantization
int6 per-row
bits: 6
scope: all
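A sketch of per-row int6 quantization: one scale per weight row, values rounded into a 6-bit signed range. Using the symmetric range [-31, 31] (excluding -32) is an assumption about how the 6-bit range is handled.

```python
def quantize_row_int6(row):
    # Per-row symmetric quantization: one float scale per row, values
    # clamped to [-31, 31]. Assumption: symmetric range excluding -32.
    scale = max(abs(v) for v in row) / 31.0 or 1.0
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    # Reconstruct approximate float weights from int6 codes and scale.
    return [v * scale for v in q]
```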
Compression
zstd
level: 16
Evaluation
sliding window eval
parameters: {"stride":64}
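A sketch of how stride-64 sliding-window evaluation partitions a token stream; the window length of 1024 is an assumption (only the stride comes from the PR). Typically only the trailing `stride` tokens of each window are scored, so every token is evaluated with the longest available left context.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # Evaluation windows advancing by `stride` tokens.
    # Assumption: window=1024; stride=64 is from the PR parameters.
    starts = list(range(0, max(1, n_tokens - window + 1), stride))
    return [(s, min(s + window, n_tokens)) for s in starts]
```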
Test-Time Training
score-first TTT
parameters: {"chunk_size_tokens":131000,"learning_rate":0.0001,"epochs":4,"freeze_first_blocks":2,"grad_clip":1}
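The core score-first TTT loop as a sketch: each chunk is scored with the current weights before the model adapts on it, so no chunk's score ever benefits from having trained on that same chunk. `score_fn` and `adapt_fn` are hypothetical stand-ins for the model's loss evaluation and gradient update.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=4):
    # Score-first test-time training: evaluate each chunk BEFORE
    # adapting on it. score_fn/adapt_fn are hypothetical stand-ins;
    # epochs=4 matches the PR parameters.
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))   # score with current weights
        for _ in range(epochs):
            adapt_fn(chunk)              # then train on the chunk
    return losses
```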
Other
n-gram backoff cache
Multi-order n-gram backoff cache with entropy-adaptive alpha mixing, using orders 2-7 and backward-looking cache updates only.
parameters: {"orders":[2,3,4,5,6,7]}
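A sketch of the entropy-adaptive mixing step: the n-gram cache distribution gets more weight when the model is uncertain (high entropy). `alpha_max` and `entropy_scale` are hypothetical knobs, and the orders-2 through 7 backoff itself would produce `ngram_probs`; the PR specifies only the orders and the entropy-adaptive behavior.

```python
def backoff_mix(model_probs, ngram_probs, entropy,
                alpha_max=0.3, entropy_scale=3.0):
    # Entropy-adaptive alpha mixing: higher model entropy -> more weight
    # on the n-gram cache distribution. alpha_max and entropy_scale are
    # hypothetical; the entropy-adaptive form follows the PR description.
    alpha = alpha_max * min(1.0, entropy / entropy_scale)
    return [(1 - alpha) * p + alpha * q
            for p, q in zip(model_probs, ngram_probs)]
```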

Novel Contributions

  • Score-first test-time training compliant with the issue constraints
  • Multi-order n-gram backoff cache with entropy-adaptive alpha
  • XSA applied to all 11 layers
  • LeakyReLU(0.5)^2 activation
  • Value Residual and Gated Attention integration
  • Int6 per-row quantization with zstd compression