PR #706

open

Podracing: 1.0461 BPB (3-seed mean) — 5-gram eval + LeakyReLU²

val_bpb

1.0461

Architecture

11L/512d U-Net

Optimizer

—

Artifact Size

15.64 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: all
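
To make the "bits: 6" setting concrete, here is a minimal sketch of a symmetric 6-bit integer grid. It uses plain round-to-nearest with one scale per row, which is an assumption made for illustration; it is not GPTQ, which additionally compensates rounding error using second-order statistics. The function names and the per-row scaling are hypothetical and not taken from the PR.

```python
def quantize_int6(row):
    """Round-to-nearest symmetric int6 quantization of one weight row.

    Illustration only: GPTQ proper also redistributes rounding error
    across the remaining weights, which this sketch does not do.
    """
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude to 31; signed 6-bit integers span [-32, 31].
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-32, min(31, round(w / scale))) for w in row]
    return q, scale


def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes."""
    return [v * scale for v in q]


row = [1.0, -1.0, 0.0, 0.5]
codes, scale = quantize_int6(row)
restored = dequantize_int6(codes, scale)
```

Each code fits in 6 bits, so the reconstruction error per weight is bounded by about half of `scale`.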

Architecture

XSA

Applies XSA attention to the last 4 layers.

parameters: {"last_n":4}

BigramHash

Adds a BigramHash component that supplies hashed bigram features.

parameters: {"vocab_size":1536}
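
The idea behind a hashed bigram feature can be sketched as follows: each (previous token, current token) pair is mapped to one of `vocab_size` buckets, and the bucket id would then index a small embedding table. The hash multiplier and the function names are hypothetical; the PR card only states the table size of 1536.

```python
def bigram_bucket(prev_token: int, token: int, vocab_size: int = 1536) -> int:
    """Map a token pair to a bucket id in [0, vocab_size).

    The multiplier is an arbitrary choice for illustration, not the
    hash used in the PR.
    """
    return (prev_token * 1_000_003 + token) % vocab_size


def bigram_ids(tokens, vocab_size: int = 1536):
    """Bucket ids for every adjacent pair in a token sequence."""
    return [
        bigram_bucket(prev, cur, vocab_size)
        for prev, cur in zip(tokens, tokens[1:])
    ]


ids = bigram_ids([5, 7, 5, 7])  # pairs: (5, 7), (7, 5), (5, 7)
```

Because the table is much smaller than the number of possible pairs, distinct bigrams can share a bucket; the model has to tolerate those collisions.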

RoPE

Uses partial rotary positional embeddings (RoPE), rotating 24 dimensions.

parameters: {"dimensions":24}
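
A minimal sketch of partial RoPE on a single head vector, assuming the first 24 dimensions are rotated and the rest pass through unchanged. The half-split pairing, the base of 10000, and the 32-dimensional example vector are assumptions for illustration; the card gives only the number of rotated dimensions.

```python
import math


def partial_rope(vec, position, rotary_dims=24, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` by position-dependent angles.

    Dimension i is paired with dimension i + rotary_dims // 2 (an assumed
    convention). Entries beyond `rotary_dims` are copied unchanged.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i + 1) for i in range(32)]  # hypothetical head dimension of 32
rotated = partial_rope(head, position=5)
```

Leaving some dimensions unrotated gives the model a position-independent channel alongside the position-aware one.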

tied embeddings

Input and output embeddings are tied.

parameters: null

Other

LeakyReLU²

LeakyReLU squared activation with slope 0.5.

parameters: {"slope":0.5}
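
For reference, a scalar sketch of the activation, assuming "squared" means the LeakyReLU output is squared elementwise (so negative inputs yield a small positive value). The function name is hypothetical.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0.0 else slope * x
    return y * y


positive = leaky_relu_squared(2.0)   # 2.0 squared
negative = leaky_relu_squared(-2.0)  # (0.5 * -2.0) squared
```

Under this reading, a negative input of magnitude m contributes slope² · m², a quarter of what the same positive magnitude contributes when the slope is 0.5.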

Evaluation

5-gram eval interpolation

parameters: {"alpha":0.2,"order":5,"min_count":2,"buckets":4194304,"score_first":true,"legal":true}
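
The parameters above can be read as the following sketch of score-first interpolation: each token is scored with a fixed-weight mixture of the model probability and a hashed 5-gram estimate, and only afterwards is the token added to the cache. The class, the hash function, and the use of dictionaries in place of fixed-size count tables are assumptions for illustration, not the PR's code.

```python
import math


class HashedNgramCache:
    """Hashed n-gram counts built only from tokens that were already scored."""

    def __init__(self, order=5, buckets=4_194_304, min_count=2, alpha=0.2):
        self.order, self.buckets = order, buckets
        self.min_count, self.alpha = min_count, alpha
        self.ctx_counts = {}   # bucket -> count of the (order-1)-token context
        self.full_counts = {}  # bucket -> count of context followed by target

    def _bucket(self, tokens):
        h = 0
        for t in tokens:  # hypothetical polynomial hash
            h = (h * 1_000_003 + t + 1) % self.buckets
        return h

    def mix(self, context, target, p_model):
        """Probability of `target` under the fixed-weight mixture."""
        ctx = tuple(context[-(self.order - 1):])
        if len(ctx) < self.order - 1:
            return p_model
        c = self.ctx_counts.get(self._bucket(ctx), 0)
        if c < self.min_count:  # depends on the context only, never the target
            return p_model
        f = self.full_counts.get(self._bucket(ctx + (target,)), 0)
        p_ngram = min(f / c, 1.0)  # clamp in case hash collisions inflate f
        return (1.0 - self.alpha) * p_model + self.alpha * p_ngram

    def update(self, context, target):
        """Record a token after it has been scored."""
        ctx = tuple(context[-(self.order - 1):])
        if len(ctx) < self.order - 1:
            return
        for table, key in ((self.ctx_counts, ctx), (self.full_counts, ctx + (target,))):
            b = self._bucket(key)
            table[b] = table.get(b, 0) + 1


def score_stream(tokens, model_probs, cache):
    """Total bits for a stream: score each token first, then update the cache."""
    total_bits = 0.0
    for i, (tok, p) in enumerate(zip(tokens, model_probs)):
        context = tokens[max(0, i - cache.order + 1):i]
        total_bits += -math.log2(cache.mix(context, tok, p))
        cache.update(context, tok)  # the cache never sees a token before scoring it
    return total_bits
```

The mixing weight `alpha` is constant and the `min_count` check looks only at the context, so nothing about the decision to mix depends on the token being scored.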

Test-Time Training

score-first TTT

parameters: {"disabled":true}

Compression

zstd

level: null

5-gram eval interpolation using a fixed-weight hashed n-gram cache built from already-scored tokens only
Score-first legal evaluation with no safety gate or target-aware selection
LeakyReLU squared activation
XSA last-4 configuration with BigramHash and partial RoPE
GPTQ int6 quantization with late QAT