PR #958

open

Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)

by shouryamaanjain
val_bpb
1.1382
Architecture
Transformer
Optimizer
Artifact Size
~15.5 MB

Training Techniques

Architecture
BigramHash
Uses a bigram hash component with 2048 buckets and 128-dimensional embeddings.
parameters: {"buckets":2048,"dimensions":128}
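A minimal sketch of such a component, assuming a simple multiplicative hash over (previous, current) token pairs; the actual hash function and how the features enter the residual stream are not specified in the submission, and the table would be learned rather than randomly fixed:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # parameters reported in the submission
rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in practice

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Hash the (previous, current) token pair into one of BUCKETS slots.
    # The mixing constant is illustrative, not the submission's scheme.
    return ((prev_tok * 1000003) ^ tok) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One DIM-dimensional embedding per position, looked up by bigram id;
    # position 0 has no previous token, so it gets a zero vector.
    feats = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        feats[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return feats
```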
SmearGate
Gated blend of each position with the previous token's representation, with a separate mixing weight per dimension.
parameters: null
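One plausible reading of a per-dimension smear gate, assuming a sigmoid over a learned per-channel parameter (the gate's exact parameterization is not given):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each position's vector with the previous position's,
    using a separate learned mixing weight per channel.
    x: (seq, dim); gate_logits: (dim,) learned parameters (assumed form)."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))                 # sigmoid gate in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right by one token
    return (1.0 - g) * x + g * prev
```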
U-Net skip connections
U-Net style encoder-decoder with skip connections.
parameters: {"encoder_layers":5,"decoder_layers":6}
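A sketch of the skip-connection wiring, with layers as plain callables standing in for transformer blocks. The pairing is assumed to be last-in-first-out, and with 5 encoder vs 6 decoder layers one decoder layer necessarily goes without a skip; which one is an assumption here:

```python
import numpy as np

def unet_transformer(x, encoder_layers, decoder_layers):
    """Run encoder layers, stash each output, then add the stashed
    activations back (last-in-first-out) before each decoder layer.
    Real blocks would be attention + MLP; callables keep this self-contained."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer in decoder_layers:
        if skips:                  # 5 skips for 6 decoder layers: the
            x = x + skips.pop()    # last decoder layer gets none (assumed)
        x = layer(x)
    return x
```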
ReLU²
Uses the squared-ReLU (ReLU²) MLP activation.
parameters: null
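The activation itself is standard and can be sketched directly:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: zero for negative inputs, x**2 for positive ones.
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # Transformer-style MLP block using the ReLU^2 nonlinearity.
    return relu2(x @ w_in) @ w_out
```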
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
Initialization
OrthoInit
Orthogonal initialization with depth scaling.
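A sketch of orthogonal initialization with depth scaling; the 1/sqrt(2·depth) scaling law is an assumption (a common residual-branch choice), as the submission does not state the exact factor:

```python
import numpy as np

def ortho_init(shape, depth, gain=1.0, rng=None):
    """Orthogonal weight init, scaled down with network depth.
    The 1/sqrt(2*depth) factor is assumed, not taken from the submission."""
    rng = rng or np.random.default_rng()
    m, n = shape
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a if m >= n else a.T)
    q = q * np.sign(np.diag(r))   # fix QR sign ambiguity for determinism
    if m < n:
        q = q.T
    return gain * q[:m, :n] / np.sqrt(2 * depth)
```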
Weight Averaging
EMA
parameters: {"decay":0.997}
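EMA weight averaging with the reported decay of 0.997 can be sketched as follows (parameters kept as plain floats here; real arrays would need copying on init):

```python
class EMA:
    """Exponential moving average of model parameters, decay=0.997
    as reported in the submission."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)   # snapshot of the initial parameters

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v
```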
Quantization
mixed int6/int8
bits: null
scope: all
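A sketch of the symmetric integer quantization round-trip; which tensors receive 6 vs 8 bits, and whether scales are per-tensor or per-channel, is not specified in the submission:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization onto a signed `bits`-wide grid
    (per-tensor scale is an assumption; the submission does not say)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6, 127 for int8
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights from the integer grid.
    return q.astype(np.float32) * scale
```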
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.0001}
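The "score-first" ordering means each evaluation segment is scored before the model adapts on it, so the reported loss never sees weights trained on that segment. The submission adapts the transformer (epochs=3, lr=1e-4); the toy count model below is a stand-in to keep the sketch self-contained:

```python
import math
from collections import Counter, defaultdict

def score_first_ttt(segments, vocab=256, epochs=3):
    """Score-first test-time training with a toy add-one bigram count
    model in place of the real transformer + gradient steps."""
    counts = defaultdict(Counter)
    total_nll, total_tok = 0.0, 0
    for seg in segments:
        # 1) score the segment with the current, not-yet-adapted model
        for prev, tok in zip(seg, seg[1:]):
            ctx = counts[prev]
            p = (ctx.get(tok, 0) + 1) / (sum(ctx.values()) + vocab)
            total_nll -= math.log(p)
            total_tok += 1
        # 2) only then adapt on it (repeated `epochs` times)
        for _ in range(epochs):
            for prev, tok in zip(seg, seg[1:]):
                counts[prev][tok] += 1
    return total_nll / max(total_tok, 1)
```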
Evaluation
sliding window eval
parameters: {"stride":64}
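Sliding-window evaluation advances a fixed context window by `stride` tokens and only scores the newly revealed tokens, so each token keeps close to a full window of context. A sketch, where `logprob_fn(ctx)` returning per-token log-probs is an assumed interface:

```python
import numpy as np

def sliding_window_nll(logprob_fn, tokens, window=512, stride=64):
    """Mean negative log-likelihood under sliding-window evaluation.
    `logprob_fn(ctx)` -> array of per-token log-probs (assumed API)."""
    nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        target_len = end - prev_end              # score only the new tokens
        lp = logprob_fn(tokens[begin:end])
        nll -= lp[-target_len:].sum()
        counted += target_len
        prev_end = end
        if end == len(tokens):
            break
    return nll / counted
```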
Other
other
BOS-reset bigram cache applied during evaluation, blending model probabilities with document-local bigram counts and resetting at BOS tokens.
parameters: {"alpha":0.2,"tau":8,"entropy_power":1}
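A sketch of the evaluation-time blend. The exact roles of `tau` and `entropy_power` are assumptions here: the cache weight is taken to grow with the context count as n/(n+tau) and with the model's normalized entropy raised to `entropy_power`, capped by `alpha`; `model_probs(i)` returning the model's distribution at step i is also an assumed interface:

```python
import math
from collections import Counter, defaultdict

def blended_stream_nll(model_probs, tokens, bos_id=0,
                       alpha=0.2, tau=8, entropy_power=1):
    """BOS-reset bigram cache: mix the model's distribution with
    document-local bigram counts, clearing the counts at each BOS."""
    counts = defaultdict(Counter)
    nll = 0.0
    for i in range(1, len(tokens)):
        prev, tok = tokens[i - 1], tokens[i]
        if prev == bos_id:
            counts.clear()                    # new document: reset the cache
        p = model_probs(i)                    # dict token -> probability
        n = sum(counts[prev].values())
        if n:
            h = -sum(q * math.log(q) for q in p.values() if q > 0)
            h_norm = h / math.log(len(p))     # normalized entropy in [0, 1]
            lam = alpha * (n / (n + tau)) * h_norm ** entropy_power
            p_tok = (1 - lam) * p[tok] + lam * counts[prev][tok] / n
        else:
            p_tok = p[tok]
        nll -= math.log(max(p_tok, 1e-12))
        counts[prev][tok] += 1                # update cache after scoring
    return nll / (len(tokens) - 1)
```

On repetitive in-document text the cache term sharpens the blended distribution, which is where the bpb gain would come from.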

Novel Contributions

  • BOS-reset bigram cache for evaluation-time probability blending
  • Score-first test-time training (TTT) applied after the quantization round-trip
  • DominationV2 architecture stack with BigramHash, SmearGate, XSA, and U-Net skip connections
  • Mixed int6/int8 quantization with zstd-22 compression