PR #1460
Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean)
by resouer
val_bpb
1.0827
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.99 MB
Training Techniques
Architecture
BigramHash
A zero-initialized embedding table, created at eval time and keyed by a hash of each (previous token, current token) bigram, is added to the token embeddings before RMSNorm; it starts as a no-op and is trained only during TTT.
parameters: {"vocab_size":16384,"embedding_dim":512,"hash_mod":16384,"hash_multiplier":2039}
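A minimal numpy sketch of how such an eval-time table could work. The hash form `(prev * 2039 + tok) % 16384` and the BOS id for the first position are assumptions; the record only lists the multiplier and modulus:

```python
import numpy as np

VOCAB_SIZE = 16384   # from the record's parameters
EMBED_DIM = 512
HASH_MOD = 16384
HASH_MULT = 2039

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
# Eval-time table: zero-initialized, so it is a no-op until TTT updates it.
bigram_emb = np.zeros((HASH_MOD, EMBED_DIM), dtype=np.float32)

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Hypothetical hash form; only the multiplier and modulus are stated.
    return (prev_tok * HASH_MULT + tok) % HASH_MOD

def embed(tokens):
    """Token embedding plus bigram-hash residual, added before RMSNorm."""
    out = np.empty((len(tokens), EMBED_DIM), dtype=np.float32)
    prev = 0  # hypothetical BOS id for the first position
    for i, t in enumerate(tokens):
        out[i] = tok_emb[t] + bigram_emb[bigram_hash(prev, t)]
        prev = t
    return out

x = embed([1, 2, 3])
```

Because the table starts at zero, `x` equals the plain token embeddings until TTT writes into `bigram_emb`.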
depth recurrence
Layers 4-5 form a weight-shared recurrent block whose loop body is applied twice per forward pass.
parameters: {"layers":[4,5],"repeats":2}
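A toy sketch of the control flow, with scalar "layers" standing in for transformer blocks (the layer functions here are hypothetical; only the layer indices and repeat count come from the record):

```python
from typing import Callable, List

def make_layer(i: int) -> Callable[[float], float]:
    # Stand-in for a transformer block: a simple function of the state.
    return lambda x: x + 0.01 * i

layers: List[Callable[[float], float]] = [make_layer(i) for i in range(12)]
RECUR_LAYERS, REPEATS = [4, 5], 2   # from the record's parameters

def forward(x: float) -> float:
    i = 0
    while i < len(layers):
        if i == RECUR_LAYERS[0]:
            block = [layers[j] for j in RECUR_LAYERS]
            for _ in range(REPEATS):        # weight-tied depth recurrence
                for layer in block:
                    x = layer(x)
            i = RECUR_LAYERS[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 12 weight sets this yields 14 layer applications per pass, since layers 4 and 5 run twice with the same weights.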
KV head count
Grouped-query attention in the SP8192 stack: 8 query heads share 4 key/value heads, halving the KV cache relative to full multi-head attention.
parameters: {"heads":8,"kv_heads":4}
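A minimal numpy sketch of the head grouping, assuming each KV head serves `heads // kv_heads = 2` consecutive query heads (the head dimension and grouping layout are assumptions):

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM, T = 8, 4, 16, 5   # heads/kv_heads from the record
GROUP = HEADS // KV_HEADS                    # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(HEADS, T, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
v = rng.normal(size=(KV_HEADS, T, HEAD_DIM))

def gqa(q, k, v):
    """Grouped-query attention: query head h reads KV head h // GROUP."""
    out = np.empty_like(q)
    for h in range(HEADS):
        kv = h // GROUP
        scores = q[h] @ k[kv].T / np.sqrt(HEAD_DIM)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
        out[h] = w @ v[kv]
    return out

out = gqa(q, k, v)
```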
XSA
Applies XSA across all layers in the SP8192 architecture.
parameters: {"layers":11}
U-Net skip connections
Layers 7-10 receive gated, U-Net-style skip connections from earlier activations, alongside the standard residual path.
parameters: {"layers":[7,8,9,10]}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3,"chunk_size":32000,"freeze":0}
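A sketch of the score-first loop: each chunk is scored with the current weights before the model adapts to it, so no chunk is ever evaluated after being seen. The hyperparameters come from the record; the unigram-over-bytes "model" is a deliberately tiny stand-in for the real network:

```python
import numpy as np

LR, MOMENTUM, EPOCHS, CHUNK = 0.005, 0.9, 3, 32000   # from the record

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=3 * CHUNK)   # toy byte stream

# Toy model: a single table of per-byte logits stands in for the network.
logits = np.zeros(256)
velocity = np.zeros_like(logits)

def bpb(chunk):
    p = np.exp(logits - logits.max()); p /= p.sum()
    return float(-np.log2(p[chunk]).mean())

total_bits, n = 0.0, 0
for start in range(0, len(data), CHUNK):
    chunk = data[start:start + CHUNK]
    total_bits += bpb(chunk) * len(chunk)   # score FIRST, with current weights
    n += len(chunk)
    for _ in range(EPOCHS):                 # THEN adapt on the scored chunk
        p = np.exp(logits - logits.max()); p /= p.sum()
        # Cross-entropy gradient: model distribution minus empirical counts.
        grad = p - np.bincount(chunk, minlength=256) / len(chunk)
        velocity = MOMENTUM * velocity - LR * grad   # SGD with momentum
        logits = logits + velocity
val_bpb = total_bits / n
```

With `freeze: 0`, all parameters (including the eval-time hash embedding) would be updated in the inner loop.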
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"lr":0.005}
LR Schedule
cosine decay
parameters: null
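Since the schedule's parameters are unspecified, here is a standard cosine-decay sketch using the TTT peak LR of 0.005; the floor of 0 and absence of warmup are assumptions:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 0.005,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps.

    No warmup and min_lr=0 are assumptions; the record lists no parameters.
    """
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The LR starts at 0.005, passes through 0.0025 at the halfway point, and decays to 0.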
Quantization
GPTQ
bits: 6
scope: all
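GPTQ itself compensates rounding error column by column using second-order (Hessian) statistics; as a simpler illustration of the 6-bit target format only, here is plain per-channel round-to-nearest quantization (a stand-in, not GPTQ):

```python
import numpy as np

BITS = 6                      # from the record
QMAX = 2 ** (BITS - 1) - 1    # symmetric range: codes in [-31, 31]

def quantize_rtn(w: np.ndarray):
    """Per-output-channel symmetric round-to-nearest quantization.

    A simpler stand-in for GPTQ, which additionally corrects the
    rounding error of each column using curvature information.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero channels
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_rtn(w)
w_hat = q * scale   # dequantized weights; error bounded by scale / 2
```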
Compression
lzma
level: null
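The final artifact stage is straightforward with Python's standard `lzma` module. The int8 codes below are hypothetical (real 6-bit codes would be bit-packed first), and the compression level is unstated in the record, so the default is used:

```python
import lzma

import numpy as np

# Hypothetical artifact payload: quantized weight codes as int8.
rng = np.random.default_rng(0)
codes = rng.integers(-32, 32, size=(512, 512), dtype=np.int8)

raw = codes.tobytes()
packed = lzma.compress(raw)          # level unstated in the record; default used
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)

assert np.array_equal(restored.reshape(codes.shape), codes)  # lossless round trip
```

LZMA is lossless, so the ~15.99 MB artifact decompresses to the exact quantized weights.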
Novel Contributions
- Eval-time hash embedding trained from zeros during score-first TTT
- Bigram-hash residual memory added before RMSNorm
- Record 3-seed mean val_bpb of 1.08269
- SP8192 stack combining parallel residuals, depth recurrence, skip gates, and compressed artifact packaging