PR #741 (open)

Record: Cosine TTT + Multi-Order N-gram Cache (3-seed mean val_bpb=0.9850)

by andrewbaggio1
val_bpb: 0.9850
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
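The int6 entry above can be sketched as symmetric 6-bit quantization over all weights. The record gives only the bit width and scope, so the per-tensor scale and round-to-nearest choice here are assumptions:

```python
# Hedged sketch of symmetric 6-bit quantization (bits: 6, scope: all).
# Assumes a single per-tensor scale; the PR may use per-channel scales.

def quantize_int6(weights):
    """Map float weights to integers in [-31, 31] plus one float scale."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6 symmetric bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]

w = [0.8, -0.31, 0.02, -0.79]
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)                    # lossy reconstruction
```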
Architecture
MLP3x
Transformer variant with expanded MLP projection layers as part of the custom architecture stack.
parameters: null
BigramHash
Hashed n-gram/count-sketch style component used for multi-order n-gram caching.
parameters: {"size":2048}
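A minimal sketch of a hashed count table of the kind BigramHash describes, with the listed size of 2048. The hash function, add-alpha smoothing, and collision behavior are illustrative assumptions, not the PR's implementation:

```python
# Hashed n-gram count table: contexts and (context, token) pairs are hashed
# into fixed-size arrays, so memory is bounded and collisions are tolerated.

class HashedNgramCache:
    def __init__(self, size=2048):
        self.size = size
        self.pair_counts = [0] * size    # hashed (context, token) counts
        self.ctx_counts = [0] * size     # hashed context totals

    def _pair_slot(self, context, token):
        return hash((tuple(context), token)) % self.size

    def _ctx_slot(self, context):
        return hash(tuple(context)) % self.size

    def update(self, context, token):
        self.pair_counts[self._pair_slot(context, token)] += 1
        self.ctx_counts[self._ctx_slot(context)] += 1

    def prob(self, context, token, vocab=256, alpha=1.0):
        # Add-alpha smoothed estimate; collisions make this approximate.
        num = self.pair_counts[self._pair_slot(context, token)] + alpha
        den = self.ctx_counts[self._ctx_slot(context)] + alpha * vocab
        return num / den
```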
SmearGate
Custom gating mechanism included in the architecture stack.
parameters: null
XSA
Custom attention/sequence module included in the architecture stack.
parameters: {"version":"XSA4"}
Partial RoPE
Partial rotary positional embedding variant.
parameters: null
KV head count
Grouped-query attention with reduced KV heads.
parameters: {"kv_heads":4}
Weight Averaging
EMA
parameters: null
SWA
parameters: null
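The EMA entry above amounts to keeping a decayed running average of the weights; the decay value below is an assumption, since the record lists no EMA parameters:

```python
# Sketch of EMA weight averaging over flat parameter lists.
# decay=0.9 is illustrative; typical values are closer to 0.999.

def ema_update(ema_params, params, decay=0.9):
    """In-place exponential moving average: ema <- decay*ema + (1-decay)*p."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p

params = [1.0, -2.0]
ema = list(params)
for _ in range(10):
    params = [p + 0.1 for p in params]   # stand-in for optimizer steps
    ema_update(ema, params)              # EMA lags behind the live weights
```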
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
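Sliding-window eval with the listed stride=64 and seq_len=2048 can be sketched as follows: each window sees up to seq_len tokens of context, but only its last stride tokens are scored, so every token is scored exactly once with long context. `nll_fn` stands in for the model and is an assumption of this sketch:

```python
import math

def sliding_window_bpb(tokens, nll_fn, seq_len=2048, stride=64):
    """Bits-per-byte via overlapping windows; nll_fn returns one NLL (nats)
    per token in the window it is given."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        window = tokens[max(0, end - seq_len):end]
        nlls = nll_fn(window)
        new = end - start                # tokens not scored by earlier windows
        total_nll += sum(nlls[-new:])    # score only the newly covered tokens
        n_scored += new
    return (total_nll / n_scored) / math.log(2)   # nats -> bits
```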
multi-order n-gram cache interpolation
parameters: {"orders":[2,3,4,5],"entropy_adaptive_alpha":true}
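The interpolation entry above can be sketched as blending model and n-gram predictions with a mixing weight that grows with model entropy. The listing only says the alpha is entropy-adaptive, so the specific entropy-to-alpha mapping and the uniform averaging over orders are assumptions:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def blend(model_probs, ngram_probs_by_order, max_alpha=0.5):
    """Mix model and multi-order n-gram distributions; trust the cache more
    when the model is uncertain (high entropy)."""
    h = entropy(model_probs)
    h_max = math.log(len(model_probs))
    alpha = max_alpha * (h / h_max if h_max > 0 else 0.0)
    # Average the available orders (e.g. 2..5) uniformly; the PR may
    # weight orders differently.
    n = len(ngram_probs_by_order)
    ngram = [sum(o[i] for o in ngram_probs_by_order) / n
             for i in range(len(model_probs))]
    return [(1 - alpha) * m + alpha * g for m, g in zip(model_probs, ngram)]
```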
Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate_schedule":"cosine","per_layer_lr_groups":true}
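The TTT schedule above (epochs=20, cosine, per-layer LR groups) can be sketched as a cosine decay applied per layer group. Only the schedule shape and group structure come from the listing; the base LR and per-layer multipliers are assumptions:

```python
import math

def cosine_lr(base_lr, epoch, total_epochs=20):
    """Cosine decay from base_lr at epoch 0 down to 0 at total_epochs."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

def per_layer_lrs(base_lr, layer_mults, epoch, total_epochs=20):
    """One LR per layer group, each scaled by its group multiplier."""
    lr = cosine_lr(base_lr, epoch, total_epochs)
    return [lr * m for m in layer_mults]
```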
LR Schedule
cosine decay
parameters: null
Initialization
OrthoInit
Orthogonal initialization used in the custom architecture stack.
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Combining cosine test-time training with multi-order n-gram cache interpolation
  • Entropy-adaptive alpha mixing between model and n-gram predictions
  • Score-first n-gram cache evaluation with single blended prediction per token
  • Single-pass cosine TTT adaptation with per-layer learning-rate groups
  • Breaking the 1.0 BPB barrier with a 3-seed mean val_bpb of 0.9850