PR #958

open

Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)

by shouryamaanjain
val_bpb
1.1382
Architecture
Transformer
Optimizer
Artifact Size
~15.5 MB

Training Techniques

Architecture
BigramHash
Uses a bigram hash component with 2048 buckets and 128-dimensional embeddings.
parameters: {"buckets":2048,"dimensions":128}
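A minimal sketch of such a component, assuming a simple multiplicative hash over (previous, current) token pairs; the actual hash function and how the features enter the residual stream are not specified in the submission, and the table would be learned rather than randomly fixed:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # parameters reported in the submission
rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in practice

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Hash the (previous, current) token pair into one of BUCKETS slots.
    # The mixing constant is illustrative, not the submission's scheme.
    return ((prev_tok * 1000003) ^ tok) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # One DIM-dimensional embedding per position, looked up by bigram id;
    # position 0 has no previous token, so it gets a zero vector.
    feats = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        feats[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return feats
```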
SmearGate
Gated blend of each position with the previous token's representation, with a separate mixing weight per dimension.
parameters: null
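One plausible reading of a per-dimension smear gate, assuming a sigmoid over a learned per-channel parameter (the gate's exact parameterization is not given):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each position's vector with the previous position's,
    using a separate learned mixing weight per channel.
    x: (seq, dim); gate_logits: (dim,) learned parameters (assumed form)."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))                 # sigmoid gate in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right by one token
    return (1.0 - g) * x + g * prev
```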
U-Net skip connections
U-Net style encoder-decoder with skip connections.
parameters: {"encoder_layers":5,"decoder_layers":6}
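A sketch of the skip-connection wiring, with layers as plain callables standing in for transformer blocks. The pairing is assumed to be last-in-first-out, and with 5 encoder vs 6 decoder layers one decoder layer necessarily goes without a skip; which one is an assumption here:

```python
import numpy as np

def unet_transformer(x, encoder_layers, decoder_layers):
    """Run encoder layers, stash each output, then add the stashed
    activations back (last-in-first-out) before each decoder layer.
    Real blocks would be attention + MLP; callables keep this self-contained."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer in decoder_layers:
        if skips:                  # 5 skips for 6 decoder layers: the
            x = x + skips.pop()    # last decoder layer gets none (assumed)
        x = layer(x)
    return x
```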
ReLU²
Uses the squared-ReLU (ReLU²) MLP activation.
parameters: null
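The activation itself is standard and can be sketched directly:

```python
import numpy as np

def relu2(x):
    # Squared ReLU: zero for negative inputs, x**2 for positive ones.
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # Transformer-style MLP block using the ReLU^2 nonlinearity.
    return relu2(x @ w_in) @ w_out
```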
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
Initialization
OrthoInit
Orthogonal initialization with depth scaling.
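A sketch of orthogonal initialization with depth scaling; the 1/sqrt(2·depth) scaling law is an assumption (a common residual-branch choice), as the submission does not state the exact factor:

```python
import numpy as np

def ortho_init(shape, depth, gain=1.0, rng=None):
    """Orthogonal weight init, scaled down with network depth.
    The 1/sqrt(2*depth) factor is assumed, not taken from the submission."""
    rng = rng or np.random.default_rng()
    m, n = shape
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a if m >= n else a.T)
    q = q * np.sign(np.diag(r))   # fix QR sign ambiguity for determinism
    if m < n:
        q = q.T
    return gain * q[:m, :n] / np.sqrt(2 * depth)
```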
Weight Averaging
EMA
parameters: {"decay":0.997}
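EMA weight averaging with the reported decay of 0.997 can be sketched as follows (parameters kept as plain floats here; real arrays would need copying on init):

```python
class EMA:
    """Exponential moving average of model parameters, decay=0.997
    as reported in the submission."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)   # snapshot of the initial parameters

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v
```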
Quantization
mixed int6/int8
bits: null
scope: all
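A sketch of the symmetric integer quantization round-trip; which tensors receive 6 vs 8 bits, and whether scales are per-tensor or per-channel, is not specified in the submission:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization onto a signed `bits`-wide grid
    (per-tensor scale is an assumption; the submission does not say)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6, 127 for int8
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights from the integer grid.
    return q.astype(np.float32) * scale
```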
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.0001}
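The "score-first" ordering means each evaluation segment is scored before the model adapts on it, so the reported loss never sees weights trained on that segment. The submission adapts the transformer (epochs=3, lr=1e-4); the toy count model below is a stand-in to keep the sketch self-contained:

```python
import math
from collections import Counter, defaultdict

def score_first_ttt(segments, vocab=256, epochs=3):
    """Score-first test-time training with a toy add-one bigram count
    model in place of the real transformer + gradient steps."""
    counts = defaultdict(Counter)
    total_nll, total_tok = 0.0, 0
    for seg in segments:
        # 1) score the segment with the current, not-yet-adapted model
        for prev, tok in zip(seg, seg[1:]):
            ctx = counts[prev]
            p = (ctx.get(tok, 0) + 1) / (sum(ctx.values()) + vocab)
            total_nll -= math.log(p)
            total_tok += 1
        # 2) only then adapt on it (repeated `epochs` times)
        for _ in range(epochs):
            for prev, tok in zip(seg, seg[1:]):
                counts[prev][tok] += 1
    return total_nll / max(total_tok, 1)
```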
Evaluation
sliding window eval
parameters: {"stride":64}
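Sliding-window evaluation advances a fixed context window by `stride` tokens and only scores the newly revealed tokens, so each token keeps close to a full window of context. A sketch, where `logprob_fn(ctx)` returning per-token log-probs is an assumed interface:

```python
import numpy as np

def sliding_window_nll(logprob_fn, tokens, window=512, stride=64):
    """Mean negative log-likelihood under sliding-window evaluation.
    `logprob_fn(ctx)` -> array of per-token log-probs (assumed API)."""
    nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        target_len = end - prev_end              # score only the new tokens
        lp = logprob_fn(tokens[begin:end])
        nll -= lp[-target_len:].sum()
        counted += target_len
        prev_end = end
        if end == len(tokens):
            break
    return nll / counted
```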
Other
other
BOS-reset bigram cache applied during evaluation, blending model probabilities with document-local bigram counts and resetting at BOS tokens.
parameters: {"alpha":0.2,"tau":8,"entropy_power":1}
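A sketch of the evaluation-time blend. The exact roles of `tau` and `entropy_power` are assumptions here: the cache weight is taken to grow with the context count as n/(n+tau) and with the model's normalized entropy raised to `entropy_power`, capped by `alpha`; `model_probs(i)` returning the model's distribution at step i is also an assumed interface:

```python
import math
from collections import Counter, defaultdict

def blended_stream_nll(model_probs, tokens, bos_id=0,
                       alpha=0.2, tau=8, entropy_power=1):
    """BOS-reset bigram cache: mix the model's distribution with
    document-local bigram counts, clearing the counts at each BOS."""
    counts = defaultdict(Counter)
    nll = 0.0
    for i in range(1, len(tokens)):
        prev, tok = tokens[i - 1], tokens[i]
        if prev == bos_id:
            counts.clear()                    # new document: reset the cache
        p = model_probs(i)                    # dict token -> probability
        n = sum(counts[prev].values())
        if n:
            h = -sum(q * math.log(q) for q in p.values() if q > 0)
            h_norm = h / math.log(len(p))     # normalized entropy in [0, 1]
            lam = alpha * (n / (n + tau)) * h_norm ** entropy_power
            p_tok = (1 - lam) * p[tok] + lam * counts[prev][tok] / n
        else:
            p_tok = p[tok]
        nll -= math.log(max(p_tok, 1e-12))
        counts[prev][tok] += 1                # update cache after scoring
    return nll / (len(tokens) - 1)
```

On repetitive in-document text the cache term sharpens the blended distribution, which is where the bpb gain would come from.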

Novel Contributions

  • BOS-reset bigram cache for evaluation-time probability blending
  • Score-first test-time training (TTT) applied after the quantization round-trip
  • DominationV2 architecture stack with BigramHash, SmearGate, XSA, and U-Net skip connections
  • Mixed int6/int8 quantization with zstd-22 compression