PR #761

open

Record: Score-First TTT + N-gram Backoff (3-seed mean val_bpb=0.9581)

by Asukabot0
val_bpb
0.9581
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.7 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied across all 11 layers: each token attends to other positions but not to itself, removing self-position bias.
parameters: {"layers":11}
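A minimal sketch of the attention mask XSA implies: causal attention with the diagonal excluded, so a token never attends to its own position. The fallback for position 0 (which would otherwise have nothing to attend to) is an assumption, not something the PR specifies.

```python
def xsa_mask(n):
    # Causal mask that also excludes the diagonal: position i may attend
    # to positions j < i but not to itself.
    # Assumption: position 0 falls back to attending itself so its row
    # is not empty; the PR does not state how this edge case is handled.
    mask = [[j < i for j in range(n)] for i in range(n)]
    mask[0][0] = True
    return mask
```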
LeakyReLU^2
Uses leaky_relu(x, 0.5).square() to preserve negative gradient flow.
parameters: {"negative_slope":0.5}
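The activation as described, `leaky_relu(x, 0.5).square()`, written out element-wise; squaring keeps the output non-negative while the leaky slope keeps gradients flowing for negative inputs.

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # leaky_relu(x, 0.5) followed by squaring, per the PR description.
    y = x if x >= 0 else negative_slope * x
    return y * y
```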
Value Residual
Layer 0 value output is mixed into subsequent layers via learned sigmoid gates.
parameters: null
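A sketch of the value-residual mixing, assuming a convex combination controlled by a learned sigmoid gate (the exact mixing form is an assumption; the PR only says layer 0's value output is mixed in via learned sigmoid gates).

```python
import math

def value_residual(v_l, v_0, gate_logit):
    # Mix the current layer's value output v_l with layer 0's value
    # output v_0: v = g * v_l + (1 - g) * v_0, where g = sigmoid of a
    # learned per-layer logit. The convex-combination form is an
    # assumption from the PR description.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * a + (1.0 - g) * b for a, b in zip(v_l, v_0)]
```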
Gated Attention
Per-head sigmoid gates on attention output.
parameters: null
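A sketch of per-head output gating: each head's attention output is scaled by a sigmoid of a learned per-head logit. Applying the gate before head concatenation is an assumption about placement.

```python
import math

def gated_attention_output(head_outputs, gate_logits):
    # Scale each head's output by sigmoid(gate_logit) for that head.
    # Assumption: the gate is applied per head before the output
    # projection; the PR only states "per-head sigmoid gates".
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return [[sigmoid(g) * x for x in h]
            for h, g in zip(head_outputs, gate_logits)]
```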
SmearGate
Learned gate that smears each token's representation with the preceding token's, carrying information forward by one position.
parameters: null
BigramHash
Bigram hashing feature with 4096 buckets.
parameters: {"buckets":4096}
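The bigram feature reduces to hashing (previous token, current token) pairs into 4096 buckets; the hash function below is a hypothetical stand-in (an odd multiplier and a modulus), with only the bucket count taken from the PR parameters.

```python
def bigram_bucket(prev_id, cur_id, buckets=4096):
    # Hash a (previous token, current token) pair into one of `buckets`
    # buckets. The multiplier 1000003 is an arbitrary odd constant
    # (assumption); 4096 buckets comes from the PR parameters.
    return (prev_id * 1000003 + cur_id) % buckets
```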
Partial RoPE
Applies rotary positional embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
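A sketch of partial RoPE for one head vector: rotary embeddings are applied to the first 16 of 64 head dimensions and the rest pass through unchanged. Which dimensions are rotated, and the frequency base, are assumptions; the 16/64 split comes from the PR parameters.

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dimensions of a head vector in pairs,
    # pass the remaining dims through unchanged.
    # Assumptions: first-dims-rotated layout and base=10000.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```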
MLP3x
Uses a 3x wider MLP.
parameters: {"multiplier":3}
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
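The grouping in GQA with 8 query heads and 4 KV heads: each consecutive pair of query heads shares one KV head.

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: each group of n_heads // n_kv_heads
    # query heads shares one KV head (8 heads / 4 KV heads = groups of 2).
    return q_head // (n_heads // n_kv_heads)
```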
U-Net skip connections
Skip connections inspired by U-Net are used in the transformer stack.
parameters: null
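A sketch of how U-Net-style pairing could look over the 11-layer stack: early layers are paired with late layers, with the early layer's output added back before its partner runs. The exact pairing scheme is an assumption; the PR does not specify it.

```python
def unet_skip_pairs(n_layers=11):
    # Pair layer i with layer n_layers - 1 - i, U-Net style.
    # Assumption: symmetric pairing; the middle layer of an odd stack
    # is left unpaired.
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]
```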
Initialization
OrthoInit
Orthogonal initialization used with SmearGate.
Weight Averaging
EMA
parameters: {"decay":0.997}
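The EMA weight-averaging update with the listed decay of 0.997:

```python
def ema_update(avg, new, decay=0.997):
    # Exponential moving average of weights:
    # avg <- decay * avg + (1 - decay) * new, with decay=0.997 per the PR.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```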
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
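A sketch of the warmdown schedule, assuming a constant LR followed by a linear decay to zero over the final 3000 steps (the decay shape is an assumption; only `warmdown_steps` comes from the PR).

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR, then linear warmdown to 0 over the last
    # `warmdown_steps` steps. Linear shape is an assumption.
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```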
Quantization
int6 per-row
bits: 6
scope: all
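A sketch of per-row int6 quantization: one scale per weight row, values rounded into a 6-bit signed range. Using the symmetric range [-31, 31] (excluding -32) is an assumption about how the 6-bit range is handled.

```python
def quantize_row_int6(row):
    # Per-row symmetric quantization: one float scale per row, values
    # clamped to [-31, 31]. Assumption: symmetric range excluding -32.
    scale = max(abs(v) for v in row) / 31.0 or 1.0
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    # Reconstruct approximate float weights from int6 codes and scale.
    return [v * scale for v in q]
```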
Compression
zstd
level: 16
Evaluation
sliding window eval
parameters: {"stride":64}
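A sketch of how stride-64 sliding-window evaluation partitions a token stream; the window length of 1024 is an assumption (only the stride comes from the PR). Typically only the trailing `stride` tokens of each window are scored, so every token is evaluated with the longest available left context.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # Evaluation windows advancing by `stride` tokens.
    # Assumption: window=1024; stride=64 is from the PR parameters.
    starts = list(range(0, max(1, n_tokens - window + 1), stride))
    return [(s, min(s + window, n_tokens)) for s in starts]
```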
Test-Time Training
score-first TTT
parameters: {"chunk_size_tokens":131000,"learning_rate":0.0001,"epochs":4,"freeze_first_blocks":2,"grad_clip":1}
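The core score-first TTT loop as a sketch: each chunk is scored with the current weights before the model adapts on it, so no chunk's score ever benefits from having trained on that same chunk. `score_fn` and `adapt_fn` are hypothetical stand-ins for the model's loss evaluation and gradient update.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=4):
    # Score-first test-time training: evaluate each chunk BEFORE
    # adapting on it. score_fn/adapt_fn are hypothetical stand-ins;
    # epochs=4 matches the PR parameters.
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))   # score with current weights
        for _ in range(epochs):
            adapt_fn(chunk)              # then train on the chunk
    return losses
```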
Other
n-gram backoff cache
Multi-order n-gram backoff cache with entropy-adaptive alpha mixing, using orders 2-7 and backward-looking cache updates only.
parameters: {"orders":[2,3,4,5,6,7]}
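A sketch of the entropy-adaptive mixing step: the n-gram cache distribution gets more weight when the model is uncertain (high entropy). `alpha_max` and `entropy_scale` are hypothetical knobs, and the orders-2 through 7 backoff itself would produce `ngram_probs`; the PR specifies only the orders and the entropy-adaptive behavior.

```python
def backoff_mix(model_probs, ngram_probs, entropy,
                alpha_max=0.3, entropy_scale=3.0):
    # Entropy-adaptive alpha mixing: higher model entropy -> more weight
    # on the n-gram cache distribution. alpha_max and entropy_scale are
    # hypothetical; the entropy-adaptive form follows the PR description.
    alpha = alpha_max * min(1.0, entropy / entropy_scale)
    return [(1 - alpha) * p + alpha * q
            for p, q in zip(model_probs, ngram_probs)]
```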

Novel Contributions

  • Score-first test-time training compliant with the issue constraints
  • Multi-order n-gram backoff cache with entropy-adaptive alpha
  • XSA applied to all 11 layers
  • LeakyReLU(0.5)^2 activation
  • Value Residual and Gated Attention integration
  • Int6 per-row quantization with zstd compression