PR #741 (open)

Record: Cosine TTT + Multi-Order N-gram Cache (3-seed mean val_bpb=0.9850)

by andrewbaggio1
val_bpb: 0.9850
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
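The int6 entry above can be sketched as symmetric 6-bit quantization over all weights. The record gives only the bit width and scope, so the per-tensor scale and round-to-nearest choice here are assumptions:

```python
# Hedged sketch of symmetric 6-bit quantization (bits: 6, scope: all).
# Assumes a single per-tensor scale; the PR may use per-channel scales.

def quantize_int6(weights):
    """Map float weights to integers in [-31, 31] plus one float scale."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6 symmetric bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]

w = [0.8, -0.31, 0.02, -0.79]
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)                    # lossy reconstruction
```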
Architecture
MLP3x
Transformer variant with expanded MLP projection layers as part of the custom architecture stack.
parameters: null
BigramHash
Hashed n-gram/count-sketch style component used for multi-order n-gram caching.
parameters: {"size":2048}
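A minimal sketch of a hashed count table of the kind BigramHash describes, with the listed size of 2048. The hash function, add-alpha smoothing, and collision behavior are illustrative assumptions, not the PR's implementation:

```python
# Hashed n-gram count table: contexts and (context, token) pairs are hashed
# into fixed-size arrays, so memory is bounded and collisions are tolerated.

class HashedNgramCache:
    def __init__(self, size=2048):
        self.size = size
        self.pair_counts = [0] * size    # hashed (context, token) counts
        self.ctx_counts = [0] * size     # hashed context totals

    def _pair_slot(self, context, token):
        return hash((tuple(context), token)) % self.size

    def _ctx_slot(self, context):
        return hash(tuple(context)) % self.size

    def update(self, context, token):
        self.pair_counts[self._pair_slot(context, token)] += 1
        self.ctx_counts[self._ctx_slot(context)] += 1

    def prob(self, context, token, vocab=256, alpha=1.0):
        # Add-alpha smoothed estimate; collisions make this approximate.
        num = self.pair_counts[self._pair_slot(context, token)] + alpha
        den = self.ctx_counts[self._ctx_slot(context)] + alpha * vocab
        return num / den
```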
SmearGate
Custom gating mechanism included in the architecture stack.
parameters: null
XSA
Custom attention/sequence module included in the architecture stack.
parameters: {"version":"XSA4"}
Partial RoPE
Partial rotary positional embedding variant.
parameters: null
KV head count
Grouped-query attention with reduced KV heads.
parameters: {"kv_heads":4}
Weight Averaging
EMA
parameters: null
SWA
parameters: null
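The EMA entry above amounts to keeping a decayed running average of the weights; the decay value below is an assumption, since the record lists no EMA parameters:

```python
# Sketch of EMA weight averaging over flat parameter lists.
# decay=0.9 is illustrative; typical values are closer to 0.999.

def ema_update(ema_params, params, decay=0.9):
    """In-place exponential moving average: ema <- decay*ema + (1-decay)*p."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p

params = [1.0, -2.0]
ema = list(params)
for _ in range(10):
    params = [p + 0.1 for p in params]   # stand-in for optimizer steps
    ema_update(ema, params)              # EMA lags behind the live weights
```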
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
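Sliding-window eval with the listed stride=64 and seq_len=2048 can be sketched as follows: each window sees up to seq_len tokens of context, but only its last stride tokens are scored, so every token is scored exactly once with long context. `nll_fn` stands in for the model and is an assumption of this sketch:

```python
import math

def sliding_window_bpb(tokens, nll_fn, seq_len=2048, stride=64):
    """Bits-per-byte via overlapping windows; nll_fn returns one NLL (nats)
    per token in the window it is given."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        window = tokens[max(0, end - seq_len):end]
        nlls = nll_fn(window)
        new = end - start                # tokens not scored by earlier windows
        total_nll += sum(nlls[-new:])    # score only the newly covered tokens
        n_scored += new
    return (total_nll / n_scored) / math.log(2)   # nats -> bits
```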
multi-order n-gram cache interpolation
parameters: {"orders":[2,3,4,5],"entropy_adaptive_alpha":true}
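The interpolation entry above can be sketched as blending model and n-gram predictions with a mixing weight that grows with model entropy. The listing only says the alpha is entropy-adaptive, so the specific entropy-to-alpha mapping and the uniform averaging over orders are assumptions:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def blend(model_probs, ngram_probs_by_order, max_alpha=0.5):
    """Mix model and multi-order n-gram distributions; trust the cache more
    when the model is uncertain (high entropy)."""
    h = entropy(model_probs)
    h_max = math.log(len(model_probs))
    alpha = max_alpha * (h / h_max if h_max > 0 else 0.0)
    # Average the available orders (e.g. 2..5) uniformly; the PR may
    # weight orders differently.
    n = len(ngram_probs_by_order)
    ngram = [sum(o[i] for o in ngram_probs_by_order) / n
             for i in range(len(model_probs))]
    return [(1 - alpha) * m + alpha * g for m, g in zip(model_probs, ngram)]
```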
Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate_schedule":"cosine","per_layer_lr_groups":true}
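The TTT schedule above (epochs=20, cosine, per-layer LR groups) can be sketched as a cosine decay applied per layer group. Only the schedule shape and group structure come from the listing; the base LR and per-layer multipliers are assumptions:

```python
import math

def cosine_lr(base_lr, epoch, total_epochs=20):
    """Cosine decay from base_lr at epoch 0 down to 0 at total_epochs."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

def per_layer_lrs(base_lr, layer_mults, epoch, total_epochs=20):
    """One LR per layer group, each scaled by its group multiplier."""
    lr = cosine_lr(base_lr, epoch, total_epochs)
    return [lr * m for m in layer_mults]
```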
LR Schedule
cosine decay
parameters: null
Initialization
OrthoInit
Orthogonal initialization used in the custom architecture stack.
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Combining cosine test-time training with multi-order n-gram cache interpolation
  • Entropy-adaptive alpha mixing between model and n-gram predictions
  • Score-first n-gram cache evaluation with single blended prediction per token
  • Single-pass cosine TTT adaptation with per-layer learning-rate groups
  • Breaking the 1.0 BPB barrier with a 3-seed mean val_bpb of 0.9850