PR #2067

open

Add shortchunk16 TTT record candidate

by jiashenggu
val_bpb: 1.0592
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,915,804 bytes

Training Techniques

Architecture
  • Gated Attention: uses gated attention in the training stack (parameters: null)
  • SmearGate: includes SparseGate/SmearGate-style gating in the model stack (parameters: null)
  • Weight tying: uses tied embeddings / weight tying (parameters: null)
Optimizer
  • Muon (weight_decay: null, momentum: 0.97, other_params: {"row_normalization": true})
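Muon keeps a momentum buffer per 2-D weight and orthogonalizes each update before applying it. A minimal sketch, assuming a plain heavy-ball buffer with the listed momentum of 0.97, using an SVD in place of Muon's usual Newton-Schulz iteration, and guessing one plausible reading of the row_normalization flag; the learning rate and that reading are assumptions, not taken from this PR:

```python
import numpy as np

def muon_step(weight, grad, buf, lr=0.02, momentum=0.97,
              row_normalization=True):
    # Heavy-ball momentum accumulation.
    buf = momentum * buf + grad
    # Orthogonalize the update. Real Muon approximates this with a
    # Newton-Schulz iteration; an SVD gives the orthogonal factor exactly.
    u, _, vt = np.linalg.svd(buf, full_matrices=False)
    update = u @ vt
    if row_normalization:
        # Hypothetical reading of the row_normalization flag:
        # rescale each row of the update to unit norm.
        update = update / (np.linalg.norm(update, axis=1, keepdims=True) + 1e-8)
    return weight - lr * update, buf
```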
Weight Averaging
  • EMA (parameters: {"decay": 0.9965})
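Weight averaging here keeps an exponential moving average of the parameters with decay 0.9965, and the EMA copy (not the raw weights) is what gets evaluated. A minimal sketch; the dict-of-tensors layout is illustrative, not the PR's actual parameter container:

```python
def ema_update(ema_params, params, decay=0.9965):
    # After each optimizer step, blend the current weights into the
    # running average: ema <- decay * ema + (1 - decay) * current.
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```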
Quantization
  • mixed int6/int7 (bits: 6, scope: weights and embeddings)
  • GPTQ (bits: null, scope: block weights)
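The PR does not specify how weights are split between int6 and int7, or GPTQ's bit width and block layout. The sketch below shows only plain symmetric round-to-nearest quantization at the listed 6 bits, as a baseline for what the artifact's quantized weights look like:

```python
import numpy as np

def quantize_int_n(w, bits=6):
    # Symmetric per-tensor quantization to signed n-bit integers.
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights from the int grid.
    return q.astype(np.float32) * scale
```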
Compression
  • custom (level: null)
Test-Time Training
  • score-first TTT (parameters: {"lora_rank": 224, "learning_rate": 0.00007, "alpha": 144, "beta1": 0, "beta2": 0.99, "local_lr_mult": 0.875})
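The score-first TTT adapts low-rank LoRA factors at eval time with rank 224 and alpha 144 (beta1 = 0 with beta2 = 0.99 suggests an RMSProp-like inner optimizer). A minimal sketch of just the LoRA weight delta, with hypothetical factor names A and B; the frozen base weight W is assumed, not taken from the PR:

```python
import numpy as np

def lora_delta(A, B, alpha=144, rank=224):
    # Standard LoRA parameterization: the effective weight is
    # W + (alpha / rank) * B @ A, where only A and B are trained
    # (here, trained per-document at eval time).
    return (alpha / rank) * (B @ A)
```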
Sequence Length
  • sequence_length (train_length: null, eval_length: 2560)
Other
  • Short-doc specialization for TTT: documents shorter than 2048 tokens use chunk size 16 (parameters: {"short_ttt_chunk_size": 16, "threshold_tokens": 2048})
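The short-document rule is a simple threshold on document length. A sketch of the chunking logic, where default_chunk_size is a hypothetical placeholder for whatever chunk size longer documents use; the PR only specifies the short-document value:

```python
def ttt_chunks(tokens, threshold_tokens=2048, short_ttt_chunk_size=16,
               default_chunk_size=64):
    # Documents under the threshold use the smaller chunk size for
    # test-time training; default_chunk_size is an assumed value.
    if len(tokens) < threshold_tokens:
        chunk = short_ttt_chunk_size
    else:
        chunk = default_chunk_size
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
```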

Novel Contributions

  • 3-seed record candidate for track_10min_16mb
  • SP8192 CaseOps/LQER/SparseGate/BOSFix training stack
  • Eval-only score-first ShortChunk16 LoRA TTT
  • Short-document TTT specialization with chunk size 16 for documents under 2048 tokens
  • Fresh quantized artifact used for eval-only TTT