PR #2067

open

Add shortchunk16 TTT record candidate

by jiashenggu
val_bpb: 1.0592
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,915,804 bytes

Training Techniques

Architecture
  • Gated Attention: uses gated attention in the training stack (parameters: null)
  • SmearGate: includes SparseGate/SmearGate-style gating in the model stack (parameters: null)
  • Weight tying: uses tied embeddings / weight tying (parameters: null)
Optimizer
  • Muon (weight_decay: null, momentum: 0.97, other_params: {"row_normalization": true})
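Muon keeps a momentum buffer per 2-D weight and orthogonalizes each update before applying it. A minimal sketch, assuming a plain heavy-ball buffer with the listed momentum of 0.97, using an SVD in place of Muon's usual Newton-Schulz iteration, and guessing one plausible reading of the row_normalization flag; the learning rate and that reading are assumptions, not taken from this PR:

```python
import numpy as np

def muon_step(weight, grad, buf, lr=0.02, momentum=0.97,
              row_normalization=True):
    # Heavy-ball momentum accumulation.
    buf = momentum * buf + grad
    # Orthogonalize the update. Real Muon approximates this with a
    # Newton-Schulz iteration; an SVD gives the orthogonal factor exactly.
    u, _, vt = np.linalg.svd(buf, full_matrices=False)
    update = u @ vt
    if row_normalization:
        # Hypothetical reading of the row_normalization flag:
        # rescale each row of the update to unit norm.
        update = update / (np.linalg.norm(update, axis=1, keepdims=True) + 1e-8)
    return weight - lr * update, buf
```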
Weight Averaging
  • EMA (parameters: {"decay": 0.9965})
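Weight averaging here keeps an exponential moving average of the parameters with decay 0.9965, and the EMA copy (not the raw weights) is what gets evaluated. A minimal sketch; the dict-of-tensors layout is illustrative, not the PR's actual parameter container:

```python
def ema_update(ema_params, params, decay=0.9965):
    # After each optimizer step, blend the current weights into the
    # running average: ema <- decay * ema + (1 - decay) * current.
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```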
Quantization
  • mixed int6/int7 (bits: 6, scope: weights and embeddings)
  • GPTQ (bits: null, scope: block weights)
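The PR does not specify how weights are split between int6 and int7, or GPTQ's bit width and block layout. The sketch below shows only plain symmetric round-to-nearest quantization at the listed 6 bits, as a baseline for what the artifact's quantized weights look like:

```python
import numpy as np

def quantize_int_n(w, bits=6):
    # Symmetric per-tensor quantization to signed n-bit integers.
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights from the int grid.
    return q.astype(np.float32) * scale
```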
Compression
  • custom (level: null)
Test-Time Training
  • score-first TTT (parameters: {"lora_rank": 224, "learning_rate": 0.00007, "alpha": 144, "beta1": 0, "beta2": 0.99, "local_lr_mult": 0.875})
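The score-first TTT adapts low-rank LoRA factors at eval time with rank 224 and alpha 144 (beta1 = 0 with beta2 = 0.99 suggests an RMSProp-like inner optimizer). A minimal sketch of just the LoRA weight delta, with hypothetical factor names A and B; the frozen base weight W is assumed, not taken from the PR:

```python
import numpy as np

def lora_delta(A, B, alpha=144, rank=224):
    # Standard LoRA parameterization: the effective weight is
    # W + (alpha / rank) * B @ A, where only A and B are trained
    # (here, trained per-document at eval time).
    return (alpha / rank) * (B @ A)
```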
Sequence Length
  • sequence_length (train_length: null, eval_length: 2560)
Other
  • Short-doc specialization for TTT: documents shorter than 2048 tokens use chunk size 16 (parameters: {"short_ttt_chunk_size": 16, "threshold_tokens": 2048})
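The short-document rule is a simple threshold on document length. A sketch of the chunking logic, where default_chunk_size is a hypothetical placeholder for whatever chunk size longer documents use; the PR only specifies the short-document value:

```python
def ttt_chunks(tokens, threshold_tokens=2048, short_ttt_chunk_size=16,
               default_chunk_size=64):
    # Documents under the threshold use the smaller chunk size for
    # test-time training; default_chunk_size is an assumed value.
    if len(tokens) < threshold_tokens:
        chunk = short_ttt_chunk_size
    else:
        chunk = default_chunk_size
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
```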

Novel Contributions

  • 3-seed record candidate for track_10min_16mb
  • SP8192 CaseOps/LQER/SparseGate/BOSFix training stack
  • Eval-only score-first ShortChunk16 LoRA TTT
  • Short-document TTT specialization with chunk size 16 for documents under 2048 tokens
  • Fresh quantized artifact used for eval-only TTT