PR #1932

open

Add SP8192 Family3+ QK5.75 TTT16K LR0.0075 (1.0796 BPB)

by PrzemyslaV88
val_bpb
1.0796
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,989,120 bytes

Training Techniques

Architecture
depth recurrence
3-layer depth recurrence over layers 3-5
parameters: {"layers":3,"start_layer":3,"end_layer":5}
weight tying
The SP8192 base family (tokenizer and model stack) uses a compact, tied-embeddings-style parameterization
parameters: null
GQA
8 attention heads with 4 KV heads
parameters: {"heads":8,"kv_heads":4}
parallel residuals
Parallel residual connections from layer 7 onward
parameters: {"start_layer":7}
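To illustrate the GQA entry above (8 attention heads, 4 KV heads), here is a minimal sketch of how query heads could share KV heads. It assumes the common contiguous grouping convention; the PR does not state its layout, and `kv_head_for_query_head` is a hypothetical helper name.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index used by a query head (assumed contiguous grouping)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= q_head < heads:
        raise ValueError("q_head out of range")
    group_size = heads // kv_heads  # 2 query heads per KV head for 8/4
    return q_head // group_size


# Mapping for the configuration listed above: heads=8, kv_heads=4.
mapping = [kv_head_for_query_head(h) for h in range(8)]
```

With these settings each KV head serves two query heads, which halves the size of the K and V projections and of the KV cache relative to full multi-head attention.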
Quantization
GPTQ
bits: 6 (matrices), 8 (token embeddings)
scope: matrices and embeddings
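As a rough picture of what a 6-bit integer grid means for a weight row, the sketch below uses plain symmetric round-to-nearest quantization. This is not GPTQ, which additionally compensates rounding error using second-order statistics, and it is not the PR's code. The function names and the per-row scale are assumptions made for illustration.

```python
def quantize_symmetric(row, bits=6):
    """Round-to-nearest symmetric quantization of one row (illustrative, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds ties to even; codes are clamped to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_symmetric(row, bits=6)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error that GPTQ-style methods try to redistribute more intelligently.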
Weight Averaging
EMA
parameters: {"decay":0.9965}
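The EMA entry above can be sketched as the standard exponential moving average update applied to a flat list of parameters. The helper name `ema_update` and the list-of-floats representation are assumptions; only the decay value 0.9965 comes from the listing.

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step: blend the running average with the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy example: the average starts at 1.0 and the live weight is 0.0.
ema = ema_update([1.0], [0.0])

# Rough averaging horizon in steps, 1 / (1 - decay).
horizon = 1.0 / (1.0 - 0.9965)
```

With decay 0.9965 the average effectively spans on the order of 286 recent steps.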
Compression
brotli
level: null
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":16384,"learning_rate":0.0075,"epochs":3}
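The ordering implied by "score-first" can be sketched as follows: each chunk is scored with the weights as they stood before the model saw it, and only then used for adaptation. This toy uses a one-parameter model with squared error and plain gradient descent as stand-ins for the language model and its optimizer; all names here are hypothetical, and only `chunk_tokens`, `learning_rate`, and `epochs` mirror the listed parameters.

```python
def score_first_ttt(tokens, chunk_tokens, learning_rate, epochs, theta=0.0):
    """Score each chunk before adapting on it; return (mean loss, final theta)."""
    total_loss, count = 0.0, 0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score with the weights as they were BEFORE this chunk was seen.
        total_loss += sum((x - theta) ** 2 for x in chunk)
        count += len(chunk)
        # 2) Only afterwards adapt on the already-scored chunk.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
            theta -= learning_rate * grad
    return total_loss / count, theta


# Tiny demo with dyadic values so the arithmetic is exact.
loss, theta = score_first_ttt([1.0] * 8, chunk_tokens=4, learning_rate=0.25, epochs=1)
```

Because the loss for a chunk is recorded before any update on that chunk, the reported score never benefits from training on the tokens being scored.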
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
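One common way to realize a sliding window evaluation with stride 64 and context length 2048 is sketched below. It assumes the first window scores all of its tokens and every later window scores only the tokens not yet scored; the PR does not spell out its exact windowing, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, context_length=2048, stride=64):
    """Return (window_start, window_end, score_from) triples covering every token once."""
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_length, n_tokens)
        # Tokens in [scored_upto, end) are scored; [start, scored_upto) is context only.
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


windows = sliding_windows(2148)
```

Under this scheme every token after the first window is scored with at least `context_length - stride` = 1984 tokens of preceding context.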
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
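A warmdown fraction of 0.72 can be read as: hold the learning rate for the first 28% of training, then decay it over the remaining 72%. The sketch below assumes a linear decay to zero, which the listing does not confirm; `warmdown_lr` is a hypothetical helper, and the default `base_lr` reuses the `matrix_lr` value of 0.022 from the optimizer entry.

```python
def warmdown_lr(step, total_steps, base_lr=0.022, warmdown_frac=0.72):
    """Constant LR, then an assumed linear decay to zero over the last warmdown_frac."""
    if warmdown_frac <= 0.0:
        return base_lr
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * max(0.0, remaining)


# Sample the schedule at the start, the middle of the warmdown, and the end.
lr_start = warmdown_lr(0, 1000)
lr_mid = warmdown_lr(640, 1000)
lr_end = warmdown_lr(1000, 1000)
```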
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022,"warmdown_frac":0.72}
Initialization
OrthoInit

Novel Contributions

  • Completed 8xH100 SP8192 Family3+ run for the 10min/16MB track
  • Raised QK gain initialization to 5.75
  • Used 16K-token score-first TTT with learning rate 0.0075
  • Applied GPTQ SDClip int6 matrices and int8 token embeddings
  • Used EMA with decay 0.9965 and brotli artifact compression
  • Achieved an exact val_bpb of 1.07955293 under legal score-first TTT