PR #1963

open

SP8192 + LongCtx NoQV QK5.25 Prefix2750 — 1.05827 BPB (seed 42)

by someone114514
val_bpb
1.0583
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,978,173 bytes

Training Techniques

Evaluation
long context eval
parameters: {"context_length":2560}
Test-Time Training
score-first TTT
parameters: {"rank":80,"mask":"no_qv","q_lora":0,"v_lora":0,"local_lr_mult":0.75,"beta2":0.99,"weight_decay":0.5,"phased":true,"num_phases":3,"prefix_docs":2750}
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1,"grad_clip_norm":0.3}
Regularization
weight decay
parameters: {"value":0.5}
Architecture
SmearGate
Smear gate enabled during training.
parameters: {"gate_window":12}
Gated Attention
Sparse attention gating and the gated-attention quantization gate are enabled.
parameters: {"sparse_attn_gate_scale":0.5,"sparse_attn_gate_init_std":0,"gated_attn_quant_gate":1}
Quantization
QAT
bits: null
scope: model weights
GPTQ
bits: null
scope: calibration
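Bit widths are listed as null, so the sketch below uses an illustrative 8-bit symmetric per-tensor fake-quantizer of the kind QAT applies to model weights during training; GPTQ's calibration pass is a separate post-training step not reproduced here:

```python
import torch

def fake_quantize_weight(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor quantize-dequantize with a straight-through
    # estimator: the forward pass sees quantized weights while gradients
    # flow through unchanged.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (wq - w).detach()
```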
Compression
lrzip
level: null
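The artifact is packaged with lrzip; a minimal invocation, using a placeholder filename and lrzip's default settings since no level is given:

```python
import subprocess

# "model_artifact.bin" is a placeholder name; lrzip writes model_artifact.bin.lrz.
subprocess.run(["lrzip", "model_artifact.bin"], check=True)
# Decompress later with: lrunzip model_artifact.bin.lrz
```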

Novel Contributions

  • Narrow follow-up to PR #1953 that changes only the phased-TTT prefix schedule
  • Increases PHASED_TTT_PREFIX_DOCS from 2500 to 2750
  • Keeps no_qv TTT mask, TTT local LR multiplier, QK gain stack, tokenizer, CaseOps data, and long-context evaluation unchanged
  • Provides reproducible CaseOps dataset downloader/validator and full seed 42 run log
  • Achieves a BPB nearly tied with PR #1953 while staying under the 16 MB artifact-size cap