PR #1963

open

SP8192 + LongCtx NoQV QK5.25 Prefix2750 — 1.05827 BPB (seed 42)

by someone114514
val_bpb
1.0583
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,978,173 bytes

Training Techniques

Evaluation
long context eval
parameters: {"context_length":2560}
Test-Time Training
score-first TTT
parameters: {"rank":80,"mask":"no_qv","q_lora":0,"v_lora":0,"local_lr_mult":0.75,"beta2":0.99,"weight_decay":0.5,"phased":true,"num_phases":3,"prefix_docs":2750}
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1,"grad_clip_norm":0.3}
Regularization
weight decay
parameters: {"value":0.5}
Architecture
SmearGate
Smear gate enabled during training.
parameters: {"gate_window":12}
Gated Attention
Sparse attention gating and the gated-attention quantization gate are enabled.
parameters: {"sparse_attn_gate_scale":0.5,"sparse_attn_gate_init_std":0,"gated_attn_quant_gate":1}
Quantization
QAT
bits: null
scope: model weights
GPTQ
bits: null
scope: calibration
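Bit widths are listed as null, so the sketch below uses an illustrative 8-bit symmetric per-tensor fake-quantizer of the kind QAT applies to model weights during training; GPTQ's calibration pass is a separate post-training step not reproduced here:

```python
import torch

def fake_quantize_weight(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor quantize-dequantize with a straight-through
    # estimator: the forward pass sees quantized weights while gradients
    # flow through unchanged.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (wq - w).detach()
```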
Compression
lrzip
level: null
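The artifact is packaged with lrzip; a minimal invocation, using a placeholder filename and lrzip's default settings since no level is given:

```python
import subprocess

# "model_artifact.bin" is a placeholder name; lrzip writes model_artifact.bin.lrz.
subprocess.run(["lrzip", "model_artifact.bin"], check=True)
# Decompress later with: lrunzip model_artifact.bin.lrz
```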

Novel Contributions

  • Narrow follow-up to PR #1953 that changes only the phased-TTT prefix schedule
  • Increases PHASED_TTT_PREFIX_DOCS from 2500 to 2750
  • Keeps no_qv TTT mask, TTT local LR multiplier, QK gain stack, tokenizer, CaseOps data, and long-context evaluation unchanged
  • Provides reproducible CaseOps dataset downloader/validator and full seed 42 run log
  • Achieves a BPB nearly tied with PR #1953 while staying under the 16 MB artifact-size cap