PR #2100

open

Follow-up: LongCtx No-QV Prefix3500 GlobalTTT LR 0.0008 — 1.05807 BPB seed42

by someone114514
val_bpb
1.0581
Architecture
Transformer
Optimizer
Artifact Size
15,977,802 B
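For readers unfamiliar with the val_bpb metric reported above: bits per byte is commonly derived from the mean per-token cross-entropy (in nats) and the token-to-byte ratio of the evaluation text. The sketch below shows that standard conversion. The function name and any sample inputs are illustrative assumptions, not values or code from this PR.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) into bits per byte.

    Hypothetical helper: the PR's own evaluation code is not shown on this page.
    """
    if num_tokens <= 0 or num_bytes <= 0:
        raise ValueError("num_tokens and num_bytes must be positive")
    # Total nats over the evaluation set, converted to bits.
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    # Normalize by the raw byte count of the evaluated text.
    return total_bits / num_bytes
```

Because the result is normalized by bytes rather than tokens, scores remain comparable across runs that use different tokenizers.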

Training Techniques

Quantization
GPTQ-lite
bits: null
scope: model weights
Architecture
SmearGate
BOS-fixed SmearGate and skip-gate-style sparse attention gating are inherited from the parent architecture.
parameters: null
weight tying
Not stated in the PR body. The base recipe is a Transformer-style model, and nothing in this PR indicates a change to weight tying.
parameters: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate_multiplier":0.8,"global_learning_rate":0.0008,"prefix_docs":3500}
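As a rough illustration of what a LoRA adapter contributes during test-time training, the sketch below adds a low-rank update on top of a frozen weight matrix. It uses tiny matrices and rank 1 so it runs in plain Python; the rank of 80 listed above would only change the inner dimension. All names (`matmul`, `lora_forward`, `alpha`) are hypothetical, and the scaling follows the common LoRA convention rather than anything confirmed by this PR.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha=None):
    """Compute x @ w + (alpha / r) * (x @ a) @ b.

    w: frozen base weight, shape (d_in, d_out)
    a: trainable down-projection, shape (d_in, r)
    b: trainable up-projection, shape (r, d_out)
    alpha defaults to r, giving a scale of 1 (an assumption for this sketch).
    """
    rank = len(b)
    scale = (alpha if alpha is not None else rank) / rank
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    return [[bv + scale * dv for bv, dv in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
a = [[1.0], [0.0]]             # rank-1 down-projection
b_zero = [[0.0, 0.0]]          # zero-initialized up-projection
b_tuned = [[0.5, 0.0]]         # after a hypothetical TTT update

out_zero = lora_forward(x, w, a, b_zero)
out_tuned = lora_forward(x, w, a, b_tuned)
```

With a zero-initialized `b`, the adapter leaves the base model's output unchanged, which is why adaptation can start from the quantized checkpoint's behavior and move away from it only as the adapter is trained.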
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Regularization
logit softcap
parameters: {"asym_logit_rescale":true}
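A logit softcap is usually implemented as a scaled tanh that bounds logits smoothly instead of clipping them. The sketch below shows only that standard symmetric form. The cap value of 30.0 is an illustrative assumption, and the asymmetric rescale flagged by `asym_logit_rescale` is not specified on this page, so it is not reproduced here.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly bound each logit to [-cap, cap] via cap * tanh(z / cap).

    The cap value is a placeholder; the PR's actual setting is not given here.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(z / cap) for z in logits]


small = softcap([0.1])           # near-identity for small logits
large = softcap([1000.0, -1000.0])  # saturates toward +/- cap
```

For logits much smaller than the cap the function is close to the identity, so typical predictions are barely affected while extreme values are compressed.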
Compression
lrzip
level: 9

Novel Contributions

  • Longer phased-TTT prefix of 3500 documents
  • Lower full-parameter global TTT learning rate of 0.0008
  • Retained the LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 stack
  • Seed-42 follow-up run showing improved TTT gain relative to the quantized checkpoint, though not a new record
  • Five-knob retune over the parent recipe: MATRIX_LR, LQER_RANK, LQER_ASYM_GROUP, LQER_TOP_K, and TTT_LOCAL_LR_MULT