PR #2100
open
Follow-up: LongCtx No-QV Prefix3500 GlobalTTT LR 0.0008 — 1.05807 BPB seed42
by someone114514
val_bpb
1.0581
Architecture
Transformer
Optimizer
—
Artifact Size
15,977,802 B
Training Techniques
Quantization
GPTQ-lite
bits: null
scope: model weights
Architecture
SmearGate
BOS-fixed SmearGate and skip-gate style sparse attention gating are part of the inherited architecture.
parameters: null
weight tying
Not stated in the PR body. The base recipe is a Transformer-style model, and nothing in this PR indicates a change to weight tying.
parameters: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate_multiplier":0.8,"global_learning_rate":0.0008,"prefix_docs":3500}
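For readers unfamiliar with LoRA-style test-time training, the rank parameter above can be illustrated with a minimal NumPy sketch of a low-rank adapter on a frozen linear layer. This is not the PR's implementation: the class name `LoRALinear`, the 512-wide layer, and the initialization scale are assumptions chosen for illustration, and the listed learning-rate settings are not modeled here.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical helper)."""

    def __init__(self, weight, rank, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen base weight, never updated during TTT
        # Trainable factors. B starts at zero so the adapter is a no-op at first.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))

    def forward(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

    def num_adapter_params(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size


rng = np.random.default_rng(42)
W = rng.normal(size=(512, 512))          # assumed layer shape, not from the PR
layer = LoRALinear(W, rank=80, rng=rng)  # rank 80 as listed above
x = rng.normal(size=(4, 512))
y = layer.forward(x)
print(y.shape, layer.num_adapter_params())
```

With rank 80 on an assumed 512x512 layer, the adapter holds far fewer trainable values than the full weight matrix, which is the usual motivation for adapting only the low-rank factors per document.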
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Regularization
logit softcap
parameters: {"asym_logit_rescale":true}
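As background, a logit softcap is commonly written as `cap * tanh(logits / cap)`, which bounds logits smoothly. The sketch below shows that standard form plus one possible asymmetric variant. The PR summary does not define `asym_logit_rescale`, so `asym_softcap`, its two cap arguments, and the cap values used are purely hypothetical and should not be read as the author's method.

```python
import numpy as np


def softcap(logits, cap):
    """Smoothly bound logits to (-cap, cap) via cap * tanh(logits / cap)."""
    logits = np.asarray(logits, dtype=np.float64)
    return cap * np.tanh(logits / cap)


def asym_softcap(logits, pos_cap, neg_cap):
    """Hypothetical asymmetric variant: separate caps for positive and negative logits."""
    logits = np.asarray(logits, dtype=np.float64)
    return np.where(
        logits >= 0,
        softcap(logits, pos_cap),  # positive side saturates at +pos_cap
        softcap(logits, neg_cap),  # negative side saturates at -neg_cap
    )


raw = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
capped = softcap(raw, cap=30.0)                       # illustrative cap value
asym = asym_softcap(raw, pos_cap=30.0, neg_cap=15.0)  # illustrative cap values
print(capped)
print(asym)
```

Small logits pass through almost unchanged while extreme values saturate, which limits how confident any single prediction can become.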
Compression
lrzip
level: 9
Novel Contributions
- Longer phased-TTT prefix of 3500 documents
- Lower full-parameter global TTT learning rate of 0.0008
- Retained the LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 stack
- Seed-42 follow-up run showing improved TTT gain relative to the quantized checkpoint, though not a new record
- Five-knob retune over the parent recipe: MATRIX_LR, LQER_RANK, LQER_ASYM_GROUP, LQER_TOP_K, and TTT_LOCAL_LR_MULT
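To make the headline metric and the "TTT gain" in the list above concrete, here is a minimal sketch, assuming `val_bpb` means validation bits per byte computed from a summed negative log-likelihood in nats. The function names and the numbers in the usage lines are illustrative assumptions, not values or code from the PR.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def ttt_gain(bpb_before, bpb_after):
    """BPB improvement from test-time training; positive means TTT helped."""
    return bpb_before - bpb_after


# Illustrative numbers only: 1,000 bytes of text scored at 740 nats in total.
example_bpb = bits_per_byte(total_nll_nats=740.0, total_bytes=1000)
example_gain = ttt_gain(bpb_before=1.07, bpb_after=1.06)
print(round(example_bpb, 4), round(example_gain, 4))
```

Because the denominator is bytes rather than tokens, scores computed this way stay comparable across tokenizers.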