PR #2100

open

Follow-up: LongCtx No-QV Prefix3500 GlobalTTT LR 0.0008 — 1.05807 BPB seed42

by someone114514

val_bpb

1.0581
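The card does not say how val_bpb is computed. As a rough sketch, assuming the usual definition of bits per byte (summed negative log-likelihood in nats, converted to bits and divided by the number of raw bytes scored), the arithmetic looks like the following. The function name and the example totals are hypothetical; only the 1.05807 figure comes from the PR title.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # nats -> bits is a division by ln(2); then normalize by byte count.
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical totals, chosen only to make the arithmetic easy to follow:
# exactly one bit of loss per byte over 1000 bytes.
example_bpb = bits_per_byte(total_nll_nats=1000.0 * math.log(2), total_bytes=1000)

# The title reports 1.05807 BPB; the card shows it rounded to four decimals.
card_value = round(1.05807, 4)
```

Under this definition the score does not depend on the tokenizer's vocabulary, because the normalizer is bytes rather than tokens.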

Architecture

Transformer

Optimizer

—

Artifact Size

15,977,802 B

Training Techniques

Quantization

GPTQ-lite

bits: null

scope: model weights

Architecture

SmearGate

BOS-fixed SmearGate and skip-gate-style sparse attention gating are inherited from the parent architecture.

parameters: null

weight tying

Not stated explicitly in the PR body. The base recipe is a Transformer-style model, and this PR shows no evidence of changes to weight tying.

parameters: null

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate_multiplier":0.8,"global_learning_rate":0.0008,"prefix_docs":3500}
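To give a sense of what rank 80 means in practice, here is a minimal sketch that parses the parameter blob above and counts the extra trainable weights a LoRA adapter of that rank would add to a single linear layer. The layer width (HYPOTHETICAL_DIM) is an assumption made for illustration; the card does not state the model's dimensions, nor which layers receive adapters.

```python
import json

# The parameter blob exactly as it appears on the card.
raw = '{"rank":80,"learning_rate_multiplier":0.8,"global_learning_rate":0.0008,"prefix_docs":3500}'
ttt = json.loads(raw)


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMPTION: a square 512-wide projection, used only to make the count concrete.
HYPOTHETICAL_DIM = 512
extra_weights = lora_param_count(HYPOTHETICAL_DIM, HYPOTHETICAL_DIM, ttt["rank"])
print(extra_weights)
```

The count grows linearly with rank, so the adapter stays far smaller than the d_in x d_out matrix it modifies as long as the rank is well below the layer width.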

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Regularization

logit softcap

parameters: {"asym_logit_rescale":true}
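For readers unfamiliar with logit softcapping, the common form squashes each logit through a scaled tanh so its magnitude never exceeds a chosen cap. The card only records asym_logit_rescale as true and gives no formula, so the asymmetric variant below (separate scale factors for positive and negative logits) is a guess at what such a rescale could look like, not the PR's implementation. The cap of 30.0 and both scale values are hypothetical.

```python
import math


def softcap(logit: float, cap: float) -> float:
    """Standard tanh softcap: output is bounded to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


def asym_softcap(logit: float, cap: float,
                 pos_scale: float = 1.0, neg_scale: float = 1.0) -> float:
    """ASSUMED asymmetric variant: rescale positive and negative sides separately."""
    scale = pos_scale if logit >= 0 else neg_scale
    return scale * softcap(logit, cap)


# Hypothetical settings, for illustration only.
CAP = 30.0
small = softcap(1.0, CAP)            # nearly unchanged for small logits
large = softcap(1000.0, CAP)         # saturates at the cap
neg = asym_softcap(-1000.0, CAP, pos_scale=1.0, neg_scale=0.5)
```

Small logits pass through almost unchanged while extreme ones saturate, which is why the technique is listed under regularization.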

Compression

lrzip

level: 9

Longer phased-TTT prefix of 3500 documents
Lower full-parameter global TTT learning rate of 0.0008
Retained the LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 stack
Seed-42 follow-up run showing improved TTT gain relative to the quantized checkpoint, though not a new record
Five-knob retune over the parent recipe: MATRIX_LR, LQER_RANK, LQER_ASYM_GROUP, LQER_TOP_K, and TTT_LOCAL_LR_MULT