PR #1953

RECORD (open)

Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean)

by andrewbaggio1
val_bpb
1.0586
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,992,914 B

Training Techniques

Sequence Length
sequence_length
train_length: null
eval_length: 2560
Test-Time Training
LoRA TTT
parameters: {"rank":80,"mask":"no_qv","q_lora":0,"v_lora":0,"local_lr_mult":0.75,"phased":true,"num_phases":3,"prefix_docs":2500}
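The `no_qv` mask (consistent with `q_lora: 0` and `v_lora: 0` above) suggests the Q and V LoRA paths are dropped during test-time training. A minimal sketch of that selection logic, assuming adapters would otherwise cover the Q/K/V/O attention projections; the helper name and zero-init of the up-projection are illustrative, not the PR's code:

```python
import numpy as np

def make_ttt_adapters(d_model, rank=80, mask="no_qv", rng=None):
    """Build LoRA adapter factors for attention projections, skipping those
    disabled by the TTT mask ('no_qv' is assumed to drop the Q and V paths,
    matching q_lora=0 and v_lora=0 in the recorded parameters)."""
    rng = rng or np.random.default_rng(0)
    disabled = {"no_qv": {"q", "v"}}.get(mask, set())
    adapters = {}
    for proj in ("q", "k", "v", "o"):
        if proj in disabled:
            continue  # base weights used unchanged for this projection
        A = rng.standard_normal((d_model, rank)) * 0.01  # down-projection
        B = np.zeros((rank, d_model))  # up-projection, zero-init => identity start
        adapters[proj] = (A, B)
    return adapters

adapters = make_ttt_adapters(512)  # only K and O carry adapters under no_qv
```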
Architecture
SmearGate
BOS-fixed smear gate used in the inherited base stack.
parameters: {"window":12}
Sparse Attention Gate
Sparse attention gating used in the inherited base stack.
parameters: null
LQER
Asymmetric rank-4 low-rank quantization-error recovery component.
parameters: {"rank":4,"top_k":3,"group_size":64,"asym_enabled":true,"asym_group":64}
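LQER-style recovery quantizes a weight matrix, then approximates the quantization residual with a low-rank factor. A hedged sketch using asymmetric per-group scales (group size 64) and a rank-4 SVD correction, matching the parameters above; this is illustrative, not the PR's implementation:

```python
import numpy as np

def lqer_compress(W, bits=6, rank=4, group_size=64):
    """Quantize W with asymmetric per-group (min/max) scales, then fit a
    rank-`rank` SVD approximation of the quantization error W - Q(W)."""
    qmax = 2**bits - 1
    Wg = W.reshape(-1, group_size)
    lo, hi = Wg.min(1, keepdims=True), Wg.max(1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((Wg - lo) / scale), 0, qmax)
    W_hat = (q * scale + lo).reshape(W.shape)          # dequantized weights
    U, S, Vt = np.linalg.svd(W - W_hat, full_matrices=False)
    A = U[:, :rank] * S[:rank]                          # (out, rank)
    B = Vt[:rank]                                       # (rank, in)
    return W_hat, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W_hat, A, B = lqer_compress(W)
err_q = np.linalg.norm(W - W_hat)            # plain quantization error
err_lq = np.linalg.norm(W - (W_hat + A @ B))  # with rank-4 correction
```

The rank-4 term stores only `(out + in) * rank` extra values per matrix, so the correction is cheap relative to the quantized weights.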
weight tying
Not explicitly mentioned in the submission text.
parameters: null
Quantization
GPTQ
bits: 6
scope: weights and embeddings
mixed int6/int7
bits: null
scope: weights and embeddings
AWQ-lite
bits: null
scope: GPTQ calibration
Regularization
logit softcap
parameters: {"asymmetric":true}
Evaluation
long context eval
parameters: {"context_length":2560}
Initialization
QK_GAIN_INIT
Per-head learnable Q-gain initialized to 5.25.
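A per-head learnable Q-gain initialized to 5.25 could look like the sketch below; the exact placement of the gain (scaling Q before the QK dot product) is an assumption based on the parameter name:

```python
import numpy as np

def init_qk_gain(num_heads, gain_init=5.25):
    """One learnable gain scalar per attention head, initialized to the
    record's value of 5.25 (placement in the attention score is assumed)."""
    return np.full((num_heads,), gain_init, dtype=np.float32)

def scaled_qk_scores(q, k, gain):
    # q, k: (heads, seq, head_dim); gain broadcasts per head over the scores
    d = q.shape[-1]
    return (gain[:, None, None] * q) @ k.transpose(0, 2, 1) / np.sqrt(d)
```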
Compression
brotli
level: null

Novel Contributions

  • Extended evaluation and TTT sequence length from 2048 to 2560.
  • Applied a no_qv TTT mask, disabling Q and V LoRA paths during test-time training.
  • Reduced TTT local learning-rate multiplier to 0.75.
  • Changed QK gain initialization to 5.25.
  • Stacked the new levers on top of the PR #1945 base with AWQ-lite and asymmetric logit rescale.
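Taken together, the bullets above amount to a small config delta over the #1945 base. The key names below are illustrative, not the repo's actual schema:

```python
# Hypothetical config delta relative to the PR #1945 base, collecting the
# levers listed in "Novel Contributions" (key names are illustrative).
delta = {
    "eval_length": 2560,        # extended from 2048
    "ttt": {
        "mask": "no_qv",        # disable Q and V LoRA paths at test time
        "local_lr_mult": 0.75,  # reduced local learning-rate multiplier
    },
    "qk_gain_init": 5.25,       # changed QK gain initialization
}
```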