PR #1953

RECORD (open)

Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean)

by andrewbaggio1
val_bpb
1.0586
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,992,914 B

Training Techniques

Sequence Length
sequence_length
train_length: null
eval_length: 2560
Test-Time Training
LoRA TTT
parameters: {"rank":80,"mask":"no_qv","q_lora":0,"v_lora":0,"local_lr_mult":0.75,"phased":true,"num_phases":3,"prefix_docs":2500}
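The `no_qv` mask (consistent with `q_lora: 0` and `v_lora: 0` above) suggests the Q and V LoRA paths are dropped during test-time training. A minimal sketch of that selection logic, assuming adapters would otherwise cover the Q/K/V/O attention projections; the helper name and zero-init of the up-projection are illustrative, not the PR's code:

```python
import numpy as np

def make_ttt_adapters(d_model, rank=80, mask="no_qv", rng=None):
    """Build LoRA adapter factors for attention projections, skipping those
    disabled by the TTT mask ('no_qv' is assumed to drop the Q and V paths,
    matching q_lora=0 and v_lora=0 in the recorded parameters)."""
    rng = rng or np.random.default_rng(0)
    disabled = {"no_qv": {"q", "v"}}.get(mask, set())
    adapters = {}
    for proj in ("q", "k", "v", "o"):
        if proj in disabled:
            continue  # base weights used unchanged for this projection
        A = rng.standard_normal((d_model, rank)) * 0.01  # down-projection
        B = np.zeros((rank, d_model))  # up-projection, zero-init => identity start
        adapters[proj] = (A, B)
    return adapters

adapters = make_ttt_adapters(512)  # only K and O carry adapters under no_qv
```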
Architecture
SmearGate
BOS-fixed smear gate used in the inherited base stack.
parameters: {"window":12}
Sparse Attention Gate
Sparse attention gating used in the inherited base stack.
parameters: null
LQER
Asymmetric rank-4 low-rank quantization-error recovery component.
parameters: {"rank":4,"top_k":3,"group_size":64,"asym_enabled":true,"asym_group":64}
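LQER-style recovery quantizes a weight matrix, then approximates the quantization residual with a low-rank factor. A hedged sketch using asymmetric per-group scales (group size 64) and a rank-4 SVD correction, matching the parameters above; this is illustrative, not the PR's implementation:

```python
import numpy as np

def lqer_compress(W, bits=6, rank=4, group_size=64):
    """Quantize W with asymmetric per-group (min/max) scales, then fit a
    rank-`rank` SVD approximation of the quantization error W - Q(W)."""
    qmax = 2**bits - 1
    Wg = W.reshape(-1, group_size)
    lo, hi = Wg.min(1, keepdims=True), Wg.max(1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((Wg - lo) / scale), 0, qmax)
    W_hat = (q * scale + lo).reshape(W.shape)          # dequantized weights
    U, S, Vt = np.linalg.svd(W - W_hat, full_matrices=False)
    A = U[:, :rank] * S[:rank]                          # (out, rank)
    B = Vt[:rank]                                       # (rank, in)
    return W_hat, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W_hat, A, B = lqer_compress(W)
err_q = np.linalg.norm(W - W_hat)            # plain quantization error
err_lq = np.linalg.norm(W - (W_hat + A @ B))  # with rank-4 correction
```

The rank-4 term stores only `(out + in) * rank` extra values per matrix, so the correction is cheap relative to the quantized weights.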
weight tying
Not explicitly mentioned in the submission text.
parameters: null
Quantization
GPTQ
bits: 6
scope: weights and embeddings
mixed int6/int7
bits: null
scope: weights and embeddings
AWQ-lite
bits: null
scope: GPTQ calibration
Regularization
logit softcap
parameters: {"asymmetric":true}
Evaluation
long context eval
parameters: {"context_length":2560}
Initialization
QK_GAIN_INIT
Per-head learnable Q-gain initialized to 5.25.
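A per-head learnable Q-gain initialized to 5.25 could look like the sketch below; the exact placement of the gain (scaling Q before the QK dot product) is an assumption based on the parameter name:

```python
import numpy as np

def init_qk_gain(num_heads, gain_init=5.25):
    """One learnable gain scalar per attention head, initialized to the
    record's value of 5.25 (placement in the attention score is assumed)."""
    return np.full((num_heads,), gain_init, dtype=np.float32)

def scaled_qk_scores(q, k, gain):
    # q, k: (heads, seq, head_dim); gain broadcasts per head over the scores
    d = q.shape[-1]
    return (gain[:, None, None] * q) @ k.transpose(0, 2, 1) / np.sqrt(d)
```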
Compression
brotli
level: null

Novel Contributions

  • Extended evaluation and TTT sequence length from 2048 to 2560.
  • Applied a no_qv TTT mask, disabling Q and V LoRA paths during test-time training.
  • Reduced TTT local learning-rate multiplier to 0.75.
  • Changed QK gain initialization to 5.25.
  • Stacked the new levers on top of the PR #1945 base with AWQ-lite and asymmetric logit rescale.
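Taken together, the bullets above amount to a small config delta over the #1945 base. The key names below are illustrative, not the repo's actual schema:

```python
# Hypothetical config delta relative to the PR #1945 base, collecting the
# levers listed in "Novel Contributions" (key names are illustrative).
delta = {
    "eval_length": 2560,        # extended from 2048
    "ttt": {
        "mask": "no_qv",        # disable Q and V LoRA paths at test time
        "local_lr_mult": 0.75,  # reduced local learning-rate multiplier
    },
    "qk_gain_init": 5.25,       # changed QK gain initialization
}
```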