PR #2060

open

Record: LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 — 1.05792 BPB 3-seed mean

by S0urC10ud
val_bpb: 1.0579
Architecture: Transformer
Optimizer: (not listed)
Artifact Size: 15,971,753 bytes

Training Techniques

Quantization
  • GPTQ (bits: null; scope: mixed precision weights)
  • mixed int7/int8 (bits: null; scope: embeddings and blocks; see the LQER-style quantization sketch below)
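
The title's "LQER g32/top4" and the retuned LQER_* knobs point to group-wise asymmetric quantization with a low-rank corrector fit to the quantization error. The sketch below illustrates that idea under simple assumptions: plain round-to-nearest rather than GPTQ's Hessian-based updates, group size 32, and a rank-2 SVD corrector. Function names and the fitting procedure are illustrative, not the submission's code.

```python
# Illustrative sketch (not the submission's code): group-wise asymmetric int8
# quantization plus a low-rank corrector on the quantization error (LQER-style).
# This uses plain round-to-nearest, not GPTQ's update rule.
import torch

def asym_group_quant(w: torch.Tensor, group: int = 32, bits: int = 8):
    """Per-group asymmetric quantization: each row is split into groups of
    `group` columns, each with its own scale and zero-point (g32 in this record)."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    wmin = wg.amin(dim=-1, keepdim=True)
    wmax = wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** bits - 1
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)
    deq = ((q - zero) * scale).reshape(out_f, in_f)
    return q.to(torch.uint8), scale, zero, deq

def lqer_corrector(w: torch.Tensor, deq: torch.Tensor, rank: int = 2):
    """Rank-`rank` SVD approximation of the quantization error W - deq(Q).
    (The record's LQER_TOP_K=4 'corrector slots' knob is not modeled here.)"""
    u, s, vh = torch.linalg.svd(w - deq, full_matrices=False)
    return u[:, :rank] * s[:rank], vh[:rank]

w = torch.randn(256, 256)
q, scale, zero, deq = asym_group_quant(w, group=32)
a, b = lqer_corrector(w, deq, rank=2)
w_hat = deq + a @ b   # dequantized weight plus low-rank error correction
print((w - w_hat).norm() / w.norm())
```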
Architecture
  • SmearGate: BOS-fixed smear gating used in sparse attention gating (parameters: null)
  • weight tying: tied embedding-style parameter sharing is inherited from the parent recipe if present; not explicitly changed here (parameters: null)
  • Gated Attention: sparse attention gating with skip gates and no-QV masking (parameters: null; a hedged sketch follows this list)
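
The SmearGate and Gated Attention entries describe per-token gates on the attention path, with the BOS position's gate held fixed and queries/values left unmasked ("no-QV"). Below is a minimal hedged sketch of such a gate, assuming one sigmoid gate per head that mixes the attention output with the skip path; the module layout and names are assumptions, not the record's implementation.

```python
# Illustrative sketch (assumed layout, not the record's code): per-token sigmoid
# gates on the attention output, with the BOS (position 0) gate pinned open.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Linear(dim, n_heads, bias=True)  # one gate logit per head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        att = att.transpose(1, 2).reshape(b, t, d)

        g = torch.sigmoid(self.gate(x))                     # (b, t, n_heads)
        bos_fixed = torch.ones_like(g[:, :1, :])            # "BOS-fixed": position 0 stays open
        g = torch.cat([bos_fixed, g[:, 1:, :]], dim=1)
        g = g.repeat_interleave(d // self.n_heads, dim=-1)  # expand per-head gate to channels
        # Skip gate: mix attention output with the identity/skip path per token.
        return self.proj(g * att + (1.0 - g) * x)

x = torch.randn(2, 16, 64)
print(GatedAttention(64, 4)(x).shape)  # torch.Size([2, 16, 64])
```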
Test-Time Training
  • LoRA TTT (parameters: {"rank":80,"local_lr_mult":0.8,"prefix_docs":3000}; sketch below)
Evaluation
  • long context eval (parameters: {"eval_length":2560}; see the bits-per-byte sketch below)
Sequence Length
  • sequence_length (train_length: null; eval_length: 2560)
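
The headline metric is validation bits per byte over 2560-token evaluation windows. A minimal sketch of that computation, assuming BPB is the total token negative log-likelihood in bits divided by the number of raw bytes scored:

```python
# Illustrative bits-per-byte evaluation over 2560-token windows (assumed metric
# definition; the record's exact evaluation harness is not shown here).
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, token_windows, byte_counts, eval_length=2560):
    """token_windows: iterable of (1, eval_length) LongTensors; byte_counts: raw bytes per window."""
    total_bits, total_bytes = 0.0, 0
    for tokens, n_bytes in zip(token_windows, byte_counts):
        assert tokens.shape[1] == eval_length
        logits = model(tokens[:, :-1])
        nll = F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten(), reduction="sum")
        total_bits += nll.item() / math.log(2)   # nats -> bits
        total_bytes += n_bytes
    return total_bits / total_bytes
```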
LR Schedule
  • warmdown (parameters: {"warmdown_frac":0.85,"min_lr":0.1}; sketch below)
Regularization
  • logit softcap (parameters: {"asym_logit_rescale":true}; sketch below)
Compression
  • lrzip (level: 9; sketch below)
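
The artifact is packed with lrzip at its highest level to fit the 16 MB budget. A minimal way to invoke that from Python is sketched below, assuming an lrzip binary on PATH; verify the -L and -o flags against your installed version.

```python
# Minimal sketch: compress the submission artifact with lrzip at level 9 via the CLI.
# Assumes an `lrzip` binary on PATH; flag names should be checked against `man lrzip`.
import subprocess

def compress_artifact(path: str) -> str:
    out = path + ".lrz"
    subprocess.run(["lrzip", "-L", "9", "-o", out, path], check=True)
    return out
```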
Other
  • Five-knob hyperparameter retune of the parent record: MATRIX_LR, LQER_RANK, LQER_ASYM_GROUP, LQER_TOP_K, and TTT_LOCAL_LR_MULT (parameters: {"matrix_lr":0.028,"lqer_rank":2,"lqer_asym_group":32,"lqer_top_k":4,"ttt_local_lr_mult":0.8}; sketch below)

Novel Contributions

  • Retuned five exposed environment-variable hyperparameters on top of the parent #2007 recipe.
  • Improved 3-seed mean validation BPB to 1.05792053 while staying under the 10-minute / 16 MB limits.
  • Used finer LQER asym-quant groups, lower LQER rank, one extra top-K corrector slot, slightly higher matrix LR, and hotter local TTT learning rate.
  • Kept train_gpt.py byte-identical to the parent submission.