PR #2060

open

Record: LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 — 1.05792 BPB 3-seed mean

by S0urC10ud
val_bpb: 1.0579
Architecture: Transformer
Optimizer: (not listed)
Artifact Size: 15,971,753 bytes

Training Techniques

Quantization
  • GPTQ (bits: null; scope: mixed precision weights)
  • mixed int7/int8 (bits: null; scope: embeddings and blocks; see the LQER-style quantization sketch below)
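
The title's "LQER g32/top4" and the retuned LQER_* knobs point to group-wise asymmetric quantization with a low-rank corrector fit to the quantization error. The sketch below illustrates that idea under simple assumptions: plain round-to-nearest rather than GPTQ's Hessian-based updates, group size 32, and a rank-2 SVD corrector. Function names and the fitting procedure are illustrative, not the submission's code.

```python
# Illustrative sketch (not the submission's code): group-wise asymmetric int8
# quantization plus a low-rank corrector on the quantization error (LQER-style).
# This uses plain round-to-nearest, not GPTQ's update rule.
import torch

def asym_group_quant(w: torch.Tensor, group: int = 32, bits: int = 8):
    """Per-group asymmetric quantization: each row is split into groups of
    `group` columns, each with its own scale and zero-point (g32 in this record)."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    wmin = wg.amin(dim=-1, keepdim=True)
    wmax = wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** bits - 1
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)
    deq = ((q - zero) * scale).reshape(out_f, in_f)
    return q.to(torch.uint8), scale, zero, deq

def lqer_corrector(w: torch.Tensor, deq: torch.Tensor, rank: int = 2):
    """Rank-`rank` SVD approximation of the quantization error W - deq(Q).
    (The record's LQER_TOP_K=4 'corrector slots' knob is not modeled here.)"""
    u, s, vh = torch.linalg.svd(w - deq, full_matrices=False)
    return u[:, :rank] * s[:rank], vh[:rank]

w = torch.randn(256, 256)
q, scale, zero, deq = asym_group_quant(w, group=32)
a, b = lqer_corrector(w, deq, rank=2)
w_hat = deq + a @ b   # dequantized weight plus low-rank error correction
print((w - w_hat).norm() / w.norm())
```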
Architecture
  • SmearGate: BOS-fixed smear gating used in sparse attention gating (parameters: null)
  • weight tying: tied embedding-style parameter sharing is inherited from the parent recipe if present; not explicitly changed here (parameters: null)
  • Gated Attention: sparse attention gating with skip gates and no-QV masking (parameters: null; a hedged sketch follows this list)
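
The SmearGate and Gated Attention entries describe per-token gates on the attention path, with the BOS position's gate held fixed and queries/values left unmasked ("no-QV"). Below is a minimal hedged sketch of such a gate, assuming one sigmoid gate per head that mixes the attention output with the skip path; the module layout and names are assumptions, not the record's implementation.

```python
# Illustrative sketch (assumed layout, not the record's code): per-token sigmoid
# gates on the attention output, with the BOS (position 0) gate pinned open.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Linear(dim, n_heads, bias=True)  # one gate logit per head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        att = att.transpose(1, 2).reshape(b, t, d)

        g = torch.sigmoid(self.gate(x))                     # (b, t, n_heads)
        bos_fixed = torch.ones_like(g[:, :1, :])            # "BOS-fixed": position 0 stays open
        g = torch.cat([bos_fixed, g[:, 1:, :]], dim=1)
        g = g.repeat_interleave(d // self.n_heads, dim=-1)  # expand per-head gate to channels
        # Skip gate: mix attention output with the identity/skip path per token.
        return self.proj(g * att + (1.0 - g) * x)

x = torch.randn(2, 16, 64)
print(GatedAttention(64, 4)(x).shape)  # torch.Size([2, 16, 64])
```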
Test-Time Training
  • LoRA TTT (parameters: {"rank":80,"local_lr_mult":0.8,"prefix_docs":3000}; sketch below)
Evaluation
  • long context eval (parameters: {"eval_length":2560}; see the bits-per-byte sketch below)
Sequence Length
  • sequence_length (train_length: null; eval_length: 2560)
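
The headline metric is validation bits per byte over 2560-token evaluation windows. A minimal sketch of that computation, assuming BPB is the total token negative log-likelihood in bits divided by the number of raw bytes scored:

```python
# Illustrative bits-per-byte evaluation over 2560-token windows (assumed metric
# definition; the record's exact evaluation harness is not shown here).
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, token_windows, byte_counts, eval_length=2560):
    """token_windows: iterable of (1, eval_length) LongTensors; byte_counts: raw bytes per window."""
    total_bits, total_bytes = 0.0, 0
    for tokens, n_bytes in zip(token_windows, byte_counts):
        assert tokens.shape[1] == eval_length
        logits = model(tokens[:, :-1])
        nll = F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten(), reduction="sum")
        total_bits += nll.item() / math.log(2)   # nats -> bits
        total_bytes += n_bytes
    return total_bits / total_bytes
```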
LR Schedule
  • warmdown (parameters: {"warmdown_frac":0.85,"min_lr":0.1}; sketch below)
Regularization
  • logit softcap (parameters: {"asym_logit_rescale":true}; sketch below)
Compression
  • lrzip (level: 9; sketch below)
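
The artifact is packed with lrzip at its highest level to fit the 16 MB budget. A minimal way to invoke that from Python is sketched below, assuming an lrzip binary on PATH; verify the -L and -o flags against your installed version.

```python
# Minimal sketch: compress the submission artifact with lrzip at level 9 via the CLI.
# Assumes an `lrzip` binary on PATH; flag names should be checked against `man lrzip`.
import subprocess

def compress_artifact(path: str) -> str:
    out = path + ".lrz"
    subprocess.run(["lrzip", "-L", "9", "-o", out, path], check=True)
    return out
```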
Other
  • Five-knob hyperparameter retune of the parent record: MATRIX_LR, LQER_RANK, LQER_ASYM_GROUP, LQER_TOP_K, and TTT_LOCAL_LR_MULT (parameters: {"matrix_lr":0.028,"lqer_rank":2,"lqer_asym_group":32,"lqer_top_k":4,"ttt_local_lr_mult":0.8}; sketch below)

Novel Contributions

  • Retuned five exposed environment-variable hyperparameters on top of the parent #2007 recipe.
  • Improved 3-seed mean validation BPB to 1.05792053 while staying under the 10-minute / 16 MB limits.
  • Used finer LQER asym-quant groups, lower LQER rank, one extra top-K corrector slot, slightly higher matrix LR, and hotter local TTT learning rate.
  • Kept train_gpt.py byte-identical to the parent submission.