PR #259

open

submission: QK Gain Init 1.2 + Sliding Window Eval (stride=64)

by outsourc-e
val_bpb: 1.5879
Architecture
Optimizer
Artifact Size

Training Techniques

Initialization
QK Gain Init
Uses QK_GAIN_INIT=1.2 instead of the default 1.5 to improve attention stability during short training runs.
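As a rough illustration of why a smaller gain can be gentler early in training, the sketch below applies a scalar gain to an RMS-normalized query before the usual scaled dot product. This is an assumption about how QK_GAIN_INIT enters the attention computation, not something the PR text confirms; `rms_normalize` and `attention_weights` are hypothetical helper names, and the vectors are toy values.

```python
import math

def rms_normalize(v):
    """Scale a vector so its root-mean-square equals 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / rms for x in v]

def attention_weights(q, keys, gain):
    """Softmax attention of one query over several keys.

    Assumption: the gain multiplies the normalized query, so it acts
    as an inverse temperature on the attention logits.
    """
    d = len(q)
    qn = [gain * x for x in rms_normalize(q)]
    logits = []
    for k in keys:
        kn = rms_normalize(k)
        logits.append(sum(a * b for a, b in zip(qn, kn)) / math.sqrt(d))
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy query and keys (illustrative values only).
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

w_low = attention_weights(q, keys, gain=1.2)   # value used in this PR
w_high = attention_weights(q, keys, gain=1.5)  # stated default
```

Under this assumption, the lower gain yields a flatter attention distribution at initialization: the best-matching key receives less weight with gain 1.2 than with gain 1.5.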
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
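One common way to lay out a sliding-window evaluation is sketched below: each window advances by `stride` tokens, and only the tokens not already scored by an earlier window are counted, so every token is scored exactly once with as much left context as the window allows. The function name `plan_windows`, the window length of 128, and the tuple layout are assumptions for illustration; the PR's own implementation may differ.

```python
def plan_windows(num_tokens, window, stride):
    """Return a list of (start, end, score_from) tuples.

    Each window covers tokens [start, end). Tokens in [score_from, end)
    are scored in that window; tokens in [start, score_from) serve only
    as context because an earlier window already scored them.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require 0 < stride <= window")
    plans = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        plans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return plans

# stride=64 comes from the PR; window=128 and 300 tokens are toy values.
plans = plan_windows(num_tokens=300, window=128, stride=64)
scored = sum(end - first for _, end, first in plans)
```

After the first window, each step scores `stride` new tokens while the remaining `window - stride` positions act as context, which is why a smaller stride costs more forward passes per scored token.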
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • QK gain initialization with QK_GAIN_INIT=1.2 for improved training stability
  • Sliding window evaluation with stride=64 and batch size of 32 sequences
  • Added forward_logits() and eval_val_sliding() for eval-only long-context scoring
  • Reported improved validation performance (val_bpb 1.5879) with both the initialization and evaluation changes applied
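To make the batched scoring concrete, here is a minimal sketch of how planned windows might be evaluated in groups of `batch_seqs` and reduced to a bits-per-byte figure. It does not reproduce the PR's forward_logits() or eval_val_sliding(); `score_windows`, `token_nll`, `uniform_logits`, the `(start, end, score_from)` plan format, and the per-token byte table are all hypothetical stand-ins, and the "model" is a uniform predictor used only so the result can be checked by hand.

```python
import math

def token_nll(logits, target):
    """Negative log-likelihood in nats of `target` under softmax(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def score_windows(inputs, targets, plans, logits_fn, batch_seqs, token_bytes):
    """Score planned windows in batches and return bits per byte.

    plans: (start, end, score_from) tuples; only positions in
           [score_from, end) contribute to the loss.
    logits_fn: hypothetical callable mapping a batch of input sequences
               to per-position next-token logits.
    token_bytes: byte length of each vocabulary entry (assumed lookup).
    """
    total_nll = 0.0
    total_bytes = 0
    for i in range(0, len(plans), batch_seqs):
        chunk = plans[i:i + batch_seqs]
        batch_logits = logits_fn([inputs[s:e] for s, e, _ in chunk])
        for (s, e, first), seq_logits in zip(chunk, batch_logits):
            for pos in range(first, e):
                total_nll += token_nll(seq_logits[pos - s], targets[pos])
                total_bytes += token_bytes[targets[pos]]
    # Convert nats to bits, then normalize by the number of bytes scored.
    return total_nll / (math.log(2) * total_bytes)

def uniform_logits(batch, vocab=4):
    """Toy stand-in for a model: equal logits for every vocabulary entry."""
    return [[[0.0] * vocab for _ in seq] for seq in batch]

inputs = [0, 1, 2, 3, 0, 1]
targets = [1, 2, 3, 0, 1, 2]      # next-token targets for `inputs`
plans = [(0, 4, 0), (2, 6, 4)]    # second window rescans 2 context tokens
token_bytes = [1, 1, 1, 1]        # assume one byte per token in this toy

bpb = score_windows(inputs, targets, plans, uniform_logits,
                    batch_seqs=32, token_bytes=token_bytes)
```

With a uniform predictor over four one-byte tokens, each token costs log2(4) = 2 bits, so the sketch returns 2.0 bits per byte regardless of how the windows are batched.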