PR #259
opensubmission: QK Gain Init 1.2 + Sliding Window Eval (stride=64)
by outsourc-eView on GitHub
val_bpb
1.5879
Architecture
—
Optimizer
—
Artifact Size
—
Training Techniques
Initialization
QK Gain Init
Uses QK_GAIN_INIT=1.2 instead of the default 1.5 to improve attention stability during short training runs.
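The effect of a smaller gain can be illustrated with a minimal sketch. It assumes the gain is a scalar that multiplies the query-key logits before the softmax, and that it is read from a QK_GAIN_INIT environment variable with the default of 1.5 mentioned above. The helper names (read_qk_gain_init, attention_weights) and the toy vectors are hypothetical and are not taken from the PR. In the actual training script the gain may be a learnable per-head parameter rather than a fixed constant.

```python
import math
import os


def read_qk_gain_init(default=1.5):
    """Read the initial QK gain from the environment (hypothetical helper)."""
    return float(os.environ.get("QK_GAIN_INIT", default))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, gain):
    """Attention weights of one query over several keys.

    Assumption: the gain scales the usual 1/sqrt(d) dot-product logits.
    """
    scale = 1.0 / math.sqrt(len(q))
    logits = [gain * scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(logits)


# Toy query and keys, chosen only for illustration.
q = [1.0, 0.0, -1.0, 0.5]
keys = [
    [1.0, 0.0, -1.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.5, 1.0, 0.0],
]

w_default = attention_weights(q, keys, gain=1.5)
w_lower = attention_weights(q, keys, gain=1.2)

# A smaller gain yields a flatter (less peaked) attention distribution.
print(max(w_default), max(w_lower))
```

In this toy setting the lower gain produces a less peaked softmax at initialization, which is one plausible reading of the "attention stability" claim. The sketch does not reproduce the PR's training results.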
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
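A sliding window evaluation with these parameters could be planned roughly as follows. This is a sketch under stated assumptions, not the PR's eval_val_sliding(): the window length seq_len is a hypothetical parameter (the summary lists eval_length as null), each window after the first scores only the tokens not already scored by an earlier window, and val_bpb is assumed to mean bits per byte. The function names sliding_windows, batches, and bits_per_byte are illustrative.

```python
import math


def sliding_windows(n_tokens, seq_len, stride=64):
    """Return (start, end, score_from) triples covering n_tokens positions.

    The model would see tokens[start:end] but only tokens[score_from:end]
    count toward the loss, so every position is scored exactly once.
    """
    if n_tokens <= 0 or seq_len <= 0 or not 0 < stride <= seq_len:
        raise ValueError("need n_tokens > 0 and 0 < stride <= seq_len")
    windows = []
    scored_upto = 0  # first position not yet scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


def batches(items, batch_seqs=32):
    """Group items into lists of at most batch_seqs elements."""
    for i in range(0, len(items), batch_seqs):
        yield items[i:i + batch_seqs]


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Small example: 10 tokens, hypothetical window length 4, stride 2.
for start, end, score_from in sliding_windows(10, seq_len=4, stride=2):
    print(f"window [{start}:{end}] scores positions [{score_from}:{end}]")
```

With stride=64, every window after the first contributes at most 64 newly scored positions, and each of those positions sees the preceding tokens in its window as context. A smaller stride gives more context per scored token at the cost of more forward passes, which batch_seqs=32 helps amortize.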
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- QK gain initialization with QK_GAIN_INIT=1.2 for improved training stability
- Sliding window evaluation with stride=64 and batch_seqs=32 (32 sequences per evaluation batch)
- Added forward_logits() and eval_val_sliding() for eval-only long-context scoring
- Reports improved validation performance, attributed to both the initialization and the evaluation changes