PR #1103

open

Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation

by abaybektursun
val_bpb: 1.1147
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Evaluation
sliding window eval
parameters: {"windows":32}
long context eval
parameters: {"context_length":4096}
kNN-LM
parameters: {"layers":1,"k":64,"distance":"L2","store_size":2000000}
kNN-LM
parameters: {"layers":11,"similarity":"cosine","store_size":2000000}
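The PR records only the kNN-LM parameters, not code; below is a minimal numpy sketch of the standard kNN-LM interpolation (Khandelwal et al.) covering both distance options listed above. The function name, interpolation weight `lam`, and datastore shapes are illustrative, not taken from the PR.

```python
import numpy as np

def knn_lm_probs(query, keys, values, model_probs, k=64,
                 lam=0.25, distance="L2"):
    """Interpolate the model's next-token distribution with a kNN
    distribution from a datastore of (hidden state, next token) pairs."""
    if distance == "L2":
        d = np.sum((keys - query) ** 2, axis=1)
    else:  # cosine distance, as in the layer-11 variant above
        kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        d = 1.0 - kn @ (query / np.linalg.norm(query))
    idx = np.argsort(d)[:k]
    # Softmax over negative distances of the k retrieved neighbors.
    w = np.exp(np.min(d[idx]) - d[idx])
    w /= w.sum()
    knn = np.zeros_like(model_probs)
    np.add.at(knn, values[idx], w)  # scatter-add neighbor weights per token
    return (1 - lam) * model_probs + lam * knn
```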
Quantization
GPTQ
bits: mixed
scope: attention int4, MLP_down int8
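A sketch of the bit reallocation named in the scope above. Note this uses plain round-to-nearest quantization to illustrate the per-module bit widths; real GPTQ additionally corrects rounding error column-by-column using second-order (Hessian) information. The module names are hypothetical.

```python
import numpy as np

# Hypothetical bit allocation matching the scope: bits are reallocated
# from attention (int4) to the MLP down projection (int8).
BITS = {"attn.qkv": 4, "attn.proj": 4, "mlp.down": 8}

def quantize_rtn(w, bits):
    """Symmetric per-tensor round-to-nearest quantization (RTN),
    returning the dequantized weights for error inspection."""
    qmax = 2.0 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def quantize_blocks(weights):
    """Quantize each named module with its assigned bit width."""
    return {name: quantize_rtn(w, BITS[name]) for name, w in weights.items()}
```

Spending the attention savings on MLP_down only pays off if MLP_down is the precision bottleneck; per the results below, uniform int6 remained locally optimal.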
Regularization
loss truncation
parameters: {"percentile":95}
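The 95th-percentile truncation above can be sketched as follows, assuming the standard per-batch formulation of loss truncation (Kang & Hashimoto, 2020); the function name is illustrative.

```python
import numpy as np

def truncated_loss(token_losses, percentile=95):
    """Loss truncation: drop the highest-loss tokens above the given
    percentile so noisy targets do not dominate the gradient.
    Applied per batch during training."""
    cutoff = np.percentile(token_losses, percentile)
    keep = token_losses <= cutoff
    return token_losses[keep].mean()
```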
Sequence Length
sequence_length
train_length: 2048
eval_length: 4096

Novel Contributions

  • Evaluated eval-time interventions on top of the PR #1019 stack; all tested methods came back negative
  • Tested single-layer and multi-layer kNN-LM retrieval augmentation
  • Tested sliding-window logit averaging and extended-context evaluation
  • Explored mixed-precision GPTQ by reallocating bits from attention to MLP_down
  • Tested loss truncation at the 95th percentile during training
  • Reported that uniform int6 quantization and standard cross-entropy remained locally optimal
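The sliding-window logit averaging tested above can be sketched as follows. The PR records only `windows: 32`; the `window`/`stride` defaults and the `logits_fn` interface here are placeholders.

```python
import numpy as np

def sliding_window_logits(logits_fn, tokens, window=2048, stride=64):
    """Average the logits each position receives from every overlapping
    window, instead of taking a single forward pass. `logits_fn` maps a
    token window to a (len(window), vocab) logit array. Assumes stride
    divides len(tokens) - window so every position is covered."""
    T = len(tokens)
    acc, cnt = None, np.zeros(T)
    for s in range(0, max(T - window, 0) + 1, stride):
        out = logits_fn(tokens[s:s + window])
        if acc is None:
            acc = np.zeros((T, out.shape[1]))
        acc[s:s + out.shape[0]] += out
        cnt[s:s + out.shape[0]] += 1
    return acc / cnt[:, None]
```

Averaging can only help if window position systematically shifts the logits; with a model trained at a fixed 2048-token context, the PR found no benefit.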