PR #1103

open

Non-record: Negative results — eval-time interventions, mixed-precision GPTQ, loss truncation

by abaybektursun
val_bpb: 1.1147
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

Evaluation
sliding window eval
parameters: {"windows":32}
long context eval
parameters: {"context_length":4096}
kNN-LM
parameters: {"layers":1,"k":64,"distance":"L2","store_size":2000000}
kNN-LM
parameters: {"layers":11,"similarity":"cosine","store_size":2000000}
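The PR records only the kNN-LM parameters, not code; below is a minimal numpy sketch of the standard kNN-LM interpolation (Khandelwal et al.) covering both distance options listed above. The function name, interpolation weight `lam`, and datastore shapes are illustrative, not taken from the PR.

```python
import numpy as np

def knn_lm_probs(query, keys, values, model_probs, k=64,
                 lam=0.25, distance="L2"):
    """Interpolate the model's next-token distribution with a kNN
    distribution from a datastore of (hidden state, next token) pairs."""
    if distance == "L2":
        d = np.sum((keys - query) ** 2, axis=1)
    else:  # cosine distance, as in the layer-11 variant above
        kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        d = 1.0 - kn @ (query / np.linalg.norm(query))
    idx = np.argsort(d)[:k]
    # Softmax over negative distances of the k retrieved neighbors.
    w = np.exp(np.min(d[idx]) - d[idx])
    w /= w.sum()
    knn = np.zeros_like(model_probs)
    np.add.at(knn, values[idx], w)  # scatter-add neighbor weights per token
    return (1 - lam) * model_probs + lam * knn
```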
Quantization
GPTQ
bits: mixed
scope: attention int4, MLP_down int8
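A sketch of the bit reallocation named in the scope above. Note this uses plain round-to-nearest quantization to illustrate the per-module bit widths; real GPTQ additionally corrects rounding error column-by-column using second-order (Hessian) information. The module names are hypothetical.

```python
import numpy as np

# Hypothetical bit allocation matching the scope: bits are reallocated
# from attention (int4) to the MLP down projection (int8).
BITS = {"attn.qkv": 4, "attn.proj": 4, "mlp.down": 8}

def quantize_rtn(w, bits):
    """Symmetric per-tensor round-to-nearest quantization (RTN),
    returning the dequantized weights for error inspection."""
    qmax = 2.0 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def quantize_blocks(weights):
    """Quantize each named module with its assigned bit width."""
    return {name: quantize_rtn(w, BITS[name]) for name, w in weights.items()}
```

Spending the attention savings on MLP_down only pays off if MLP_down is the precision bottleneck; per the results below, uniform int6 remained locally optimal.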
Regularization
loss truncation
parameters: {"percentile":95}
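The 95th-percentile truncation above can be sketched as follows, assuming the standard per-batch formulation of loss truncation (Kang & Hashimoto, 2020); the function name is illustrative.

```python
import numpy as np

def truncated_loss(token_losses, percentile=95):
    """Loss truncation: drop the highest-loss tokens above the given
    percentile so noisy targets do not dominate the gradient.
    Applied per batch during training."""
    cutoff = np.percentile(token_losses, percentile)
    keep = token_losses <= cutoff
    return token_losses[keep].mean()
```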
Sequence Length
sequence_length
train_length: 2048
eval_length: 4096

Novel Contributions

  • Evaluated eval-time interventions on top of the PR #1019 stack; all tested methods came back negative
  • Tested single-layer and multi-layer kNN-LM retrieval augmentation
  • Tested sliding-window logit averaging and extended-context evaluation
  • Explored mixed-precision GPTQ by reallocating bits from attention to MLP_down
  • Tested loss truncation at the 95th percentile during training
  • Reported that uniform int6 quantization and standard cross-entropy remained locally optimal
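The sliding-window logit averaging tested above can be sketched as follows. The PR records only `windows: 32`; the `window`/`stride` defaults and the `logits_fn` interface here are placeholders.

```python
import numpy as np

def sliding_window_logits(logits_fn, tokens, window=2048, stride=64):
    """Average the logits each position receives from every overlapping
    window, instead of taking a single forward pass. `logits_fn` maps a
    token window to a (len(window), vocab) logit array. Assumes stride
    divides len(tokens) - window so every position is covered."""
    T = len(tokens)
    acc, cnt = None, np.zeros(T)
    for s in range(0, max(T - window, 0) + 1, stride):
        out = logits_fn(tokens[s:s + window])
        if acc is None:
            acc = np.zeros((T, out.shape[1]))
        acc[s:s + out.shape[0]] += out
        cnt[s:s + out.shape[0]] += 1
    return acc / cnt[:, None]
```

Averaging can only help if window position systematically shifts the logits; with a model trained at a fixed 2048-token context, the PR found no benefit.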