PR #199

closed

Non-record: SWA and doc-isolated eval ablation — two negative findings at stride=64

by mrdavtan
val_bpb: 1.1929
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,819,113 bytes

Training Techniques

Weight Averaging
SWA
parameters: {"snapshots":73,"sample_every_steps":50,"start_step":10000,"accumulation_dtype":"float32"}
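For readers unfamiliar with SWA, the parameters above can be read as a snapshot schedule plus a running mean over the sampled weights. The following is a minimal sketch, not the PR's code: `snapshot_steps` and `RunningAverage` are hypothetical names, weights are modelled as a flat list of Python floats, and it assumes the first snapshot is taken exactly at `start_step`.

```python
def snapshot_steps(start_step=10_000, sample_every_steps=50, snapshots=73):
    """Training steps at which a snapshot would be taken.

    Assumption: the first snapshot lands exactly on start_step.
    """
    return [start_step + i * sample_every_steps for i in range(snapshots)]


class RunningAverage:
    """Incremental mean over weight snapshots (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = [float(w) for w in weights]
        else:
            # mean_k = mean_{k-1} + (w_k - mean_{k-1}) / k
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


steps = snapshot_steps()
avg = RunningAverage()
avg.update([1.0, 2.0])
avg.update([3.0, 4.0])
```

Under that assumption, the 73 snapshots would span steps 10000 through 13600.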
Evaluation
sliding window eval
parameters: {"stride":64}
doc-isolated sliding window eval
parameters: {"stride":64}
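The difference between the two evaluation modes can be sketched as a difference in how scoring windows are built. This is an illustrative sketch under stated assumptions, not the PR's evaluation code: the function names, the `context_len` parameter, and the `(ctx_start, score_start, score_end)` tuple layout are all hypothetical. Each window scores `stride` new tokens while conditioning on up to `context_len` preceding tokens; in the doc-isolated variant the context is never allowed to reach back past the start of the current document.

```python
def sliding_windows(n_tokens, context_len, stride, offset=0):
    """Return (ctx_start, score_start, score_end) windows over a token range.

    Every token is scored exactly once; each window may see up to
    context_len tokens ending at score_end.
    """
    if stride > context_len:
        raise ValueError("stride must not exceed context_len")
    windows = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - context_len)
        windows.append((offset + ctx_start,
                        offset + score_start,
                        offset + score_end))
    return windows


def doc_isolated_windows(doc_lengths, context_len, stride):
    """Same windows, but built per document so context never crosses a boundary."""
    windows = []
    offset = 0
    for length in doc_lengths:
        windows.extend(sliding_windows(length, context_len, stride, offset))
        offset += length
    return windows


# Two documents of 100 and 150 tokens, concatenated into one 250-token stream.
flat = sliding_windows(250, context_len=128, stride=64)
isolated = doc_isolated_windows([100, 150], context_len=128, stride=64)
```

In this toy example the first scored tokens of the second document see no context at all in the isolated variant, whereas the flat variant lets them condition on the tail of the first document.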
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
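A warmdown schedule with `warmdown_steps` of 1200 could look like the sketch below. The card does not state the shape of the decay or the total step count, so the linear decay to zero, the function name `warmdown_scale`, and the `total_steps` value used in the example are all assumptions for illustration.

```python
def warmdown_scale(step, total_steps, warmdown_steps=1200):
    """Learning-rate multiplier: constant, then a linear ramp to zero.

    Assumption: linear decay over the final warmdown_steps steps.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


# Hypothetical run length, chosen only to exercise the function.
TOTAL_STEPS = 20_000
scales = [warmdown_scale(s, TOTAL_STEPS) for s in (0, 18_800, 19_400, 20_000)]
```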
Sequence Length
sequence_length
train_length: null
eval_length: 64
Quantization
int8
bits: 8
scope: all
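The card only records that int8 quantization is applied to all weights. As a point of reference, one common scheme is symmetric per-tensor quantization, sketched here. Whether the PR uses per-tensor scaling, symmetric ranges, or round-to-nearest is not stated, so all of these are assumptions, and `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (assumed scheme).

    Returns the integer codes and the scale needed to recover the floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
```

With round-to-nearest, the per-weight reconstruction error is bounded by half the scale. An averaged checkpoint is only useful here to the extent that it changes this error.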

Novel Contributions

  • Controlled ablation showing SWA does not improve int8 quantization under default warmdown
  • Controlled ablation showing doc-isolated evaluation hurts at stride=64
  • Identification of a stride-dependent crossover where doc-isolation can be harmful at short stride but helpful at longer stride
  • Discovery and fix of a bf16 SWA accumulation bug by accumulating in float32
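The last item can be illustrated with a small stdlib-only experiment showing why summing many snapshots in bfloat16 loses precision while float32 does not. This is not a reproduction of the PR's bug. It emulates bfloat16 rounding by hand (`to_bf16` is a hypothetical helper) and sums 73 copies of a single scalar weight, which is an assumption made to keep the example small.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    bfloat16 is the top 16 bits of a float32, so round and mask the low half.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_with_accumulator(values, round_fn):
    """Sum values, rounding the accumulator after every add, then divide."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc / len(values)


# 73 identical snapshots of one weight, already stored as a bfloat16 value.
snapshot = to_bf16(0.1)
values = [snapshot] * 73

bf16_err = abs(mean_with_accumulator(values, to_bf16) - snapshot)
f32_err = abs(mean_with_accumulator(values, to_f32) - snapshot)
```

Once the running sum grows well beyond the size of a single snapshot, each bfloat16 add rounds away part of the new term, so the error compounds with the number of snapshots. Keeping the accumulator in float32 and casting only the final average avoids this.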