PR #199
closed
Non-record: SWA and doc-isolated eval ablation — two negative findings at stride=64
by mrdavtan
val_bpb
1.1929
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,819,113 bytes
Training Techniques
Weight Averaging
SWA
parameters: {"snapshots":73,"sample_every_steps":50,"start_step":10000,"accumulation_dtype":"float32"}
Evaluation
sliding window eval
parameters: {"stride":64}
doc-isolated sliding window eval
parameters: {"stride":64}
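The difference between the two evaluation modes can be sketched as follows. This is a minimal illustration, not the PR's evaluation code: the function name `scoring_windows`, the `context_len` argument, and the `doc_starts` list are hypothetical, and it assumes that "doc-isolated" means a token's context never reaches back past the start of its own document.

```python
from typing import Iterator, Sequence, Tuple


def scoring_windows(
    n_tokens: int,
    context_len: int,
    stride: int,
    doc_starts: Sequence[int] = (0,),
) -> Iterator[Tuple[int, int, int]]:
    """Yield (ctx_start, score_start, score_end) for each scored chunk.

    Tokens in [score_start, score_end) are scored with context beginning at
    ctx_start. With the default doc_starts=(0,) this is a plain sliding
    window; passing real document offsets gives the doc-isolated variant
    (assumed semantics: context is clipped at the document start).
    """
    bounds = sorted(set(doc_starts) | {0}) + [n_tokens]
    for doc_begin, doc_end in zip(bounds[:-1], bounds[1:]):
        for score_start in range(doc_begin, doc_end, stride):
            score_end = min(score_start + stride, doc_end)
            # Look back up to context_len tokens, but never before doc_begin.
            ctx_start = max(doc_begin, score_end - context_len)
            yield ctx_start, score_start, score_end


# Toy sizes for readability; the PR uses stride=64.
plain = list(scoring_windows(n_tokens=10, context_len=6, stride=2))
isolated = list(scoring_windows(n_tokens=10, context_len=6, stride=2,
                                doc_starts=[0, 6]))
```

In this sketch both modes score every token exactly once; they differ only in how much preceding context the first chunks of each document receive.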
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
Sequence Length
sequence_length
train_length: null
eval_length: 64
Quantization
int8
bits: 8
scope: all
Novel Contributions
- Controlled ablation showing SWA does not improve int8 quantization under default warmdown
- Controlled ablation showing doc-isolated evaluation hurts at stride=64
- Identification of a stride-dependent crossover where doc-isolation can be harmful at short stride but helpful at longer stride
- Discovery and fix of a bf16 SWA accumulation bug by accumulating in float32
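To illustrate why the accumulation dtype matters for the last point, here is a minimal sketch of one way a bf16 running sum can go wrong. It is not the PR's code: the helpers `round_bf16`, `round_f32`, and `swa_average` are hypothetical, bf16 is emulated by rounding a float32 bit pattern to its upper 16 bits, and the toy snapshot values are made up. It assumes the average is kept as a running sum over the 73 snapshots.

```python
import math
import struct


def round_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_bf16(x: float) -> float:
    """Emulate bfloat16: round to float32, then keep the upper 16 bits
    using round-to-nearest-even on the discarded mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def swa_average(snapshots, round_acc) -> float:
    """Average one weight across snapshots, rounding the running sum to the
    accumulator precision after every addition."""
    acc = 0.0
    for w in snapshots:
        acc = round_acc(acc + w)
    return acc / len(snapshots)


# Toy data (an assumption): one weight drifting slowly across 73 snapshots.
snapshots = [round_f32(1.0 + 0.001 * k) for k in range(73)]
reference = math.fsum(snapshots) / len(snapshots)

err_bf16 = abs(swa_average(snapshots, round_bf16) - reference)
err_fp32 = abs(swa_average(snapshots, round_f32) - reference)
print(f"bf16 accumulator error: {err_bf16:.2e}")
print(f"fp32 accumulator error: {err_fp32:.2e}")
```

As the running sum grows, the spacing between representable bf16 values exceeds the small per-snapshot differences, so those differences are rounded away at each addition. A float32 accumulator, as in the `accumulation_dtype` setting above, keeps them.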