PR #199

closed

Non-record: SWA and doc-isolated eval ablation — two negative findings at stride=64

val_bpb

1.1929

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15,819,113 bytes

Training Techniques

Weight Averaging

SWA

parameters: {"snapshots":73,"sample_every_steps":50,"start_step":10000,"accumulation_dtype":"float32"}

Evaluation

sliding window eval

parameters: {"stride":64}

doc-isolated sliding window eval

parameters: {"stride":64}
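The two evaluation modes above can be contrasted with a small sketch. It assumes that "doc-isolated" means the context for a scored token never reaches back past the start of its own document; the PR text does not spell this out. The function names, the `seq_len` value, and the document lengths are hypothetical and chosen only for illustration; only `stride=64` comes from the parameters listed here.

```python
def sliding_windows(length, seq_len, stride, offset=0):
    """Return (context_start, score_start, score_end) triples for one token span.

    Each window scores `stride` new tokens and may look back at most
    `seq_len` tokens in total, never before the start of the span.
    """
    windows = []
    for score_start in range(0, length, stride):
        score_end = min(score_start + stride, length)
        ctx_start = max(0, score_end - seq_len)
        windows.append((offset + ctx_start, offset + score_start, offset + score_end))
    return windows


def eval_windows(doc_lengths, seq_len, stride, doc_isolated):
    """Build evaluation windows over a concatenated token stream.

    doc_isolated=False: one sliding window pass over the whole stream,
    so context may cross document boundaries.
    doc_isolated=True: a separate pass per document, so context restarts
    at every document boundary (assumed meaning of "doc-isolated").
    """
    if not doc_isolated:
        return sliding_windows(sum(doc_lengths), seq_len, stride)
    windows, offset = [], 0
    for n in doc_lengths:
        windows.extend(sliding_windows(n, seq_len, stride, offset))
        offset += n
    return windows


def mean_context(windows):
    """Average number of context tokens preceding each scored chunk."""
    return sum(score_start - ctx for ctx, score_start, _ in windows) / len(windows)


# Hypothetical stream: two documents of 100 and 150 tokens, seq_len=256.
shared = eval_windows([100, 150], seq_len=256, stride=64, doc_isolated=False)
isolated = eval_windows([100, 150], seq_len=256, stride=64, doc_isolated=True)
print(mean_context(shared), mean_context(isolated))
```

In this sketch both modes score every token exactly once; the isolated mode simply gives chunks near the start of each document less preceding context.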

LR Schedule

warmdown

parameters: {"warmdown_steps":1200}

Sequence Length

sequence_length

train_length: null

eval_length: 64

Quantization

int8

bits: 8

scope: all

Controlled ablation showing SWA does not improve int8 quantization under default warmdown
Controlled ablation showing doc-isolated evaluation hurts at stride=64
Identification of a stride-dependent crossover where doc-isolation can be harmful at short stride but helpful at longer stride
Discovery and fix of a bf16 SWA accumulation bug by accumulating in float32
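The last item can be illustrated with a minimal sketch of why a running sum kept in bf16 loses precision while a float32 accumulator does not. This is not the PR's code: the rounding helpers, the `averaged` function, and the snapshot values are hypothetical, bf16 is simulated in pure Python by rounding float32 bit patterns, and the snapshots are assumed to be stored in bf16. Only the snapshot count (73) is taken from the SWA parameters above.

```python
import struct


def round_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of the float32 bit pattern, so we add a
    rounding bias and clear the low 16 bits. NaN/overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def averaged(snapshots, rounder):
    """Average snapshots with a running sum stored in the given precision."""
    total = 0.0
    for w in snapshots:
        total = rounder(total + w)  # accumulator is rounded after every add
    return total / len(snapshots)


# Hypothetical scalar "weight" drifting around 0.1 across 73 bf16 snapshots.
snapshots = [round_bf16(0.1 + 1e-3 * ((i % 7) - 3)) for i in range(73)]
exact = sum(snapshots) / len(snapshots)

err_bf16 = abs(averaged(snapshots, round_bf16) - exact)
err_f32 = abs(averaged(snapshots, round_f32) - exact)
print(f"bf16 accumulator error: {err_bf16:.3e}")
print(f"float32 accumulator error: {err_f32:.3e}")
```

As the running sum grows, the spacing between representable bf16 values grows with it, so each later snapshot is added with coarser rounding; accumulating in float32 and casting once at the end avoids this.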