PR #1021 (open)
Non-record: MC Dropout ensembling is negative for small LMs
by abaybektursun
val_bpb
1.3250
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Regularization
dropout
parameters: {"rates":[0.3,0.05]}
Evaluation
MC Dropout ensembling
parameters: {"k":16}
Novel Contributions
- Evaluated MC Dropout ensembling on a 17M-parameter language model
- Showed that averaging the predictions of 16 dropout samples at inference does not improve validation BPB
- Found that deterministic single-pass inference outperforms MC Dropout at both tested dropout rates (0.3 and 0.05)
- Argued that dropout-induced sub-network diversity is too low at this scale to yield a useful ensemble
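The technique being evaluated can be sketched as follows: leave dropout enabled at inference, run k stochastic forward passes, average the predictive distributions, and score bits-per-byte; the baseline is a single deterministic pass with dropout off. This is a minimal toy illustration, not the PR's actual model or code: the network (one hidden layer), its sizes, and all variable names here are hypothetical stand-ins, and the one-token-per-byte assumption in the BPB computation is a simplification for the toy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-in for a small LM: a single hidden layer with
# dropout on the hidden units (sizes are illustrative, not the PR's 17M model).
V, T, H = 8, 32, 16                    # vocab size, sequence length, hidden width
x  = rng.normal(size=(T, H))           # per-position input features
W1 = rng.normal(size=(H, H)) / np.sqrt(H)
W2 = rng.normal(size=(H, V)) / np.sqrt(H)
tokens = rng.integers(0, V, size=T)    # "observed" tokens to score

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(drop_rng=None, p=0.3):
    """One forward pass. If drop_rng is given, dropout stays ON
    (inverted dropout, so no extra rescaling is needed at eval)."""
    h = np.maximum(x @ W1, 0.0)
    if drop_rng is not None:
        mask = (drop_rng.random(H) >= p) / (1.0 - p)
        h = h * mask
    return h @ W2                      # [T, V] logits

def bpb(probs):
    # Bits per byte: mean negative log2-probability of the observed
    # tokens, assuming (for this toy) one token per byte.
    return float(np.mean(-np.log2(probs[np.arange(T), tokens])))

# Baseline: deterministic single-pass inference, dropout off.
bpb_det = bpb(softmax(forward()))

# MC Dropout ensembling with k=16: average the predictive
# distributions (not the logits) over stochastic passes.
k = 16
mc_probs = np.mean([softmax(forward(drop_rng=rng)) for _ in range(k)], axis=0)
bpb_mc = bpb(mc_probs)

print(f"deterministic BPB: {bpb_det:.3f}  MC-dropout (k={k}) BPB: {bpb_mc:.3f}")
```

Note that the distributions, not the logits, are averaged: each stochastic pass defines a valid predictive distribution, and MC Dropout approximates the posterior predictive by averaging them. On this random toy the two scores carry no information about which method wins; the PR's finding is that on the real model the deterministic pass scores better.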