Poster in Workshop: Setting up ML Evaluation Standards to Accelerate Progress
Are Ground Truth Labels Reproducible? An Empirical Study
Ka Wong · Praveen Paritosh · Kurt Bollacker
Standard evaluation techniques in Machine Learning (ML) and the corresponding statistical inference do not explicitly consider ground truth labels as a source of uncertainty, i.e., they assume that the benchmarks are reproducible. We investigate the reliability of ground truth labels in nine highly cited evaluation datasets. Via replication, we find the majority votes in three datasets to have zero reliability: they cannot be reproduced. The cause of irreproducibility is excessive rater disagreement, as evidenced by the zero inter-rater reliability. Contrary to popular belief, majority voting does not materially improve reliability in this case. We conduct a smaller pilot using raters with high qualifications and find significant improvement in reliability across the board. This suggests that high-quality data collection is still paramount and cannot be replaced by aggregation. We urge researchers, reviewers, and publication processes (such as reproducibility checklists) to encourage the measurement and reporting of reproducibility in the ground truth data used in ML evaluation. Towards this end, we publish and release all the replications and associated data to aid assessment of the reproducibility of this work.
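To make the abstract's claim concrete, here is a minimal, hypothetical sketch of the kind of reproducibility check it describes: collect labels from two independent rater pools for the same items, take majority votes in each pool, and compare the resulting "ground truth" labels with a chance-corrected agreement statistic. The dataset, rater counts, and the use of Cohen's kappa are illustrative assumptions, not the paper's exact protocol or statistic.

```python
# Illustrative sketch (not the authors' exact protocol): do majority-vote labels
# reproduce across two independent rater pools? Data here is simulated.
from collections import Counter

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def majority_vote(ratings_per_item):
    """Return the most common label for each item (ties broken arbitrarily)."""
    return [Counter(r).most_common(1)[0][0] for r in ratings_per_item]

# Hypothetical replication: 200 items, 5 raters per item, binary labels,
# with raters disagreeing at chance level (worst case for aggregation).
n_items, n_raters = 200, 5
pool_a = rng.integers(0, 2, size=(n_items, n_raters))   # original collection
pool_b = rng.integers(0, 2, size=(n_items, n_raters))   # independent replication

labels_a = majority_vote(pool_a)
labels_b = majority_vote(pool_b)

# Chance-corrected agreement between the two sets of majority votes.
# Values near 0 mean the aggregated "ground truth" labels are not reproducible,
# even though raw percent agreement can look deceptively high.
kappa = cohen_kappa_score(labels_a, labels_b)
raw_agreement = np.mean(np.array(labels_a) == np.array(labels_b))
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```

In this simulated worst case the kappa between the two sets of majority votes stays near zero, illustrating the abstract's point that aggregation cannot manufacture reliability when the underlying rater agreement is at chance.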