In-Person Poster presentation / poster accept
Measure the Predictive Heterogeneity
Jiashuo Liu · Jiayun Wu · Renjie Pi · Renzhe Xu · Xingxuan Zhang · Bo Li · Peng Cui
MH1-2-3-4 #139
Keywords: [ Data heterogeneity ] [ predictive heterogeneity ] [ predictive information ] [ Social Aspects of Machine Learning ]
Data heterogeneity is an intrinsic and fundamental property of big data and arises in a variety of real-world applications, such as agriculture, sociology, and health care. For machine learning algorithms, ignoring data heterogeneity significantly hurts generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and first formalize Predictive Heterogeneity, which takes into account model capacity and computational constraints. We prove that it can be reliably estimated from finite data with PAC bounds, even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data. Empirically, the explored predictive heterogeneity provides insights for sub-population divisions in agriculture, sociology, and object recognition, and leveraging such heterogeneity improves out-of-distribution generalization performance.
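
The idea of searching for a sub-population split that makes prediction easier can be illustrated with a toy sketch. This is not the IM algorithm from the paper: it is a simple alternating procedure under stated assumptions, where the model family is linear regression, squared error stands in for predictive information, and the function names (`fit_linear`, `sq_err`, `heterogeneity_gain`) are hypothetical. It alternates between fitting one model per group and reassigning each sample to the group whose model predicts it best, then reports how much the split lowers the error relative to a single pooled model.

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares linear fit with an intercept column appended.
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def sq_err(w, X, y):
    # Per-sample squared error of the linear model w.
    A = np.column_stack([X, np.ones(len(X))])
    return (A @ w - y) ** 2

def heterogeneity_gain(X, y, k=2, iters=20, seed=0):
    # Toy proxy (an assumption of this sketch): error of one pooled model
    # minus error of k group-specific models under a learned split.
    rng = np.random.default_rng(seed)
    pooled = sq_err(fit_linear(X, y), X, y).mean()
    assign = rng.integers(0, k, size=len(y))
    for _ in range(iters):
        ws = []
        for j in range(k):
            mask = assign == j
            # Fall back to all data if a group is too small to fit.
            if mask.sum() < X.shape[1] + 1:
                mask = np.ones(len(y), dtype=bool)
            ws.append(fit_linear(X[mask], y[mask]))
        # Reassign each sample to the model that predicts it best.
        errs = np.stack([sq_err(w, X, y) for w in ws], axis=1)
        new_assign = errs.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    split = errs[np.arange(len(y)), assign].mean()
    return pooled - split, assign

# Synthetic data: two hidden sub-populations with opposite slopes.
rng = np.random.default_rng(1)
x = rng.normal(size=(400, 1))
group = rng.integers(0, 2, size=400)
y = np.where(group == 0, 2.0, -2.0) * x[:, 0] + 0.1 * rng.normal(size=400)

gain, assign = heterogeneity_gain(x, y)
print(f"error reduction from splitting: {gain:.3f}")
```

On data like this, where the two hidden groups follow opposite prediction mechanisms, a pooled model fits poorly and the split should recover a clear error reduction; on homogeneous data the gain would be expected to stay near zero.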