Poster
ADoPD: A Large-Scale Document Page Decomposition Dataset
Jiuxiang Gu · Xiangxi Shi · Jason Kuen · Lu Qi · Ruiyi Zhang · Anqi Liu · Ani Nenkova · Tong Sun
Halle B
We introduce ADoPD, a large-scale document page decomposition dataset for document understanding, covering document entity segmentation, text detection, tagging, and captioning. ADoPD is built around a novel document taxonomy, developed through a data-driven approach that draws on both large-scale pretrained models and human expertise. The dataset achieves diversity by combining outlier detection with a human-in-the-loop approach. Together, data-driven exploration, thorough annotation, and the human-in-the-loop methodology deepen our insight into document structures and support improvements in document processing and analysis techniques and applications. We conduct a comprehensive evaluation of ADoPD using various methods and demonstrate its effectiveness.