

Poster

Measuring Vision-Language STEM Skills of Neural Models

Jianhao Shen · Ye Yuan · Srbuhi Mirzoyan · Ming Zhang · Chenguang Wang

Halle B
Fri 10 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

We introduce a new challenge to test the STEM skills of neural models. Unlike existing datasets, ours requires the understanding of multimodal vision-language information. It is one of the largest and most comprehensive datasets for this challenge, covering 448 skills and 1,073,146 questions spanning all STEM (science, technology, engineering, math) subjects. Whereas existing datasets often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also benchmark state-of-the-art foundation models such as CLIP and ChatGPT on our dataset. Results show that recent model advances help master only a very limited number of lower grade-level skills (2.5% in the third grade). In fact, these models still fall well below (averaging 54.7%) the performance of elementary students, let alone near-expert-level performance. To understand and improve performance on our dataset, we train the models on a training split of it. Although we observe improved performance, the models still lag behind average elementary students. Solving STEM problems will require novel algorithmic innovations from the community. The code and dataset are available at https://anonymous.4open.science/r/STEM-Dataset-ICLR-2024 and will be made publicly available.
