Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)
Instruction-Finetuned Foundation Models for Multimodal Web Navigation
Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur
Keywords: [ Instruction Finetuning ] [ Large Language Models ] [ Foundation Models ] [ Web Navigation ] [ Multimodal Document Understanding ] [ Decision Making ]
We propose an instruction-aligned multimodal agent for autonomous web navigation, i.e., sequential decision-making tasks carried out through a computer interface. Our approach is based on supervised finetuning of vision and language foundation models on a large corpus of web data consisting of webpage screenshots and HTML. Specifically, we use vision transformers on sequences of webpage screenshots to extract patch-level image features. These features are concatenated with embeddings of the tokens in the corresponding HTML documents. Using an instruction-finetuned large language model, we jointly encode both the vision and HTML modalities and decode web actions such as click and type. We show that our method outperforms previous approaches by a significant margin, including on out-of-distribution HTML and compositional tasks. On the MiniWoB benchmark, we improve over previous approaches that use only HTML input by more than 17.7%, even surpassing the performance of RL-finetuned models. On the recent WebShop benchmark, our 3-billion-parameter model outperforms the existing state of the art, PaLM-540B. We also collect 347K gold demonstrations using our trained models, 29 times more than prior work, and make them available to promote future research in this area. We believe that our work is a step towards building capable, generalist decision-making agents for computer interfaces.
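To make the input construction concrete, the following is a minimal sketch of how patch-level screenshot features and HTML token embeddings might be concatenated into a single sequence for an instruction-finetuned encoder-decoder that emits a web action. All names, dimensions, and stand-in functions here (vit_patch_features, embed_html_tokens, encode_and_decode_action, D_MODEL, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical dimensions; the abstract does not specify these values.
D_MODEL = 512          # shared embedding width for both modalities
NUM_PATCHES = 196      # e.g., a 14x14 grid of image patches per screenshot
HTML_VOCAB = 32000     # placeholder tokenizer vocabulary size

rng = np.random.default_rng(0)

def vit_patch_features(screenshot: np.ndarray) -> np.ndarray:
    """Stand-in for a vision transformer: map a screenshot to patch-level features.

    Returns an array of shape (NUM_PATCHES, D_MODEL). A real ViT attends over
    patches; here we only project flattened patches with a random matrix.
    """
    patches = screenshot.reshape(NUM_PATCHES, -1)
    proj = rng.standard_normal((patches.shape[1], D_MODEL)) * 0.02
    return patches @ proj

def embed_html_tokens(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the language model's embedding table applied to HTML tokens."""
    table = rng.standard_normal((HTML_VOCAB, D_MODEL)) * 0.02
    return table[np.asarray(token_ids)]

def encode_and_decode_action(joint_sequence: np.ndarray) -> str:
    """Stand-in for the instruction-finetuned encoder-decoder.

    The real model jointly encodes the concatenated sequence and decodes a
    free-form web action; here we just return a fixed example action string.
    """
    assert joint_sequence.ndim == 2 and joint_sequence.shape[1] == D_MODEL
    return 'click(id="search-button")'

# One decision step: screenshot -> patch features, HTML -> token embeddings,
# concatenate along the sequence axis, then decode an action such as click/type.
screenshot = rng.standard_normal((224, 224, 3))        # dummy screenshot
html_tokens = [101, 2057, 2215, 102]                   # dummy HTML token ids
image_feats = vit_patch_features(screenshot)           # (196, 512)
text_embeds = embed_html_tokens(html_tokens)           # (4, 512)
joint = np.concatenate([image_feats, text_embeds], 0)  # (200, 512)
print(encode_and_decode_action(joint))
```

In this reading, the key design choice from the abstract is that both modalities are mapped into one token sequence so a single instruction-finetuned language model can attend over screenshots and HTML jointly before decoding an action.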