ICLR Poster Understanding Embodied Reference with Touch-Line Transformer

In-Person Poster presentation / poster accept

Yang Li · Xiaoxue Chen · Hao Zhao · Jiangtao Gong · Guyue Zhou · Federico Rossano · Yixin Zhu

MH1-2-3-4 #35

Abstract:

We study embodied reference understanding, the task of locating referents using embodied gestural signals and language references. Human studies have revealed that, contrary to popular belief, objects referred to or pointed to do not lie on the elbow-wrist line, but rather on the so-called virtual touch line. Nevertheless, contemporary human pose representations lack the virtual touch line. To tackle this problem, we devise the touch-line Transformer: it takes tokenized visual and textual features as input and simultaneously predicts the referent’s bounding box and a touch-line vector. Leveraging this touch-line prior, we further devise a geometric consistency loss that promotes co-linearity between referents and touch lines. Using the touch line as gestural information dramatically improves model performance: experiments on the YouRefIt dataset demonstrate that our method yields a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the performance gap between models and humans. Furthermore, we computationally validate prior human studies by demonstrating that computational models locate referents more accurately when employing the virtual touch line than when using the elbow-wrist line.
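
To make the geometric consistency idea concrete, the following is a minimal sketch of one plausible co-linearity penalty. The abstract does not give the formula, so everything here is an assumption for illustration: the function name `collinearity_loss`, the 2D image-coordinate inputs, the use of the box center as the referent location, and the choice of one minus cosine similarity as the penalty. The touch line is assumed to be given by a start point and an end point (for example, eye and fingertip).

```python
import math


def collinearity_loss(line_start, line_end, box):
    """Hypothetical co-linearity penalty (illustrative, not the paper's exact loss).

    line_start, line_end: (x, y) endpoints of the predicted touch line.
    box: (x_min, y_min, x_max, y_max) of the predicted referent box.
    Returns 1 - cos(angle) between the touch-line direction and the
    direction from line_start to the box center: 0 when the center lies
    on the ray extending the touch line, up to 2 when it lies behind it.
    """
    # Referent location is approximated by the box center (assumption).
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0

    # Direction of the touch line.
    tx = line_end[0] - line_start[0]
    ty = line_end[1] - line_start[1]

    # Direction from the start of the touch line to the referent.
    rx = cx - line_start[0]
    ry = cy - line_start[1]

    norm = math.hypot(tx, ty) * math.hypot(rx, ry)
    if norm == 0.0:
        raise ValueError("touch line or referent direction has zero length")

    cosine = (tx * rx + ty * ry) / norm
    return 1.0 - cosine


if __name__ == "__main__":
    # Box center (3, 3) lies on the extension of the line (0, 0) -> (1, 1).
    print(collinearity_loss((0, 0), (1, 1), (2, 2, 4, 4)))
    # Box center (-3, 3) is perpendicular to that line.
    print(collinearity_loss((0, 0), (1, 1), (-4, 2, -2, 4)))
```

In a trainable model the same quantity would be computed on tensors so that gradients flow to both the predicted box and the predicted touch line; the scalar version above only illustrates the geometry.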
