Poster
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
Eslam Abdelrahman · Mohamed Ayman Mohamed · Mahmoud Ahmed · Habib · Mohamed Elhoseiny
Halle B
3D visual grounding is the ability to localize objects in 3D scenes conditioned on an input utterance. Most existing methods devote the referring head to localizing the referred object directly. However, this approach fails in complex scenarios and does not illustrate how and why the network reaches its final decision. In this paper, we address the question: "Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system?" To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of anchors and then utilizing them to predict the final target. Following this chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, which in turn boosts performance and makes our framework extremely data-efficient. Interpretability not only improves the overall performance but also helps us identify failure cases. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D and Sr3D benchmarks and show consistent performance gains over existing methods without requiring any manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient: when trained on only 10% of the data, it matches the SOTA performance of models trained on the entire dataset. The code is available at https://cot3dref.github.io/.
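To make the chain-of-anchors formulation concrete, below is a minimal illustrative sketch (not the authors' implementation) of a Seq2Seq decoder that, given per-object proposal features and an utterance embedding, predicts a sequence of object indices: intermediate anchors followed by the final target. All class names, dimensions, and the chain length are illustrative assumptions.

```python
# Illustrative sketch of a chain-of-anchors Seq2Seq decoder (assumed design,
# not the paper's code). Each decoding step points at one object proposal;
# the last step is the referred target, earlier steps are anchors.
import torch
import torch.nn as nn

class ChainOfAnchorsDecoder(nn.Module):
    def __init__(self, d_model=256, num_layers=2, max_chain_len=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # One learned query per chain step (anchors ..., target).
        self.query_embed = nn.Embedding(max_chain_len, d_model)

    def forward(self, object_feats, lang_feat):
        """
        object_feats: (B, N, d) fused per-object proposal features
        lang_feat:    (B, 1, d) pooled utterance embedding
        Returns logits over the N objects for each of the chain steps.
        """
        B, N, d = object_feats.shape
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)  # (B, L, d)
        memory = torch.cat([lang_feat, object_feats], dim=1)              # (B, 1+N, d)
        hidden = self.decoder(queries, memory)                            # (B, L, d)
        # Each step selects one of the N proposals via dot-product scores.
        logits = torch.einsum("bld,bnd->bln", hidden, object_feats)       # (B, L, N)
        return logits

# Example: 8 scenes, 52 proposals, 256-d features; last step is the target.
decoder = ChainOfAnchorsDecoder()
logits = decoder(torch.randn(8, 52, 256), torch.randn(8, 1, 256))
anchors_and_target = logits.argmax(-1)  # (8, 4) predicted chain per scene
```

Because the chain head only consumes object features and an utterance embedding, a module of this kind can in principle be bolted onto an existing grounding architecture, which is the integration property the abstract highlights.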