Binary Visual Question Answering using Transformers with raw inputs

In this work, we introduce the Visual Question Answering task and a balanced binary visual-question-answering dataset. We propose two models. The first, used as a baseline, is a latent joint-embedding model that uses Transformer networks to embed the visual and textual parts of the question. The second, our main model, is an attention model that also uses Transformer networks as its backbone; it outperforms the joint-embedding baseline and additionally exposes an attention mask that can be visualized to show where the model looks with respect to the question. Finally, we provide visualizations of our model on the test set, showing which parts of the image the model attends to when answering each question.
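To make the described setup concrete, the sketch below shows one plausible way to combine a Transformer-encoded question with Transformer-encoded image patches through cross-attention and a binary (yes/no) classifier; the layer sizes, patch embedding, pooling, and the BinaryVQASketch class itself are illustrative assumptions, not the exact architecture proposed in this work.

```python
# A minimal, hypothetical sketch of Transformer-based binary VQA with
# question-to-image cross-attention; all dimensions here are assumptions.
import torch
import torch.nn as nn


class BinaryVQASketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Textual branch: token embedding + Transformer encoder over the question.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        text_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(text_layer, n_layers)

        # Visual branch: flattened 16x16x3 raw image patches projected to tokens.
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)
        img_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(img_layer, n_layers)

        # Question-to-image cross-attention; its weights can be visualized
        # as an attention mask over image patches for a given question.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # Binary (yes/no) classifier over the pooled fused representation.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, question_ids, image_patches):
        # question_ids: (batch, seq_len); image_patches: (batch, n_patches, 768)
        q = self.text_encoder(self.token_embed(question_ids))
        v = self.image_encoder(self.patch_embed(image_patches))
        fused, attn_weights = self.cross_attn(q, v, v)   # question attends over patches
        logit = self.classifier(fused.mean(dim=1))       # pool over question tokens
        return logit.squeeze(-1), attn_weights           # weights give the attention mask


if __name__ == "__main__":
    model = BinaryVQASketch()
    q = torch.randint(0, 10000, (2, 8))          # dummy question token ids
    patches = torch.randn(2, 196, 16 * 16 * 3)   # dummy 14x14 grid of raw patches
    logits, attn = model(q, patches)
    print(logits.shape, attn.shape)              # torch.Size([2]) torch.Size([2, 8, 196])
```

Reshaping the returned attention weights back onto the 14x14 patch grid gives a heat map of the image regions consulted for each question token, which is the kind of visualization the abstract refers to.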