Multimodal Offensive Meme Classification using Transformers and BiLSTM
Roshan Nayak¹, B S Ullas Kannantha², Kruthi S³, C. Gururaj⁴

¹Roshan Nayak*, Department of Electronics and Communication Engineering, B.M.S. College of Engineering, Bengaluru, India.
²B S Ullas Kannantha, Department of Electronics and Instrumentation Engineering, B.M.S. College of Engineering, Bengaluru, India.
³Kruthi S, Department of Electronics and Instrumentation Engineering, B.M.S. College of Engineering, Bengaluru, India.
⁴C. Gururaj, Senior Member IEEE, Department of Electronics and Telecommunication, B.M.S. College of Engineering, Bengaluru, India.
Manuscript received on January 31, 2022. | Revised Manuscript received on February 05, 2022. | Manuscript published on February 28, 2022. | PP: 96-102 | Volume-11 Issue-3, February 2022. | Retrieval Number: 100.1/ijeat.C33920211322 | DOI: 10.35940/ijeat.C3392.0211322
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Memes have become a common way for people to express their ideas on social media. They can convey a wide range of views, including offensive ones: a meme may be intended as a personal attack, homophobic abuse, racial abuse, or an attack on a minority. Memes are implicit and multimodal in nature. Here we analyze memes by categorizing them as offensive or not offensive, which makes this a binary classification problem. We propose a novel offensive meme classification approach that uses a transformer-based image encoder, a BiLSTM with mean pooling as the text encoder, and a feed-forward network as the classification head. SwinT + BiLSTM performed better than ViT + BiLSTM across all dimensions. The performance of the models improved significantly when contextual embeddings from DistilBERT replaced the custom embeddings. We achieved the highest recall of 0.631 by combining the outputs of four models using soft voting.
Keywords: Offensive Meme Classification; BiLSTM; Transformer; Pooling; Confusion Matrix
Scope of the Article: Classification
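
The soft voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the four models are weighted or what decision threshold is applied, so the equal weighting, the 0.5 threshold, the function name soft_vote, and the example probabilities below are all assumptions made for illustration, not the authors' implementation.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model probabilities of the 'offensive' class.

    prob_lists: one list per model, each holding that model's predicted
    probability of 'offensive' for every meme (same order in all lists).
    Equal weights and the 0.5 threshold are assumptions of this sketch.
    """
    if not prob_lists:
        raise ValueError("need probabilities from at least one model")
    n_samples = len(prob_lists[0])
    if any(len(probs) != n_samples for probs in prob_lists):
        raise ValueError("all models must score the same number of memes")

    n_models = len(prob_lists)
    # Mean probability per meme across all models.
    mean_probs = [
        sum(probs[i] for probs in prob_lists) / n_models
        for i in range(n_samples)
    ]
    # Label 1 = offensive, 0 = not offensive.
    labels = [1 if p >= threshold else 0 for p in mean_probs]
    return mean_probs, labels


# Hypothetical probabilities from four models for three memes.
model_outputs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.4],
    [0.8, 0.1, 0.3],
    [0.6, 0.3, 0.5],
]
mean_probs, labels = soft_vote(model_outputs)
print(labels)  # [1, 0, 0]
```

Unlike hard (majority) voting, averaging probabilities lets a confident model outweigh several uncertain ones. Lowering the threshold below 0.5 would generally trade precision for recall.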