TCE-DBF: textual context enhanced dynamic bimodal fusion for hate video detection

Haitao Xiong, Wei Jiao, Yuanyuan Cai
Data Technologies and Applications, Vol. 59, No. 2, pp.201-215

The spread of hate videos on social media platforms can cause serious harm to both society and individuals. However, the characteristics of hate videos make the detection task difficult: hateful content is usually presented in a relatively covert manner, and textual content within videos plays an important role in its detection. In this work, we propose a textual context enhanced dynamic bimodal fusion (TCE-DBF) method for hate video detection.

The proposed TCE-DBF method introduces a dynamic modality gate (DMG) and a bimodal fusion transformer network to dynamically integrate multiple modalities. Moreover, to enhance the textual modality in videos, two types of textual context from the video are taken as input to TCE-DBF: one is extracted from video frames in the visual modality, and the other is extracted from audio in the acoustic modality. Specifically, TCE-DBF splits the original audio and learns a sequence representation to capture acoustic temporal information.
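The abstract does not specify how the dynamic modality gate computes its weights, but the core idea of dynamically weighting and combining modality features can be sketched as follows. This is a minimal illustration, assuming a simple learned scoring vector and softmax normalization; the function and parameter names (`dynamic_modality_gate`, `gate_w`) are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dynamic_modality_gate(features, gate_w):
    """Sketch of a dynamic modality gate (assumed form, not the paper's exact design).

    features: dict mapping modality name -> feature vector of dimension d
    gate_w:   (d,) scoring vector standing in for the learned gate parameters

    Returns the fused feature vector and the per-modality weights.
    """
    names = list(features.keys())
    # Score each modality, then normalize scores into dynamic weights.
    scores = np.array([features[n] @ gate_w for n in names])
    weights = softmax(scores)
    # Fuse modalities by their dynamically assigned weights.
    fused = sum(w * features[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))

# Toy usage with the three modalities the paper mentions.
features = {
    "text": np.ones(4),
    "visual": np.full(4, 0.5),
    "audio": -np.ones(4),
}
fused, weights = dynamic_modality_gate(features, gate_w=np.ones(4))
```

The weights sum to one, so a modality that the gate scores highly (here, text) dominates the fused representation, mirroring the paper's finding that the textual modality carries the most signal.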

Hate video detection has become a hotspot in recent research, yet it still faces two serious challenges. The first is that hateful content in videos is presented across multiple modalities. The second is how to evaluate the importance of different modalities for multimodal fusion modeling. TCE-DBF aims to tackle these challenges. Experimental results on the hate video dataset HateMM demonstrate that TCE-DBF outperforms state-of-the-art methods, and the visualization results show that the textual modality plays a more important role in hate video detection. It is therefore vital to consider the text appearing in videos.

TCE-DBF can be used to effectively detect hate videos on social media. Besides the transcript, TCE-DBF considers text in video frames, which makes detection more accurate. Meanwhile, to better achieve multimodal fusion, TCE-DBF uses the DMG and a bimodal fusion transformer network to dynamically assign weights to the three modalities and integrate them. TCE-DBF is novel in capturing multimodal features, enhancing the textual modality, and achieving high detection performance for hate video detection.
