Audio-Visual Kinship Verification in the Wild

Kinship verification is a challenging problem in which recognition systems are trained to establish a kin relation between two individuals based on facial images or videos. However, due to variations in capture conditions (background, pose, expression, illumination and occlusion), state-of-the-art systems currently provide a low level of accuracy. As in many visual recognition and affective computing applications, kinship verification may benefit from a combination of discriminant information extracted from both video and audio signals. In this paper, we investigate for the first time the fusion of audio-visual information from both face and voice modalities to improve kinship verification accuracy. First, we propose a new multi-modal kinship dataset called TALking KINship (TALKIN), which comprises several pairs of video sequences with subjects talking. State-of-the-art conventional and deep learning models are then assessed and compared for kinship verification using this dataset. Finally, we propose a deep Siamese network for multi-modal fusion of kinship relations. Experiments with the TALKIN dataset indicate that the proposed Siamese network provides a significantly higher level of accuracy than baseline uni-modal and multi-modal fusion techniques for kinship verification. Results also indicate that audio (vocal) information is complementary and useful for the kinship verification problem.

Wu Xiaoting, Granger Eric, Kinnunen Tomi H., Feng Xiaoyi, Hadid Abdenour

A4 Article in conference proceedings

2019 International Conference on Biometrics, ICB 2019

X. Wu, E. Granger, T. H. Kinnunen, X. Feng and A. Hadid, "Audio-Visual Kinship Verification in the Wild," 2019 International Conference on Biometrics (ICB), Crete, Greece, 2019, pp. 1-8, doi: 10.1109/ICB45273.2019.8987241