Published in 15th International Conference on Digital Audio Effects, 2012
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs multiple speaker localisation using the Skeleton method, energy weighting, and precedence effect filtering and weighting. The video modality performs active speaker detection by analysing the lip region of the detected speakers. The audio modality alone suffers from limited localisation accuracy, while the video modality alone suffers from false detections. The estimation results of both modalities are represented as probabilities in the azimuth domain, and a Gaussian fusion method is proposed to combine the estimates at a late stage. As a consequence, localisation accuracy and robustness are significantly increased compared with either modality alone. Experimental results in different scenarios confirm the improved performance of the proposed method.
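The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common way to combine two Gaussian estimates at a late stage, not the paper's method. It assumes each modality reports a single Gaussian over azimuth (mean and variance, in degrees) for a given speaker, that the two estimates are independent, and that the azimuth range is narrow enough to ignore wrap-around. The function name `fuse_gaussians` and all numeric values are hypothetical.

```python
def fuse_gaussians(mu_audio, var_audio, mu_video, var_video):
    """Fuse two independent Gaussian azimuth estimates (illustrative only).

    Uses the normalised product of two Gaussians, which is again a Gaussian
    whose mean is the precision-weighted average of the input means.
    This is an assumed fusion rule, not necessarily the one in the paper.
    """
    if var_audio <= 0 or var_video <= 0:
        raise ValueError("variances must be positive")

    # Precision is the inverse variance: a sharper estimate gets more weight.
    prec_audio = 1.0 / var_audio
    prec_video = 1.0 / var_video
    prec_fused = prec_audio + prec_video

    mu_fused = (mu_audio * prec_audio + mu_video * prec_video) / prec_fused
    var_fused = 1.0 / prec_fused
    return mu_fused, var_fused


# Hypothetical example: a broad audio estimate and a sharper video estimate.
mu, var = fuse_gaussians(mu_audio=20.0, var_audio=100.0,
                         mu_video=25.0, var_video=25.0)
print(f"fused azimuth: {mu:.2f} deg, variance: {var:.2f} deg^2")
```

Under these assumptions the fused variance is never larger than the smaller of the two input variances, which is one way to read the claim that fusion improves on either modality alone. A false video detection would still need to be rejected before fusion, for example by gating on the distance between the two means.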