Joint Audio-Video Driven Facial Animation

Abstract: Automatic facial animation is a research topic of broad and current interest with widespread impact on various applications. In this paper, we present a novel joint audio-video driven facial animation system. Unlike traditional methods, we incorporate a large vocabulary continuous speech recognition (LVCSR) system to obtain phoneme alignments. The use of LVCSR reduces the high error rate associated with the traditional phoneme recognizer. We also introduce a knowledge guided 3D blendshapes modeling for each phoneme to avoid collecting training data and introducing bias from computer vision generated targets. To further improve the quality, we adopt video tracking and jointly optimize the facial animation by combin- ing both sources. In the evaluations, we present both objective study and several subjective studies on three settings: audio-driven, video-driven, and joint audio-video driven. We find that the quality of our proposed system’s facial animation generation surpasses that of the recent state-of-the-art systems.

Index Terms: large vocabulary continuous speech recognition (LVCSR), phoneme alignment, lip sync, facial animation.