Single Source One Shot Reenactment using Weighted motion From Paired Feature Points
Soumya Tripathy, Juho Kannala and Esa Rahtu
Code coming soon
1 Abstract
Image reenactment is a task where the target object in the source image imitates the motion represented in the driving image. One of the most common reenactment tasks is face image animation. The major challenge in current face reenactment approaches is to distinguish between facial motion and identity. For this reason, previous models struggle to produce high-quality animations if the driving and source identities are different (cross-person reenactment). We propose a new (face) reenactment model that learns shape-independent motion features in a self-supervised setup. The motion is represented using a set of paired feature points extracted from the source and driving images simultaneously. The model generalises to multiple reenactment tasks, including faces and non-face objects, using only a single source image. Extensive experiments show that the model faithfully transfers the driving motion to the source while keeping the source identity intact.
2 Key Idea
Face landmark or keypoint based models1, 2 generate high-quality talking heads for self-reenactment, but often fail in cross-person reenactment, where the source and driving images have different identities. The main reason is that landmarks/keypoints are person-specific and carry facial shape information in the form of pose-independent head geometry. Any difference in shape between the source and driving heads is therefore reflected in the facial motion (through the landmarks or keypoints) and leads to a talking head that cannot faithfully retain the identity of the source person. This effect can be seen in Figure 1 for faces and in Figure 3 for non-face objects using a keypoint based reenactment model such as FOM 1. Furthermore, these models use each keypoint independently to drive the motion of its neighbourhood pixels, which makes the output highly dependent on the quality of the keypoints or landmarks. Any noisy keypoint prediction may severely distort the facial shape and thereby generate low-quality talking heads of the source, as shown in Figure 1.
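To make both issues concrete, the per-keypoint local motion used by FOM 1 can be written, in a slightly simplified form that folds out its reference frame, as

\[
\mathcal{T}_{S \leftarrow D}(z) \;\approx\; p_k^{S} + J_k \,\bigl(z - p_k^{D}\bigr), \qquad z \in \mathcal{N}\!\bigl(p_k^{D}\bigr),
\]

where \(p_k^{S}\) and \(p_k^{D}\) are the k-th keypoints detected independently in the source and driving frames and \(J_k\) is a local Jacobian. Since the two sets of keypoints are never predicted jointly, the offset \(p_k^{S} - J_k\,p_k^{D}\) absorbs whatever geometric difference exists between the two heads, so shape leaks into the motion. Likewise, perturbing a single driving keypoint by \(\Delta\) shifts every sampled location in that keypoint's neighbourhood by an amount proportional to \(\Delta\), which is exactly the degradation shown in Reenactment-2 of Figure 1.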
Figure 1: Illustration of drawbacks in keypoint/landmark based reenactment models. In both cases, the reenactment is performed using FOM and the keypoints are drawn on the source and driving images. In Reenactment-1, the head-structure difference between the source and driving is reflected in the output (bottom image), as the source's facial structure and identity are distorted. In Reenactment-2, one of the keypoints (in the red box) is slightly displaced manually from its original position to show its effect on the output. The degradation in the output quality shows that the overall system performance is highly dependent on the keypoint detectors.
Considering these issues, we propose a new (face) reenactment model that learns shape-independent motion features in a self-supervised setup. The motion is represented using a set of paired feature points extracted from the source and driving images simultaneously, as sketched below. The model generalises to multiple reenactment tasks, including faces and non-face objects, using only a single source image. The complete block diagram of our model is given in Figure 2 and a demo is presented in the YouTube video attached to this page. Extensive experiments show that the model faithfully transfers the driving motion to the source while keeping the source identity intact. For further details, please refer to our paper.
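To give a feel for how paired feature points can drive the warping, below is a minimal PyTorch sketch of one plausible way to turn the pairs into a dense backward-warping field: each pair contributes the displacement between its driving and source points, and the contributions are blended with soft weights. The function name, the Gaussian-softmax weighting and all shapes are illustrative assumptions, not the paper's exact formulation or released code.

import torch

def warp_grid_from_pairs(src_pts, drv_pts, height, width, sigma=0.1):
    # src_pts, drv_pts: (B, K, 2) paired feature points in [-1, 1] coordinates;
    # pair k in the source corresponds to pair k in the driving image.
    B, K, _ = src_pts.shape
    # Identity sampling grid over the output image, shape (1, H, W, 2), (x, y) order.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, height),
                            torch.linspace(-1, 1, width), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)
    # Backward displacement: pixels near a driving point should sample the
    # corresponding source point, so the flow points from driving to source.
    disp = src_pts - drv_pts                                               # (B, K, 2)
    # Soft weights: each output pixel follows the pair whose driving point is closest.
    d2 = ((grid.unsqueeze(3) - drv_pts.view(B, 1, 1, K, 2)) ** 2).sum(-1)  # (B, H, W, K)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)                      # (B, H, W, K)
    # Weighted motion field added to the identity grid; feed to F.grid_sample.
    flow = (w.unsqueeze(-1) * disp.view(B, 1, 1, K, 2)).sum(dim=3)         # (B, H, W, 2)
    return grid + flow

src = torch.rand(2, 10, 2) * 2 - 1
drv = src + 0.05 * torch.randn(2, 10, 2)
print(warp_grid_from_pairs(src, drv, 64, 64).shape)   # torch.Size([2, 64, 64, 2])

In practice the per-pair weights could also come from learned confidences rather than distances; the point of the sketch is only that the motion is defined by source-driving point pairs rather than by either set of points alone.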
Figure 2: The complete block diagram of the proposed reenactment model. It processes the source and driving images in five steps: 1) encoding the images using an image embedder, 2) extracting paired feature points using a transformer, 3) estimating the motion from the paired feature points, 4) converting the motion to a warping field, and 5) using the source image with the warping field in the generator to produce the final output. The transformer module is expanded at the bottom to showcase its building blocks in detail.
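For readers who prefer pseudocode to the diagram, the five steps can be wired together roughly as follows. This is only a structural sketch: every module name and interface below is a hypothetical placeholder (the actual architectures are described in the paper), and concrete submodules must be supplied before it can run.

import torch.nn as nn

class Reenactor(nn.Module):
    def __init__(self, embedder, pair_transformer, motion_net, warp_net, generator):
        super().__init__()
        self.embedder = embedder                    # step 1: shared image embedder
        self.pair_transformer = pair_transformer    # step 2: predicts paired feature points
        self.motion_net = motion_net                # step 3: motion from the point pairs
        self.warp_net = warp_net                    # step 4: motion -> dense warping field
        self.generator = generator                  # step 5: warped source -> output frame

    def forward(self, source, driving):
        f_src = self.embedder(source)
        f_drv = self.embedder(driving)
        src_pts, drv_pts = self.pair_transformer(f_src, f_drv)   # points come out paired
        motion = self.motion_net(src_pts, drv_pts)
        warp_field = self.warp_net(motion, source.shape[-2:])
        return self.generator(source, warp_field)

The essential design choice the sketch tries to convey is that the transformer sees source and driving features together and emits the feature points already in correspondence, instead of detecting keypoints in each image independently.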
3 Reenacting non-face objects
The proposed formulation does not make any assumptions about the reenacted object type. Therefore, the same model can also be trained, without modifications, to reenact objects other than faces.
Figure 3: Qualitative comparison of the proposed model with FOM on a. Tai-chi-HD and b. MGif datasets. Unlike FOM, our model keeps the source shape and the driver's motion intact in the output. More results can be seen in the video.
3.1 Some Examples on the BAIR Action-Conditioned Robot Pushing Dataset
4 Citation
If you find this work useful in your research, please cite us as:
@misc{tripathy2021single,
  title={Single Source One Shot Reenactment using Weighted motion From Paired Feature Points},
  author={Soumya Tripathy and Juho Kannala and Esa Rahtu},
  year={2021},
  eprint={2104.03117},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
5 Other related works
@InProceedings{Tripathy_2021_WACV,
  author = {Tripathy, Soumya and Kannala, Juho and Rahtu, Esa},
  title = {FACEGAN: Facial Attribute Controllable rEenactment GAN},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year = {2021}
}

@InProceedings{Tripathy_2020_WACV,
  author = {Tripathy, Soumya and Kannala, Juho and Rahtu, Esa},
  title = {ICface: Interpretable and Controllable Face Reenactment Using GANs},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month = {March},
  year = {2020}
}
Footnotes:
1. Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Conference on Neural Information Processing Systems (NeurIPS). (2019)
2. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 9459–9468.