[visionlist] PhD scholarship : Directing virtual actors by interaction and mutual imitation

Remi Ronfard remi.ronfard at inria.fr
Tue May 15 12:03:33 GMT 2012

We are selecting candidates for a PhD scholarship as part of LABEX 
PERSYVAL in Grenoble, France, on the topic of Directing virtual actors 
by interaction and mutual imitation.

This is a joint project between the IMAGINE team at INRIA and the 
GIPSA-LAB at Grenoble University.

The PhD topic requires a strong background in machine learning and 
computer graphics, an excellent academic record, and good programming skills.

Interested students should send their curriculum vitae before May 31 to:

Gérard BAILLY, GIPSA-Lab, Gerard.bailly at gipsa-lab.grenoble-inp.fr, tel. 334 76 57 47 11
Rémi RONFARD, LJK, Remi.ronfard at inria.fr, tel. 334 76 61 53 94

The scholarship will be awarded on June 30, based on the quality of the 
applications.


The challenge of this project is to propose a system that allows a 
director to control and modify the performance of a virtual actor by 
demonstration. The system runs an action-perception loop driven by video 
input: the director plays the scene in front of a camera, and the system 
analyzes his diction, facial expressions and head movements. The system 
then creates a virtual copy of his performance by chaining statistical 
models that drive a virtual character with multimodal speech synthesis 
and talking-head animation, imitating the director. In general, this 
first attempt will not correspond exactly to the sought effect. The 
director then repeats the sequence, changing his speech and gestures to 
be better understood. He can also give the system rewards (better, 
worse) and indications (faster, quieter). The system reaches the desired 
result iteratively, through a series of interactions in which it tries 
to learn the required sequence of actions and how to parameterize them.
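
A minimal sketch of the reward-and-indication loop described above, in 
Python. Everything in it is an assumption made for illustration: the two 
performance parameters (rate, loudness), the simulated director, and the 
step-halving update are hypothetical stand-ins for the statistical 
models the thesis would actually develop.

```python
from dataclasses import dataclass


@dataclass
class Performance:
    rate: float      # hypothetical speech-rate parameter
    loudness: float  # hypothetical loudness parameter


def director_feedback(perf, target, previous_error):
    """Simulated director: compare the performance with the sought effect."""
    error = abs(perf.rate - target.rate) + abs(perf.loudness - target.loudness)
    # Reward: coarse judgment relative to the previous attempt.
    reward = "better" if error < previous_error else "worse"
    # Indications: one directional hint per parameter.
    hints = {
        "rate": "faster" if perf.rate < target.rate else "slower",
        "loudness": "louder" if perf.loudness < target.loudness else "quieter",
    }
    return reward, hints, error


def rehearse(target, start, step=0.5, tolerance=0.05, max_rounds=50):
    """Repeat the scene until the director is satisfied or rounds run out."""
    perf = Performance(start.rate, start.loudness)
    error = float("inf")
    for round_no in range(1, max_rounds + 1):
        reward, hints, error = director_feedback(perf, target, error)
        if error <= tolerance:
            return perf, round_no
        if reward == "worse":
            step /= 2  # be more cautious after an unsuccessful imitation
        perf.rate += step if hints["rate"] == "faster" else -step
        perf.loudness += step if hints["loudness"] == "louder" else -step
    return perf, max_rounds


if __name__ == "__main__":
    final, rounds = rehearse(Performance(2.0, 1.0), Performance(0.0, 0.0))
    print(final, rounds)
```

The sketch only adjusts two scalar parameters; it does not attempt the 
harder part of the problem, which is learning the sequence of actions 
itself.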

The originality of the project is to consider the behavior of the two 
interlocutors (multimodal movements of the head and eyes, facial 
expressions and speech) as coupled systems and to study the rhythmic 
coordination of all these gestures.

The model of the virtual actor that we will develop will behave as a 
kind of Eliza Doolittle when she repeats her diction exercises to 
reproduce (with difficulty) the instructions of Professor Higgins (My 
Fair Lady). Another useful reference to motivate and illustrate this 
project is that of theater exercises, where the same phrase is repeated 
with every possible intonation: Marcel Pagnol's Le Schpountz provides a 
familiar example, when Fernandel repeats the sentence "Anyone sentenced 
to death will have his head cut off" with a large series of expressive 
attitudes, from fear, incredulity and disgust to sarcasm and doubt.

It is assumed that the virtual actor has a rich inventory of multimodal 
behavior and can express a wide variety of mental states. The thesis 
will thus begin with a formal modeling of a large part of the 412 
emotions organized into 24 functional groups by S. Baron-Cohen 
(Baron-Cohen 2008). This inventory, originally performed by six English 
speakers, will be adapted to French and played by a professional human 
actor. A 3D virtual clone will be endowed with the ability to reproduce 
those emotions on any utterance, based on a multimodal decomposition of 
these behaviors into elementary gestures.

The aim of this thesis is to orchestrate these gestures (selection, 
sequencing, phasing and gradience) and to extend the inventory so as to 
mimic the communicative intentions of a director of actors, who operates 
through demonstration and reward.

Scientific challenges:
The scientific challenges are twofold: technological and cognitive.
Technological challenges include the analysis and synthesis of 
multimodal behavior, statistical modeling (including stochastic 
processes such as POMDPs (Young 2010; Jurcicek et al. 2011)), and 
learning by demonstration. A central issue is the orchestration of the 
different dimensions of the animation, both in the sequencing of actions 
(see for example the ability of CRFs (Sutton and McCallum 2006) to learn 
complex syntactic relations) and in the fine tuning of kinematic 
trajectories (see for example the so-called trajectory hidden Markov 
models introduced by Tokuda et al. (Zen et al. 2004; Zen et al. 2011)). 
The aim of the project is to model the coupling between director and 
actor so as to take both satisfactory attempts and unsuccessful 
imitations directly into account.

Cognitive challenges concern the study and modeling of perception-action 
links. According to Gallese et al. (Gallese et al. 1996; Gallese and 
Goldman 1998), the activity of mirror neurons (MNs) in the brain of a 
primate or human observer performs an automatic motor simulation of the 
movements made by an agent, in order to represent that agent's intended 
actions. Under this hypothesis, MNs support "mindreading" (a concept 
introduced by S. Baron-Cohen, cited above), that is to say, the 
psychological understanding of others (or mentalization): the ability to 
"read" the minds of others and thus to represent their mental states in 
order to understand and predict their behavior. The project aims, 
through the analysis of loops of reciprocal imitation and targeted 
questionnaires, to study the mental simulations and behavioral 
strategies that human or virtual observers use to discover and exploit 
the mind-reading capabilities of their conversational partner.

The global scientific challenge is to provide a comprehensive cognitive 
architecture that can simulate the dynamic coupling between a real human 
and a virtual human in the context of multimodal interactions. The 
originality of this architecture is to account for phenomena of 
synchrony, imitation and variability specific to human interactions in 
order to increase the credibility of the humanoid.


Bailly, G., O. Govokhina, F. Elisei and G. Breton (2009). "Lip-synching 
using speaker-specific articulation, shape and appearance models." 
EURASIP Journal on Audio, Speech, and Music Processing. Special issue on 
"Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial 
Animation" ID 769494: 11 pages.
Bailly, G. and B. Holm (2005). "SFC: a trainable prosodic model." Speech 
Communication 46(3-4): 348-364.
Baron-Cohen, S. (2008). Mind Reading: The Interactive Guide to Emotions. 
London, Jessica Kingsley Publishers: 29 pages.
Bérar, M., G. Bailly, M. Chabanas, M. Desvignes, F. Elisei, M. Odisio 
and Y. Payan (2006). Towards a generic talking head. Towards a better 
understanding of speech production processes. J. Harrington and M. 
Tabain. New York, Psychology Press: 341-362.
Gallese, V., L. Fadiga, L. Fogassi and G. Rizzolatti (1996). "Action 
recognition in the premotor cortex." Brain 119: 593-609.
Gallese, V. and A. I. Goldman (1998). "Mirror neurons and the 
simulation theory of mindreading." Trends in Cognitive Sciences 2(12): 
Gao, X., Y. Su, X. Li and D. Tao (2010). "A Review of Active Appearance 
Models." IEEE Transactions on Systems Man and Cybernetics 40(2): 145-158.
Jurcicek, F., B. Thomson and S. Young (2011). "Natural Actor and Belief 
Critic: Reinforcement algorithm for learning parameters of dialogue 
systems modelled as POMDPs." ACM Transactions on Speech and Language 
Processing 7(3).
Lelong, A. and G. Bailly (2011). Study of the phenomenon of phonetic 
convergence thanks to speech dominoes. Analysis of Verbal and Nonverbal 
Communication and Enactment: The Processing Issue. A. Esposito, A. 
Vinciarelli, K. Vicsi, C. Pelachaud and A. Nijholt. Berlin, Springer 
Verlag: 280-293.
Sutton, C. and A. McCallum (2006). An Introduction to Conditional Random 
Fields for Relational Learning. Introduction to Statistical Relational 
Learning. L. Getoor and B. Taskar. Boston, MA, MIT Press.
Young, S. (2010). "Cognitive user interfaces." IEEE Signal Processing 
Magazine 27(3): 128-140.
Zen, H., Y. Nankaku and K. Tokuda (2011). "Continuous stochastic feature 
mapping based on trajectory HMMs." IEEE Trans. on Audio, Speech, and 
Language Processing 19(2): 417-430.
Zen, H., K. Tokuda and T. Kitamura (2004). An introduction of trajectory 
model into HMM-based speech synthesis. ISCA Speech Synthesis Workshop. 
Pittsburgh, PA, pp. 191-196.

Rémi Ronfard, IMAGINE team, INRIA / LJK, Grenoble
Tel 334 76 61 53 03 Cell. 336 71 08 88 81
