As the pandemic reshapes our conception of what can be done remotely, virtual and augmented reality meetings and events have become increasingly important. With computers mediating our communication, it is possible to create more elaborate and immersive gaming and meeting interfaces. These new ways of being together leverage advanced technologies such as computer vision, VR, and AR, and rely on user gesture and speech input via camera and microphone in novel ways.
Our Lead AI Scientist Hedvig Kjellström has dedicated her academic career to researching the interface between humans and robots, exploring how AI and computer vision can help machines better understand what we say, how we gesture, and how we move. In this interview, Hedvig lets us into her world of advanced human-centered AI.
Hedvig, your research at KTH Royal Institute of Technology in Stockholm has focused on AI technologies and human communication. How do you see this growing field of human-centered AI?
– To communicate better through virtual channels, we need technology that both understands us and can synthesize our communication. The first need is about a human successfully conveying to a machine what should be transferred; the second is about the machine communicating that intent to another human.
When it comes to the technology needed for understanding human communication, there have been tremendous advances in the field. Today, various techniques based on deep learning focus on extracting communicative content from gaze and body motion observed in video. Another fast-growing related field is speech-to-text interpretation: with the emergence of deep learning, commercial systems such as Siri and Alexa have been developed. The technology is now mature enough, and the use of spoken interfaces will likely increase as more use cases come up.
If machines are to convey what we want to say, we need to develop technology for synthesizing human-like communication. With the urgent need to improve remote work through virtual conferences, meetings, and other gatherings, interest in human communication synthesis has also risen significantly in the gaming industry.
In computer graphics, generating human-like motion and animating avatars, e.g. characters in games, is getting more and more attention. In 2020 and 2021 we’ve seen innovative virtual experiences and gatherings which in the future could make greater use of body-language animation in robots and virtual avatars within human-computer interfaces (see the Gesticulator example below).
One major research theme around social robots and systems such as Siri or Alexa is the dialogue system that generates the agent’s semantic output – in other words, the text or words it “speaks”. This can be seen as the “brain” of the AI-driven tool.
Describe one example of a human-centered AI project you’re involved in.
– At KTH Royal Institute of Technology I supervise a research project called Gesticulator, where we give robots and virtual avatars human-like body language. The method is based on deep learning and is trained on examples of how humans gesture while speaking. It is then combined with the avatar’s speech generation to produce gestures that fit what the avatar says.
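The pipeline described above – speech features in, a matching gesture trajectory out – can be sketched as a toy in a few lines. Everything here is hypothetical and stands in for the real, learned components: the features are random numbers, and the “trained model” is just a fixed linear map rather than the actual Gesticulator network.

```python
import numpy as np

# Toy stand-in for speech features: one vector per audio frame
# (a real system would use e.g. spectral features and word embeddings).
rng = np.random.default_rng(0)
speech_features = rng.normal(size=(100, 8))   # 100 frames, 8 features each

# The "trained model" here is just a fixed linear map; in practice the
# mapping from speech to gesture is learned from recordings of people
# speaking and gesturing.
weights = rng.normal(size=(8, 3))             # 3 gesture parameters, e.g. joint angles
raw_gestures = speech_features @ weights      # one gesture pose per frame

# Smooth over time so the motion looks continuous rather than jittery.
kernel = np.ones(5) / 5
smooth_gestures = np.apply_along_axis(
    lambda col: np.convolve(col, kernel, mode="same"), 0, raw_gestures
)

print(smooth_gestures.shape)  # (100, 3): a pose trajectory aligned with the speech
```

The key property the sketch preserves is alignment: every speech frame yields a gesture pose, so the motion follows the rhythm of the speech.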
In an extension project called GestureBot (see the image below), the gesture generation was integrated into a complete web-based dialogue system – you can follow the link and interact with the gesturing avatar in real-time!
In other projects, we develop methods for agents to also interpret the non-verbal communication of human users. As mentioned above, understanding and communicating back are both crucial – a complete human-agent interaction system will need to both interpret and generate verbal as well as non-verbal communication.
In a gaming context, avatars having a more human-like way of gesturing could make the experience more immersive. The work done in Gesticulator and other similar projects is a step on the way towards that goal.
What is the main challenge in your work on solving communication between machines and people?
– The biggest challenge is that human communication has a really complex structure. There is also a lot of variability between individuals and cultures. It’s hard to program complex dependencies on underlying factors such as a person’s mood, previous experience, or contextual knowledge.
Moreover, the signals (speech audio, images of the person) can be really noisy, so it is tricky for a computer to separate the important information from unimportant factors such as the color of the person’s clothing, sounds and objects in the background, or variations in lighting, to mention a few.
With 25+ years of experience in computer vision, can you give us some perspective on how things have advanced during that time?
– Deep learning has made it all possible. Before, it was very difficult to reliably track a human in a video sequence – I spent my whole PhD, from 1997 to 2001, on that problem. Nowadays any iPhone can do it in real time, which is incredible.
This is of course thanks to better algorithms, to scientists being able to build on the earlier work of others, and to better computing power.
However, the major breakthrough has been the deep learning idea of formulating learning problems in a parallel fashion, so that the computations can be split across the huge number of computing units in a GPU.
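This parallel formulation can be made concrete with a small sketch: pushing samples through a network layer one at a time gives exactly the same result as one batched matrix multiply, and it is that single large matrix operation that a GPU can spread over thousands of cores. (NumPy on a CPU is used here purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(2)
batch = rng.normal(size=(64, 100))    # 64 input samples, 100 features each
weights = rng.normal(size=(100, 10))  # one network layer: 100 -> 10 units

# Sequential view: push one sample at a time through the layer.
one_by_one = np.stack([x @ weights for x in batch])

# Parallel view: the same layer applied to the whole batch as a single
# matrix multiply -- the formulation GPUs execute across many cores at once.
all_at_once = batch @ weights

print(np.allclose(one_by_one, all_at_once))  # True
```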
How do you see the future of human-centered AI unfold?
– The current big thing to solve is making deep learning methods interpretable and explainable – enabling humans to understand what is going on inside the learning algorithm. Deep learning methods are often referred to as “black-box methods”, where you don’t understand why the decision or prediction is what it is. It’s interesting to develop “grey-box methods” where you can trace the decisions the network makes on its way from input to output.
For example, if you have a method that classifies facial expressions into a number of emotions, a black-box method would only take the image as input and produce a class label such as “angry” as output, whereas a grey-box method might additionally explain that it paid attention to the shape of the eyebrows and the tightness of the lips.
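The black-box/grey-box contrast can be illustrated with a deliberately tiny model. This is not how a real deep network explains itself; it is a linear toy with hypothetical, hand-picked feature names, chosen because a linear model’s per-feature contributions can be read off directly and reported alongside the label.

```python
import numpy as np

# Toy "facial expression" described by three hand-picked features
# (hypothetical names -- a real system would learn its own features).
feature_names = ["brow_lowering", "lip_tightness", "mouth_openness"]

# A tiny linear classifier for the class "angry": the weights
# double as a record of what the model pays attention to.
weights = np.array([2.0, 1.5, -0.5])
features = np.array([0.9, 0.8, 0.1])     # one observed face

score = float(weights @ features)
label = "angry" if score > 1.0 else "not angry"

# Grey-box output: the decision plus per-feature contributions,
# sorted so the most influential evidence comes first.
contributions = weights * features
explanation = sorted(zip(feature_names, contributions),
                     key=lambda pair: -abs(pair[1]))

print(label)              # angry
print(explanation[0][0])  # brow_lowering: the most influential feature
```

A black-box system would stop at `label`; the grey-box version also surfaces `explanation`, which is what lets a human check whether the decision rested on sensible evidence.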
Methods that can use and weigh together information from many different sources in the learning process will also be important. In the emotion recognition example, for instance, it is relevant to encode knowledge from cognitive psychology about how humans express emotions. Current systems do this to a certain extent, e.g. by using the Facial Action Coding System (FACS).
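FACS-style knowledge encoding can be sketched as rules that describe emotions in terms of facial Action Units (AUs). The prototypes below follow commonly cited combinations (e.g. happiness as AU6 + AU12), but this is a simplified, illustrative subset, not a complete or authoritative coding.

```python
# A simplified, illustrative subset of FACS-style rules: emotions as
# combinations of facial Action Units (AUs).
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
}

def match_emotion(detected_aus):
    """Return emotions whose prototype AUs are all present in the detection."""
    return [emotion for emotion, aus in EMOTION_PROTOTYPES.items()
            if aus <= detected_aus]

# An AU detector (the learned part) feeds the rule-based part:
print(match_emotion({6, 12, 25}))  # ['happiness']
```

The division of labor is the point: a learned model detects the AUs from images, while knowledge from psychology constrains how those detections map to emotion labels.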
Eventually, more interpretable and explainable technologies will help us leverage them in everyday use, as only when we understand can we start building trust.