Researchers find machine learning can predict how we rate social interactions in videoconference conversations
Since the onset of the COVID-19 pandemic, workers have spent countless hours in videoconferences—now a fixture of office life. As more people work and live remotely, videoconferencing platforms such as Zoom, MS Teams, FaceTime, Slack, and Discord have become a large part of socializing among family and friends as well. Some exchanges are more enjoyable and flow better than others, raising the question of how the medium of online meetings could be improved to boost both efficiency and job satisfaction.
A team of New York University scientists has developed an AI model that can identify aspects of human behavior in videoconferences, such as conversational turn-taking and facial actions, and predict in real time, based on these behaviors, whether a meeting is perceived as enjoyable and fluid (comfortable and flowing) or as awkward and marked by stilted turn-taking.
“Our machine learning model reveals the intricate dynamics of high-level social interaction by decoding subtle patterns within basic audio and video signals from videoconferences,” says Andrew Chang, a postdoctoral fellow in NYU’s Department of Psychology and the lead author of the paper, which appears in the proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). “This breakthrough represents an important step toward dynamically enhancing videoconference experiences by showing how to avoid conversational derailments before they occur.”
To develop the model, the researchers trained it on more than 100 person-hours of Zoom recordings, using voice, facial expressions, and body movements as input. The model learned to distinguish disruptive moments, when a conversation became unfluid or unenjoyable, from smoother, more fluid exchanges.
Notably, the model gauged conversations with unusually long gaps in turn-taking as less fluid and enjoyable than those in which participants spoke over one another. Put another way, “awkward silences” were found to be more detrimental than the chaotic, enthusiastic dynamics of a heated debate.
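The turn-taking measure described above can be illustrated with a short sketch. This is a hypothetical simplification, not the authors' actual pipeline: the function names, the turn-list format, and the 2-second silence threshold are all illustrative assumptions. Given speaker turns with start and end times, it computes the gap between consecutive turns (negative values indicate overlapping speech) and flags unusually long silences:

```python
# Hypothetical sketch: measuring turn-taking gaps and overlaps from a list
# of speaker turns (speaker, start_sec, end_sec). Names and the threshold
# are illustrative assumptions, not taken from the paper.

def turn_gaps(turns):
    """Return the silence (positive) or overlap (negative) between
    consecutive turns, in seconds."""
    ordered = sorted(turns, key=lambda t: t[1])  # sort by start time
    return [nxt[1] - cur[2] for cur, nxt in zip(ordered, ordered[1:])]

def flag_awkward_silences(turns, threshold=2.0):
    """Flag gaps longer than `threshold` seconds as potential
    'awkward silences'; overlaps (negative gaps) are not flagged."""
    return [gap for gap in turn_gaps(turns) if gap > threshold]

turns = [
    ("A", 0.0, 4.2),
    ("B", 4.5, 9.0),    # ~0.3 s gap: fluid hand-off
    ("A", 12.5, 15.0),  # 3.5 s gap: a long "awkward silence"
    ("B", 14.6, 18.0),  # 0.4 s overlap (negative gap): speaking over
]
print(flag_awkward_silences(turns))  # flags only the 3.5 s gap
```

Consistent with the finding above, a scorer built on such features would penalize the long silence between turns while treating the brief overlap as ordinary, even lively, conversational flow.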
To confirm the accuracy of the model’s assessments, an independent team of more than 300 human judges viewed samples of the same videoconference footage, rating the fluidity of the conversations and how much they thought the meeting participants enjoyed the exchanges. Overall, the human raters closely matched the machine-learning model’s assessments.
“Videoconferencing is now a prominent feature in our lives, so understanding and addressing its negative moments is vital for not only fostering better interpersonal communication and connection, but also for improving meeting efficiency and employee job satisfaction,” says Dustin Freeman, a visiting scholar in NYU’s Department of Psychology and the senior author of the paper. “By predicting moments of conversational breakdown, this work can pave the way for videoconferencing systems to mitigate these breakdowns and smooth the flow of conversations by either implicitly manipulating signal delays to accommodate or explicitly providing cues to users, which we are currently experimenting with.”