How do you spot a memorable or “epic” moment in live video content? Meeyoung Cha and Kunwoo Park discuss their recent work in EPJ Data Science, which uses a deep learning model to identify such epic moments.
Photo by Jack B on Unsplash
Live streaming has become a popular internet culture. Platforms like TikTok and Twitch each draw between 60 and 140 million monthly active users.
Virtually anyone can stream content on these platforms, which makes it difficult to find epic and funny moments amid the sheer volume of long, seemingly mundane videos.
In a new study published in EPJ Data Science, we show how Artificial Intelligence (AI) can help human editors quickly identify interesting segments of live streaming content.
The model makes this decision by jointly considering audience reactions in chat messages, the structure of video frames, view counts, and streamer information. Among these, emojis and audience reactions are the critical signals that guide the AI algorithm.
Deep learning is used to learn the characteristics of epic moments from multimodal data and to suggest interesting video segments across different contexts, including victories as well as funny, embarrassing, and awkward moments.
In a user study, the AI's suggestions were comparable to expert suggestions in spotting epic moments.
To train the algorithm, we need key data that represent “epicness”. Twitch users manually create “clips”, or Twitch highlights, which are 5- to 60-second segments contributed by streamers and viewers.
Figure 1 shows an example of live streaming content with a duration of 11 minutes and 55 seconds. Two segments of this content were highlighted as recommended “clips” that were 53 seconds and 30 seconds long, respectively.
The second clip reached over 170,000 views, suggesting it was the more epic of the two. The illustration also shows user reactions to these selected video segments; emojis and Twitch-specific emoticons appear frequently in chat.
We gathered two million user-recommended clips along with the accompanying chat conversations to understand the ingredients of epic moments. Our work defines an epic moment as a fun, bite-sized summary of long video content.
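As a rough illustration, user-created clips can serve as labels for training: segments of a stream that overlap a clip become positive examples of “epicness”. This is a minimal sketch; the 10-second segment length and the overlap rule here are our illustrative assumptions, not the paper's exact preprocessing.

```python
def label_segments(stream_seconds, clips, segment_len=10):
    """Label fixed-length segments of a stream as epic (True) if they
    overlap a user-created clip.

    clips: list of (start, end) times in seconds for user-made clips.
    Returns a list of (start, end, is_epic) tuples.
    """
    labels = []
    for start in range(0, stream_seconds, segment_len):
        end = start + segment_len
        # A segment is a positive example if any clip overlaps it.
        epic = any(c_start < end and c_end > start for c_start, c_end in clips)
        labels.append((start, end, epic))
    return labels

# Example: an 11:55 stream (715 s) with two clips of 53 s and 30 s,
# as in Figure 1 (the clip positions here are made up).
segments = label_segments(715, [(120, 173), (400, 430)])
positives = [s for s in segments if s[2]]
```

In a real pipeline, each labeled segment would then be paired with the chat messages and video frames that fall inside its time window.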
Epic moments are similar to video highlights in that both are short summaries of long videos, but they differ in purpose: epic moments capture “enjoyable” moments, while highlights are “informative” in nature.
We found that emotes and user reactions play a vital role in finding epic moments.
Figure 2 shows the clustering results for emotes appearing in user chats, projected onto a two-dimensional space by t-distributed stochastic neighbor embedding (t-SNE).
The color indicates each cluster's category, and the diagram shows the five word tokens closest to each emote cluster. Similar-looking emotes function as the same kind of emotional expression on Twitch.
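The general recipe behind such a figure can be sketched as follows. The random vectors below are toy stand-ins for learned emote embeddings (the study learns representations from real Twitch chat logs), and the cluster count is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Toy stand-ins for learned emote embeddings (e.g., word vectors trained
# on chat messages); the real study derives these from Twitch chat data.
rng = np.random.default_rng(0)
emote_vectors = rng.normal(size=(30, 50))  # 30 emotes, 50-dim embeddings

# Project to 2-D with t-SNE for visualization, then group nearby emotes.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(emote_vectors)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
```

Each cluster can then be labeled by inspecting the chat tokens that co-occur most often with its emotes, as in Figure 2.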
These findings informed the design of a deep learning model called Multimodal Detection with INTerpretability (MINT), which brings together and analyzes key features such as chat, video metadata, and view counts.
The features from these three modalities capture different aspects of epic moments, and combining these clues leads to better predictions.
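The idea of combining clues from several modalities can be sketched with a simple late-fusion scorer. To be clear, MINT's actual architecture uses learned neural encoders and is described in the paper; the linear scorer, feature names, and dimensions below are purely illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def epic_score(chat_feat, frame_feat, meta_feat, weights, bias=0.0):
    """Late fusion: concatenate per-modality feature vectors, then map
    the fused vector to an epicness probability in (0, 1).
    MINT uses learned neural encoders per modality; this linear scorer
    is only a sketch of the fusion idea."""
    fused = np.concatenate([chat_feat, frame_feat, meta_feat])
    return sigmoid(fused @ weights + bias)

# Hypothetical feature vectors for one video segment.
chat = np.array([0.8, 0.6, 0.9])   # e.g., emote rate, chat burst, sentiment
frame = np.array([0.3, 0.7])       # e.g., scene-change and motion signals
meta = np.array([0.5])             # e.g., normalized view count
w = np.ones(6) * 0.5               # toy weights; learned in practice
score = epic_score(chat, frame, meta, w)
```

Segments scoring above a chosen threshold would be surfaced to editors as candidate epic moments.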
A user study also confirmed that the algorithmic suggestions were rated just as enjoyable as human-recommended clips.
In addition, the algorithmic suggestions include various contexts such as failed game moments, funny dance moves, a surprise comeback during the game, and moments outside of the game, as shown in Figure 3.
In contrast, most human suggestions contained game-winning moments.
As a growing population spends time watching live streaming content online, AI suggestions can help editors and viewers discover epic moments.
Researchers interested in the code for the MINT algorithm and the clip dataset used for training can find more information on our GitHub page: https://github.com/dscig/twitch-highlight-detection