How do you spot a memorable or “epic” moment in live video content? Meeyoung Cha and Kunwoo Park talk about their recent work in EPJ Data Science, which uses a deep learning model to identify such epic moments.

Live streaming has become a popular part of internet culture. Platforms like TikTok and Twitch have roughly 60 to 140 million monthly active users.

Virtually anyone can stream content on these platforms, and the sheer volume of long, seemingly mundane videos makes it difficult to find epic and funny moments.

In a new study published in EPJ Data Science, we show how artificial intelligence (AI) can help human editors quickly identify interesting segments of live streaming content.

The AI decides which segments are interesting by jointly considering audience reactions in chat messages, the structure of video frames, the number of views, and streamer information. Among these, emojis and audience reactions are critical components that guide the AI algorithm.

Deep learning is used to learn the characteristics of epic moments from multimodal data and to suggest interesting video segments from different contexts, including victorious, funny, embarrassing and awkward moments.

In a user study, the AI's suggestions were comparable to expert suggestions in spotting epic moments.

To train the algorithm, we need data that represent “epicness”. Twitch offers manually created “clips”, or Twitch highlights, which are 5- to 60-second segments contributed by streamers and viewers.

Figure 1 shows an example of live streaming content with a duration of 11 minutes and 55 seconds. Two segments of this content were highlighted as recommended “clips” that were 53 seconds and 30 seconds long, respectively.
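One way to picture how such clips could become training data is to cut the stream into short windows and mark every window that overlaps a clip. The sketch below is only an illustration under assumptions: the 5-second window size, the `label_windows` helper and the clip start offsets are hypothetical and are not taken from the study. Only the stream length and the clip durations follow the Figure 1 example.

```python
from typing import List, Tuple


def label_windows(stream_seconds: int,
                  clips: List[Tuple[int, int]],
                  window: int = 5) -> List[int]:
    """Return one 0/1 label per fixed-size window of the stream.

    A window is labelled 1 if it overlaps any clip interval [start, end).
    """
    labels = []
    for w_start in range(0, stream_seconds, window):
        w_end = min(w_start + window, stream_seconds)
        overlaps = any(w_start < c_end and c_start < w_end
                       for c_start, c_end in clips)
        labels.append(1 if overlaps else 0)
    return labels


# Stream length and clip durations follow the Figure 1 example
# (11 min 55 s; clips of 53 s and 30 s). The start offsets are made up.
STREAM_SECONDS = 11 * 60 + 55        # 715 seconds
CLIPS = [(120, 173), (600, 630)]     # hypothetical (start, end) in seconds

labels = label_windows(STREAM_SECONDS, CLIPS)
print(len(labels), sum(labels))      # 143 windows, 17 of them inside a clip
```

With labels like these, most windows are negative examples, which reflects how rare clipped moments are within a long stream.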

Figure 1. Segments of interest from the live stream are highlighted as two separate clips that received 21 views and 170,000 views, respectively. By collecting these clips, we can train an algorithm to automatically detect epic moments.

© The authors (2021)

The second clip reached over 170,000 views, suggesting that it is the more epic of the two. The illustration also shows user reactions to these selected video segments. Emojis and Twitch-specific emoticons (emotes) are often used in chat.

We gathered two million user-recommended clips and the accompanying chat conversations to understand the ingredients of epic moments. Our work defines an epic moment as a fun, bite-sized summary of long video content.

Epic moments are similar to video highlights in that both are short summaries of long videos, but they serve different purposes. Epic moments capture “enjoyable” moments, while highlights are “informative” in nature.

We found that emotes and user reactions play a vital role in finding epic moments.
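To make the role of chat reactions concrete, here is a minimal sketch that counts emotes per time window and surfaces the busiest window. It is a simple heuristic for illustration, not the model from the study. The chat log, the 10-second window and the helper names are hypothetical, and the emote vocabulary is a small illustrative subset.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def emote_counts_per_window(messages: Iterable[Tuple[float, str]],
                            emotes: Set[str],
                            window: int = 10) -> Counter:
    """Count emote tokens in each `window`-second slice of the chat."""
    counts: Counter = Counter()
    for timestamp, text in messages:
        hits = sum(1 for token in text.split() if token in emotes)
        counts[int(timestamp // window)] += hits
    return counts


def top_windows(counts: Counter, k: int = 1) -> List[int]:
    """Return the indices of the k windows with the most emotes."""
    return [index for index, _ in counts.most_common(k)]


# Hypothetical chat log: (seconds since stream start, message text).
EMOTES = {"LUL", "PogChamp", "Kappa"}
CHAT = [
    (3.0, "hello everyone"),
    (12.5, "LUL LUL"),
    (14.0, "that was close PogChamp"),
    (15.2, "LUL"),
    (27.0, "Kappa"),
    (41.0, "gg"),
]

counts = emote_counts_per_window(CHAT, EMOTES)
print(top_windows(counts))  # [1], i.e. the slice from 10 s to 20 s
```

A raw count like this ignores which emotes are used and what is on screen, which is one reason to combine chat with other signals.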

Figure 2 shows clustering results for emotes that appear in user chats, projected onto a two-dimensional space by t-distributed stochastic neighbor embedding (t-SNE).

The color indicates the category of a cluster, and the diagram shows five sample word tokens that are closest to each emote cluster. We can see that similar-looking emotes work as emotional expressions on Twitch.
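The idea of picking the word tokens closest to an emote cluster can be sketched in a few lines. The example below assumes that emotes and words share one embedding space, represents a cluster by the mean of its emote vectors, and ranks words by cosine distance. The distance metric, the two-dimensional toy vectors and the helper names are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_tokens(cluster_vectors: List[List[float]],
                   word_vectors: Dict[str, List[float]],
                   k: int = 5) -> List[str]:
    """Return the k words whose vectors are closest to the cluster centroid."""
    center = centroid(cluster_vectors)
    ranked = sorted(word_vectors,
                    key=lambda w: cosine_distance(center, word_vectors[w]))
    return ranked[:k]


# Toy 2-D embeddings, invented for this example.
EMOTE_CLUSTER = [[0.9, 0.1], [0.8, 0.2]]
WORDS = {
    "haha": [1.0, 0.1],
    "funny": [0.9, 0.3],
    "sad": [0.1, 1.0],
    "rip": [0.0, 0.9],
}

print(nearest_tokens(EMOTE_CLUSTER, WORDS, k=2))  # ['haha', 'funny']
```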

Figure 2. Sample emotes and associated text for each cluster. Cluster representation of the embedding vectors of each emote (top) and sample emotes with their associated text tokens (bottom). Representations are plotted with t-SNE, and related tokens are selected by the distance between the emote cluster and the word vector.

© The authors (2021)

These findings are used to build a deep learning model called Multimodal Detection with INTerpretability (MINT), which brings together and analyzes key features such as chat, video metadata and number of views.

The features from these three sources capture different aspects of epic moments, and combining these cues leads to better predictions.

A user study also confirmed that the algorithmic suggestions were rated as just as enjoyable as human-recommended clips.

In addition, the algorithmic suggestions cover various contexts, such as failed game moments, funny dance moves, a surprise comeback during the game, and moments outside of the game, as shown in Figure 3.

Figure 3. Examples of algorithmic suggestions for epic moments. The MINT model can detect (a) failures, (b) funny, (c) dubbing and (d) non-playing moments.

© The authors (2021)

In contrast, most human suggestions contained game-winning moments.

As a growing number of people spend time watching live streaming content online, AI suggestions can help editors and viewers discover epic moments.

Researchers interested in the code for the MINT algorithm and the clip dataset used for training can find more information on our GitHub page.

