Dario Pittera

Italian NLP Analysis on "Festival di Sanremo 2024" YouTube comments

The Festival di Sanremo, Italy's most prestigious singing contest, has once again captivated audiences worldwide with its dazzling performances. This year's festival saw an array of talented singers competing for the coveted title, each bringing their unique style and flair to the stage.


In this post, I'd like to investigate viewers' perceptions of and opinions about the singers. Specifically, we will create a Natural Language Processing (NLP) dashboard with Streamlit, a Python library for building data-driven web apps.


Let's have a closer look at how this dashboard operates and what insights it uncovers.





Understanding the code

Importing libraries

Let's import necessary libraries such as Streamlit, Pandas, Matplotlib, Seaborn, NLTK, and more. These libraries provide various functionalities for data manipulation, visualization, and NLP processing.


import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# NLP libraries
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.util import ngrams
italian_stopwords = stopwords.words('italian')
english_stopwords = stopwords.words('english')
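from PIL import Image
from wordcloud import WordCloud  # used by the word cloud plots
from sklearn.feature_extraction.text import CountVectorizer  # used by get_top_ngram()
from transformers import pipeline  # used by the summary section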

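# sentiment analysis for Italian language
# TF_USE_LEGACY_KERAS only has an effect as an environment variable set before the import
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"
from italian_sentiment import SentimentAnalyzer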

# Initialize the SentimentAnalyzer
analyzer = SentimentAnalyzer()

st.set_option('deprecation.showPyplotGlobalUse', False) # silence the warning about using the global pyplot figure

Data visualisation functions

The plotFromData() function generates a word cloud from the comments in a CSV file. It drops rows without a comment, joins the remaining comments into a single text, removes Italian and English stopwords plus a few domain words, and renders the word cloud. To avoid recomputing it on every rerun, we cache the function with the "@st.cache_data" decorator.


@st.cache_data
def plotFromData(csv):

    # Import data
    df = pd.read_csv(csv, encoding='latin-1', skiprows=0, index_col=0)
    df.dropna(subset=["comment"], inplace=True)
    text = " ".join(review for review in df.comment)
    # Create stopwords list (Italian + English + a few domain words)
    stopwords = set(italian_stopwords)
    stopwords.update(["canzone", "song", "molto", "sanremo", "festival"])
    stopwords.update(english_stopwords)
    # Generate word cloud
    wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=1600, height=800).generate(text)
    # Display the word cloud
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    st.pyplot(plt.gcf())
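
What a word cloud visualises is, in essence, word frequency after stopword removal. The sketch below illustrates that idea with the standard library only. It is a simplified stand-in, not how WordCloud tokenizes text internally, and the function name and sample comments are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(comments, stopwords):
    """Count words across comments, skipping stopwords (case-insensitive)."""
    stopwords = {w.lower() for w in stopwords}
    counts = Counter()
    for comment in comments:
        # \w+ keeps accented words such as "più" in one piece
        for word in re.findall(r"\w+", comment.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts

# Hypothetical comments, for illustration only
comments = ["Che bella canzone", "Bella voce, bella canzone", "La voce è stupenda"]
freqs = word_frequencies(comments, stopwords={"che", "la", "è", "canzone"})
print(freqs.most_common(2))  # [('bella', 3), ('voce', 2)]
```

The larger a word's count, the bigger it would appear in the cloud; adding a word to the stopword set removes it entirely.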

The get_top_ngram() function calculates the top n-grams (sequences of n words) present in the comments, providing insights into frequently occurring phrases.


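def get_top_ngram(corpus, n=1):
    # Count every n-gram (sequence of n consecutive words) in the corpus
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each n-gram's count over all comments
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by frequency, most frequent first, and keep the top 10
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]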
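
To make the idea of n-gram counting concrete, here is a minimal standard-library sketch. It is not the CountVectorizer implementation (which, for instance, applies its own token pattern); it simply slides a window of n words over each comment. The function name and sample comments are hypothetical.

```python
from collections import Counter

def top_ngrams(comments, n=2, k=10):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    counts = Counter()
    for comment in comments:
        tokens = comment.lower().split()
        # Slide a window of n tokens over the comment
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)

# Hypothetical comments, for illustration only
comments = ["sei la più brava", "la più brava di tutti", "brava brava"]
top = top_ngrams(comments, n=2, k=3)
print(top)
```

Comments shorter than n words contribute nothing, which is why large values of n quickly yield very few repeated phrases.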

Finally, a function sets the thresholds for the sentiment analysis and decides which emotion to output.


def chooseMood(mood_value):
    # Map the overall mood score to the name of an emoji image
    if mood_value <= 0.2:
        return "anger"
    if mood_value <= 0.5:
        # the 0.2-0.4 and 0.4-0.5 bands both map to "smile"
        return "smile"
    return "happy"
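
The same thresholds can also be expressed as a small lookup table. The sketch below is only an illustration under stated assumptions: the (positive, negative) score pairs are made up, and the mood value is taken as the mean positive score minus the mean negative score, which is how the dashboard code computes it further down.

```python
from bisect import bisect_left
from statistics import mean

THRESHOLDS = [0.2, 0.5]             # upper bounds of the first two bands
MOODS = ["anger", "smile", "happy"]

def mood_from_scores(scores):
    """scores: list of (positive, negative) pairs, one per comment."""
    mood_value = mean(p for p, _ in scores) - mean(n for _, n in scores)
    # bisect_left keeps a value equal to a threshold in the lower band (<=)
    return mood_value, MOODS[bisect_left(THRESHOLDS, mood_value)]

# Hypothetical scores, for illustration only
scores = [(0.9, 0.1), (0.7, 0.2), (0.4, 0.5)]
value, name = mood_from_scores(scores)
print(round(value, 2), name)  # 0.4 smile
```

Keeping the thresholds in one list makes it easier to add or rename a band without touching the branching logic.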

Dashboard creation


The Streamlit dashboard is organized into several sections, each providing different insights into the audience's reactions.

  1. Word Cloud: Visualizes the most common words used in the comments. We then added a filter to accept additional stop words to personalise the word cloud output.

  2. N-gram Analysis: Displays frequently occurring phrases (n-grams) in the comments. Here, we added a field to choose how many co-occurring words we want to consider.

  3. Language Distribution: Illustrates the distribution of comment languages through a donut chart.

  4. Sentiment Analysis: Analyzes the sentiment of the comments, indicating whether they are positive or negative.

  5. Short Summary: Provides a concise summary of the comments.


In addition, two small examples precede the dashboard: the first shows a word cloud for Angelina Mango's "La rondine" performance (example 1), and the second lets you upload a CSV file with the comments for a singer (example 2).


Let's have a look at the code:


# Set Streamlit page configuration
st.set_page_config(
    page_title="NLP on Sanremo 2024 singers",
    page_icon="🎤",
    layout="wide",
)

st.title("NLP on Sanremo 2024 singers")

# Example 1: Displaying Angelina Mango's data
st.header("Example 1")
# Function to plot word cloud from uploaded data
plotFromData("angelina_lower.csv")

# Example 2: Uploading custom data
st.header("Example 2 - Upload your file")
uploaded_file = st.file_uploader(label="Upload your comments file")
if uploaded_file is not None:
    plotFromData(uploaded_file)

And now the dashboard! We start by creating a new section and importing the data frame with all the comments for each singer. I obtained the comments through the Google APIs, taking the first 400 comments for each singer from RAI's official videos of the performances rather than from the songs' official videos, so that the comments are fresher.


# Example 3: Simple Dashboard
st.header("Example 3 - Simple Dashboard")

st.write("Now, through the Google APIs we collected 400 comments for each of the contestants of the Festival \
          of Sanremo 2024. We will now build a dashboard to show the comments' juice at a glance. \
          Specifically, we show a word cloud, n-grams of variable length, the language of the comments and \
          a sentiment analysis of the comments. Finally, we will show the most salient comments \
          - I won't take responsibility for what will be shown 😅.")

# import new df
df = pd.read_csv("singers.csv", encoding='latin', skiprows=0, index_col=0)
df.dropna(subset=["comment"], inplace=True)
df["singer"] = [x.title() for x in df["singer"]]
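
The collection step itself is not part of the dashboard code. As a rough sketch of how it might look, assuming the Google API in question is the YouTube Data API v3 `commentThreads.list` endpoint, the function below follows page tokens until a limit is reached. The function name, the `fetch_page` callable and the fake pages are hypothetical; the real API call appears only in a comment and is not run here.

```python
def collect_comments(fetch_page, limit=400):
    """Collect up to `limit` top-level comment texts, following page tokens.

    fetch_page(page_token) must return a dict shaped like a
    commentThreads.list response ("items" and optional "nextPageToken").
    """
    comments, token = [], None
    while len(comments) < limit:
        response = fetch_page(token)
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])
        token = response.get("nextPageToken")
        if not token:
            break  # no more pages
    return comments[:limit]

# With google-api-python-client, fetch_page could look like this
# (API_KEY and VIDEO_ID are placeholders):
#   from googleapiclient.discovery import build
#   youtube = build("youtube", "v3", developerKey=API_KEY)
#   fetch_page = lambda token: youtube.commentThreads().list(
#       part="snippet", videoId=VIDEO_ID, maxResults=100, pageToken=token).execute()

def make_item(text):
    # Build a fake item with the same nesting as a real response
    return {"snippet": {"topLevelComment": {"snippet": {"textDisplay": text}}}}

# Fake two-page response, for illustration only
pages = {
    None: {"items": [make_item("Bravissima!"), make_item("Che voce")], "nextPageToken": "p2"},
    "p2": {"items": [make_item("Top")]},
}
collected = collect_comments(pages.get, limit=400)
print(collected)  # ['Bravissima!', 'Che voce', 'Top']
```

Passing the page fetcher in as an argument keeps the pagination logic testable without network access or an API key.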

Now we create a filter that will let you select the singer's name:


# top-level filters
singer_filter = st.selectbox("Select singer", sorted(pd.unique(df["singer"])))

# dataframe filter
df = df[df["singer"] == singer_filter]

And now we create a grid with two columns and two rows that will accommodate all the pieces of our dashboard:


# First row with two columns
col1, col2 = st.columns(2)

with col1:
    newPar()  # helper defined elsewhere in the app (not shown in this post)

    st.markdown("#### Word cloud")
    new_stopwords = st.text_input("Insert additional stopwords, comma space separated:")
    new_stopwords = new_stopwords.split(", ")
    singers_names = [x for x in df["singer"].iloc[0].split()] # remove singer's name from words cloud

    # join all the comments and count the words they contain
    text = " ".join(review for review in df.comment)
    st.text("The comments considered total {} words.".format(len(text.split())))

    # START PLOTTING
    # Create stopword list:
    stopwords = set(italian_stopwords)
    stopwords.update(english_stopwords)
    stopwords.update(["canzone", "canzoni", "song", "songs", "festival", "sanremo", "br", \
                      "www", "youtube", "sempre", "mai", "comunque", "quot", "minuto", \
                      "minuti", df["singer"].iloc[0],df["singer"].iloc[0].lower(), *new_stopwords, *singers_names
                      ])

    # Generate a word cloud image
    wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400).generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    # plt.margins(x=0,y=0)
    st.pyplot(plt.gcf(), clear_figure=True)

with col2:
    newPar()
    
    st.markdown("#### N-gram")

    how_many_together = st.number_input("Words per n-gram:", min_value=1, max_value=10, value=4, step=1, on_change=None)

    top_n_bigrams = get_top_ngram(df.comment, how_many_together)  # already limited to the top 10
    # Drop n-grams that are themselves a stopword (only has an effect for 1-grams)
    top_n_bigrams = [gram for gram in top_n_bigrams if gram[0] not in stopwords]
    
    x,y=map(list,zip(*top_n_bigrams)) 
    sns.barplot(x=y,y=x)
    plt.title(f"{how_many_together}-gram")    
    st.pyplot(plt.gcf(), clear_figure=True)

In this first row, we display the word cloud and the n-grams. Each has an additional field: one lets you insert new stop words, the other lets you choose how many words make up each n-gram.
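
The stop word field is split on ", " exactly, so an entry typed without the space, or an empty field, ends up as an odd entry in the list. A slightly more forgiving parser could look like the sketch below; the helper name is hypothetical and it is not part of the original app.

```python
def parse_stopwords(raw):
    """Split a comma-separated string into clean, lowercase stop words."""
    # Strip surrounding whitespace and drop empty entries
    return [word.strip().lower() for word in raw.split(",") if word.strip()]

print(parse_stopwords("ciao,  Bella , ,"))  # ['ciao', 'bella']
print(parse_stopwords(""))                  # []
```

In the dashboard, the result could be passed to `stopwords.update(...)` in place of the plain split.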


Now let's continue with the other row:


col3, col4 = st.columns(2)

with col3:
    # languageDetection(df)
    st.markdown("#### Comments from the world")

    nations_label = list(df[df['singer']==singer_filter]['lang'].value_counts().index)
    nations_count = list(df[df['singer']==singer_filter]['lang'].value_counts())

    # DONUT PLOT
    # The slices will be ordered and plotted counter-clockwise.
    labels = nations_label
    sizes = nations_count
    explode = np.zeros(len(nations_label))
    explode[0] = 0.04

    _, texts, autotexts = plt.pie(sizes, labels=labels, pctdistance=0.75, labeldistance=1.1, explode=explode,
            startangle=0, autopct='%i%%', shadow=False, rotatelabels=False)

    for text in texts:
        text.set_color('black')
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_size(11)
        autotext.set_weight("bold")
            
    # Draw a circle at the pie's centre
    centre_circle = plt.Circle((0,0),0.55,color='white', fc='white',linewidth=1.25)
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)

    # Set aspect ratio to be equal to draw a circular shape
    plt.axis('equal')
    st.pyplot(plt.gcf(), clear_figure=True)

with col4:

    st.markdown("#### Sentiment Analysis - this might take a while")  
    st.write('Angry icons created by Eucalyp - [Flaticons](https://www.flaticon.com/free-icons/angry)')

    query = df.loc[(df["singer"] == singer_filter) & (df["lang"] == "it")]["comment"]

    # Predict sentiment for each sentence
    results = analyzer.predict_sentiment(query)

    emo_res = pd.DataFrame(results, columns=['sentiment','pos', 'neg']);

    # check the overall mood and select the correct image
    mood = emo_res["pos"].mean() - emo_res["neg"].mean()                
    mood_name = chooseMood(mood)
    im = Image.open(f'./emoji/{mood_name}.png')
    im.thumbnail((150, 150))

    # Convert relative positions to figure coordinates
    fig = plt.gcf()
    ax = plt.gca()
    x_fig, y_fig = ax.transAxes.transform([1.65, 1.65])

    sns.barplot(emo_res, palette=["#77DD77", "#FF6961"])
    plt.title(f"Overall comments' mood for {df['singer'].iloc[0]}");
    plt.xlabel("Valence")
    plt.ylabel("Sentiment strength")
    plt.ylim(0,1)
    plt.figimage(im, x_fig, y_fig)
    st.pyplot(plt.gcf(), clear_figure=True)

Here, we visualise the language of the comments (the detection algorithm makes some classification mistakes). We then extract the overall sentiment of the Italian comments from the words used, display a simple bar plot of positive vs. negative sentiment, and sum up the emotion with an icon.
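
The numbers behind the donut chart are simply the share of comments per detected language. A minimal standard-library sketch of that computation, with made-up language codes and a hypothetical function name, could be:

```python
from collections import Counter

def language_shares(langs):
    """Return (language, count, percentage) tuples, most common first."""
    counts = Counter(langs)
    total = sum(counts.values())
    return [(lang, count, 100 * count / total) for lang, count in counts.most_common()]

# Hypothetical detected languages, for illustration only
langs = ["it"] * 6 + ["en"] * 3 + ["es"]
shares = language_shares(langs)
for lang, count, share in shares:
    print(f"{lang}: {count} comments ({share:.0f}%)")
```

Printing the raw counts next to the percentages makes it easier to spot languages backed by only one or two comments, which are the most likely misclassifications.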


Finally, we generate a short summary of the comments using a pre-trained summarization model. We're utilizing the Hugging Face pipeline function to load the summarization pipeline. The pipeline function allows us to easily access pre-trained models and apply them to specific tasks. Here, we're specifying the task as "summarization" and choosing the model "facebook/bart-large-cnn", which is a variant of the BART (Bidirectional and Auto-Regressive Transformers) model developed by Facebook AI. The output is a condensed version of the input comments, capturing the essential information conveyed by the original text.


st.markdown("#### Short summary - this might take a while")  

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Join all the comments for the selected singer
text = " ".join(review for review in df[df["singer"]==singer_filter]["comment"])

# Truncate the input text. max_position_embeddings is a token count, so slicing
# by characters is only a rough, usually conservative cut
max_input_length = summarizer.model.config.max_position_embeddings
truncated_text = text[:max_input_length]

# Summarize the text
summary = summarizer(truncated_text, max_length=150, min_length=50, do_sample=False)

# Print the summary
st.write(summary[0]['summary_text'])
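
The slice `text[:max_input_length]` cuts by characters and can stop in the middle of a comment. One possible alternative, sketched here with a hypothetical helper that is not part of the original app, is to keep only whole comments that fit within a character budget:

```python
def join_within_budget(comments, max_chars):
    """Join whole comments with spaces without exceeding max_chars."""
    kept, used = [], 0
    for comment in comments:
        extra = len(comment) + (1 if kept else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break  # the next comment would not fit
        kept.append(comment)
        used += extra
    return " ".join(kept)

# Hypothetical comments, for illustration only
print(join_within_budget(["aaaa", "bbb", "cc"], max_chars=8))  # aaaa bbb
```

The result could be passed to the summarizer in place of `truncated_text`; the budget is still measured in characters, so it remains an approximation of the model's token limit.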

Here is the result:



Conclusion

The Festival di Sanremo is not only a celebration of music but also a reflection of the audience's sentiments and preferences. By combining the power of Streamlit with various NLP techniques, this dashboard offers comprehensive insights into comment trends, sentiment distributions, and frequently occurring phrases, enabling a deeper understanding of audience reactions at a glance.
