January-2021-WaiLEARN-Analysis-Spotify-Dataset

This is the WaiLEARN Group 001 repository for team collaboration.

View the Project on GitHub women-in-ai-ireland/January-2021-WaiLEARN-Analysis-Spotify-Dataset

Analysis of Spotify Data

The Project

Spotify is the largest audio streaming and media provider in the world, offering digital music, podcasts and video streaming services. With 345 million active users to date, it helps artists reach their audience, and one can imagine the amount of data Spotify collects. In this project we explored and analysed a Spotify dataset available on Kaggle to understand how music has changed over time and to run a sentiment analysis for the most popular artist, The Beatles. Because large datasets are tricky to analyse and understand, visualizing the data to reveal trends and features is a great help. We performed the following steps to meet the project objective:

1. Exploratory Data Analysis in Tableau

2. Twitter Sentiment Analysis in R

3. Statistical Analysis on three different artists in SPSS

4. Generating Insights

The analysis reveals some interesting trends in songs from 1921 to 2021 and shows how music has changed over time.

About data

The data is sourced from Kaggle (https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks), where it was collected through the Spotify API. It contains more than 175,000 songs along with artist, genre and year information. The attributes described here are calculated by Spotify.

The features and their descriptions are as follows [1]:

- ID: ID of the track.
- Acousticness: a confidence measure from 0 to 1 of whether the track is acoustic. 1 represents high confidence that the song is acoustic.
- Danceability: a value from 0 to 1 indicating how suitable the song is for dancing.
- Energy: the intensity and activity of a track, ranging from 0 to 1.
- Duration_ms: the duration of a track in milliseconds.
- Valence: a value from 0 to 1 describing the positivity conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound sad and negative.
- Popularity: the popularity of the track, ranging from 0 to 100.
- Tempo: the overall tempo of a track, with values ranging from -5 to 150. In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- Liveness: the liveness of the track, detected from the presence of an audience in the recording. The value ranges from 0 to 1.
- Loudness: the measured loudness of a track, with values ranging from -60 to 0.
- Speechiness: the presence of spoken words in the track, measured from 0 to 1. A track with a more speech-like recording has a value closer to 1.
- Year: ranges from 1921 to 2021.
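The documented ranges can be used as a quick sanity check before any analysis. The following is a minimal Python sketch, not part of the original Tableau/R/SPSS workflow. The lowercase column names and the two sample tracks are assumptions made for illustration, and Tempo is left out of the check.

```python
# Expected value ranges, copied from the feature descriptions above.
# The lowercase keys are assumed column names, not confirmed by the project.
FEATURE_RANGES = {
    "acousticness": (0.0, 1.0),
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "popularity": (0, 100),
    "loudness": (-60.0, 0.0),
    "year": (1921, 2021),
}


def out_of_range(track):
    """Return the names of features whose value falls outside its documented range."""
    problems = []
    for name, (low, high) in FEATURE_RANGES.items():
        if name in track and not (low <= track[name] <= high):
            problems.append(name)
    return problems


# Hypothetical tracks, invented for this example.
good_track = {"energy": 0.62, "loudness": -7.5, "popularity": 48, "year": 1969}
bad_track = {"energy": 1.4, "loudness": -7.5, "popularity": 48, "year": 1905}

print(out_of_range(good_track))
print(out_of_range(bad_track))
```

Rows flagged by such a check are worth inspecting before they feed into any visualization or test.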

Analysis

We first explored the data in Tableau to understand the features. Because the data is large and spans 1921 to 2021, visualizing it in Tableau helped us see how the features changed over time in relation to popularity. Let's go through the animations created in Tableau to understand some interesting trends in the data over time.

Animation of Loudness VS Popularity over the period of time :

Visualization link: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/LoudnessVSPopularity-Animation

The following figures show that loud songs were not very popular in the early years and began gaining popularity from 1950 onwards. In 2020 loud songs fall into two groups, one with higher and one with lower popularity.

1 2 1

Animation of Speechiness VS Popularity over the period of time :

Visualization link: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/SpeechinessVSPopularity-Animation

The figure below shows that speechiness is very low on average and that popularity is not very high at any speechiness value. In 1921 songs with very low speechiness were not popular. Popularity did not increase even when the speechiness of songs increased. By 2020 songs with low speechiness had started gaining popularity. This is interesting to note!

1 2 3

Animation of Energy VS Popularity over the period of time :

Visualization link: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/EnergyVSPopularity-Animation

On average the Energy value is 0.5, the midpoint of its range, but popularity is not that high. Over time the energy of songs increased and so did their popularity.

4 5 6

Animation of Instrumentalness VS Popularity over the period of time :

Visualization link: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/InstrumentalnessVSPopularity-Animation2

7 8 9

Popularity of artists ranked in descending order:

Visualization link: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/popularartists

Over the period 1921-2021, The Beatles is the most popular artist.

10

Dashboard in Tableau to know more about artists:

Link to the dashboard: https://public.tableau.com/profile/pooja5954#!/vizhome/SpotifyEDA_16156686828640/FinalDashbaord?publish=yes

Here you can see a dashboard created in Tableau for the year 2020. Its purpose is to tell us more about the artists. From left to right and top to bottom:

  1. Artists ranked by popularity for the year 2020.
  2. For the selected artist, the songs by that artist ranked by popularity.
  3. In the middle, a line chart showing the popularity trend of that artist over time.
  4. At the bottom, four scatter plots of four audio features (Acousticness, Loudness, Danceability and Instrumentalness) against Popularity.

11

The dashboard is interactive: clicking on an artist filters all the information on the dashboard for that artist. We can easily see the total number of songs by that artist, the songs ranked by popularity, the artist's popularity trend, and the nature of the songs in terms of acousticness, loudness, danceability and instrumentalness against popularity. Below is a snapshot of the dashboard after clicking on the artist Workout Music, with all other information filtered accordingly.

12

Twitter Sentiment Analysis

This step uses a data scraping technique that extracts tweets from Twitter. To do this, we first created a Twitter developer account, set up a set of access tokens in R and made sure R was connected to the live Twitter server. The information sits in the tweet text, which is unstructured data, and extracting information from unstructured data requires a lot of text processing and cleaning. We extracted the tweets posted under the hashtag #thebeatles. The resulting dataset contained 3638 observations and 90 variables. From it we created a new data frame with a selection of variables: “user_id”, “status_id”, “created_at”, “screen_name”, “text”, “favorite_count”, “retweet_count”, “location” and “verified”. The next step was to clean the dataset by removing all punctuation, special characters and symbols.
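The cleaning step can be illustrated with a short sketch. The project did this in R; the Python version below is only a stand-in under stated assumptions. The function name `clean_tweet` is hypothetical, and removing URLs and lowercasing are extra choices made here that the project does not mention.

```python
import re


def clean_tweet(text):
    """Strip URLs, punctuation, special characters and symbols from one tweet."""
    # Assumption: links carry no sentiment, so they are dropped first.
    text = re.sub(r"https?://\S+", " ", text)
    # Keep only ASCII letters, digits and whitespace; everything else
    # (punctuation, '#', '@', emoji, other symbols) becomes a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse repeated whitespace and normalise case (lowercasing is an assumption).
    return re.sub(r"\s+", " ", text).strip().lower()


# Hypothetical tweet, invented for this example.
print(clean_tweet("Love #TheBeatles!!! 🎸 https://t.co/xyz"))
```

Note that this simple character filter also removes accented letters, so a real pipeline for multilingual tweets would need a more careful rule.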

Location of the users using #thebeatles

The dataset contains 216 unique locations. The United Kingdom ranks first among the top 20 locations, so users from the United Kingdom tweet the most about #thebeatles.
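Counting unique locations and ranking the most frequent ones is a simple frequency count. Here is a minimal Python sketch of that idea, not the R code used in the project. The helper name `top_locations` and the sample values are hypothetical.

```python
from collections import Counter


def top_locations(locations, n=20):
    """Return the n most frequent non-empty locations with their counts."""
    cleaned = [loc.strip() for loc in locations if loc and loc.strip()]
    return Counter(cleaned).most_common(n)


# Hypothetical "location" values, invented for this example.
sample = ["United Kingdom", "Liverpool", "United Kingdom", "", "Ireland", " United Kingdom "]

print(top_locations(sample, n=2))
print(len({loc.strip() for loc in sample if loc.strip()}), "unique locations")
```

Free-text profile locations are noisy, so in practice entries such as a city and its country would be counted separately unless they are normalised first.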

Here is the wordcloud to visualize the top 20 locations.

13

A further analysis of verified status shows that 98.99% of the users are unverified and only 1.01% are verified.

14

The Beatles

This is another wordcloud, built from the most common words in the tweets extracted under the hashtag #thebeatles.

15

Next we applied a sentiment lexicon (nrc) to compute the overall emotion scores. The output is a sentiment plot showing the totals for the different emotions.

16

Because the nrc output combines several emotion scores, it is hard to tell whether people hold more positive or more negative views on #thebeatles. To understand the overall positive or negative sentiment of the hashtag, we examined the dataset with a second lexicon (bing). The output shows more than twice as many negative (4781) as positive (2005) classifications.
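A positive/negative lexicon works by looking up each word and counting the labels. The sketch below shows the idea in Python rather than the R workflow used in the project. The six-word lexicon and the sample tweets are invented for illustration and are not taken from the real bing lexicon.

```python
# Tiny hypothetical lexicon; a real lexicon contains thousands of labelled words.
LEXICON = {
    "love": "positive",
    "great": "positive",
    "happy": "positive",
    "sad": "negative",
    "hate": "negative",
    "boring": "negative",
}


def polarity_counts(tweets):
    """Count how many words across all tweets are labelled positive or negative."""
    counts = {"positive": 0, "negative": 0}
    for tweet in tweets:
        for word in tweet.lower().split():
            label = LEXICON.get(word)
            if label is not None:
                counts[label] += 1
    return counts


# Hypothetical, already cleaned tweets.
tweets = [
    "love the beatles great band",
    "sad day i hate mondays",
    "yesterday is a great song",
]

print(polarity_counts(tweets))
```

Because every matched word is counted, the totals from this kind of count can exceed the number of tweets.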

17

According to this sentiment study, although The Beatles is the most popular artist between 1921 and 2020 in the Spotify API data, the hashtag does not receive strongly positive engagement on Twitter. This outcome contradicts the initial findings.

Statistical Analysis on SPSS

In this analysis we randomly selected three artists from the dataset (FrankSin, Pink Floyd and The Beatles) to study whether there is a statistically significant difference in popularity among them.

Descriptive Analysis

According to the table below, the sample sizes of the artists are uneven: FrankSin has the largest sample with 621 observations and Pink Floyd the smallest with only 263.

18

The Beatles have the highest mean popularity (46.38) and FrankSin has the lowest (28.07).

19 20 21

Normality test

The normality test checks whether the popularity feature is normally distributed.

Confidence level: 95% (α = 0.05)

Degrees of freedom: FrankSin = 621, Pink Floyd = 263, The Beatles = 412

Null hypothesis: the popularity values are normally distributed. A P-value above 0.05 means there is no significant evidence against normality.

Alternative hypothesis: the popularity values are not normally distributed. A P-value below or equal to 0.05 means the departure from normality is statistically significant.

According to the output of the normality test below, all three artists have a significance value below 0.05, so the null hypothesis is rejected and the data is not normally distributed.

22

Kruskal-Wallis Test

The data is not normally distributed, and we have more than two independent samples i.e. artists. Therefore, the Kruskal-Wallis hypothesis test is suitable for this circumstance.

23

The null hypothesis is rejected because the significance value is below 0.05, which indicates statistically significant differences in popularity across the three artists.
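To make the test concrete, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with the usual tie correction. It is not the SPSS procedure used in the project, and the sample values are invented. The P-value line uses the chi-square approximation and the closed form exp(-H/2), which only holds for exactly three groups (2 degrees of freedom).

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_three(group_a, group_b, group_c):
    """Return (H, p) for three independent samples."""
    groups = [group_a, group_b, group_c]
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    h /= correction

    # Chi-square survival function with 2 degrees of freedom (three groups only).
    p_value = math.exp(-h / 2)
    return h, p_value


# Hypothetical popularity scores for three artists.
h, p = kruskal_wallis_three([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 3), round(p, 4))
```

A P-value below 0.05 from such a test leads to the same decision described above: reject the hypothesis that the groups share the same popularity distribution.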

Correlation

This technique identifies whether two variables are correlated. The correlation score is 0.425, which indicates a weak-to-moderate positive correlation between popularity and year.
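As an illustration of how such a score is computed, here is a minimal Python sketch of the Pearson correlation coefficient. The project does not state which coefficient SPSS reported, so the choice of Pearson is an assumption, and the year and popularity values are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical release years and popularity scores.
years = [1960, 1970, 1980, 1990, 2000]
popularity = [20, 10, 40, 30, 50]

print(round(pearson_r(years, popularity), 3))
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.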

24

Conclusion

Overall, music has changed over time from 1921 to 2021 and will continue to change as people's preferences change. Data should also become more accurate from now on, thanks to the digital age and the ease of listening to songs on Spotify. Major changes began around 1950 and are still ongoing, as the Tableau animations of the audio features against popularity show. The Twitter analysis of the hashtag #thebeatles showed that most users tweeting about it are from the United Kingdom and that the sentiment of the tweets is mixed, which makes it hard to say clearly whether it is positive or negative. For this reason we calculated sentiment scores with two sentiment lexicons in R. The statistical analysis in SPSS showed that the popularity levels of the three artists are statistically different.

References

Nijkamp, R., 2018. Prediction of product success: explaining song popularity by audio features from Spotify data (Bachelor’s thesis, University of Twente).

Contributors

Karin Cheong | LinkedIn | GitHub

Pooja Chordiya | LinkedIn | GitHub