Building a song recommendation system using the Million Song Dataset

Aki Kapoor
May 1, 2021


This project focuses on building a song recommendation system, using several techniques and machine learning models, after first analyzing the dataset.

Technology used: machine learning, cloud, Kafka, MySQL, data visualization.

Platform used: Linux, Scala, Python, Pig, Hive, Tableau, Power BI.


The dataset used in this study is the Million Song Dataset, which contains audio features and metadata. It has four parts: audio, genre, main and tasteprofile.

Audio contains attributes, features and statistics. Attributes holds 13 attribute files in .csv format. Features holds 13 directories, each containing 8 part files in .csv.gz format. Statistics contains a sample properties file in .csv.gz format.

Genre contains three files in .tsv format: msd-MASD-styleAssignment, msd-MAGD-genreAssignment and msd-top-MASD-genreAssignment. They hold genre labels such as Rap, Jazz, etc.

Main contains a summary directory with two files, analytics and metadata, both in .csv.gz format.

Tasteprofile contains mismatches and triplets. Mismatches contains two .txt files, sid_matches_manually_accepted and sid_mismatches. Triplets contains 8 part files in .tsv.gz format.

The audio data is stored as string and float types. Genre data is stored as strings. The triplets hold integer and string data, and the mismatch files hold string data.

On repartitioning: Spark does not split files that are stored in a compressed format. Our files are gzip-compressed, so Spark cannot split them. One way around this is to uncompress each file with gunzip to plain .csv and achieve parallelism that way. I ran with 4 executors and 2 cores per executor, which is 8 cores in total. So the number of partitions is 32, i.e. four per core.
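The arithmetic behind the partition count can be written out as a small sketch. It makes no Spark calls; the factor of four partitions per core is an assumption inferred from the numbers above (8 cores, 32 partitions), and `partitions_per_core` is a hypothetical parameter name.

```python
# Illustrative arithmetic only; nothing here talks to a Spark cluster.

def total_cores(num_executors: int, cores_per_executor: int) -> int:
    """Total number of cores available across all executors."""
    return num_executors * cores_per_executor


def partition_count(num_executors: int, cores_per_executor: int,
                    partitions_per_core: int = 4) -> int:
    """Partition count as a multiple of the total core count.

    partitions_per_core=4 is an assumed rule of thumb, chosen because
    it reproduces the 32 partitions mentioned in the text.
    """
    return total_cores(num_executors, cores_per_executor) * partitions_per_core


cores = total_cores(4, 2)
partitions = partition_count(4, 2)
print(cores, partitions)
```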

[Table: audio files with counts]

Tasteprofile contains two files in .txt format: sid_matches_manually_accepted.txt, which contains the songs with correct matches, and sid_mismatches.txt, which contains the songs with incorrect matches.

To remove the mismatched songs, the following was done:

1. With a left anti join, the song_id values of the mismatched songs were removed.

2. The triplets data was used to find the correct song_id.

After removing the mismatched songs, 45795111 records remained.
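The left anti join in step 1 keeps only the rows on the left whose key has no partner on the right. A minimal plain-Python sketch of that idea follows; the rows and the helper name `left_anti_join` are invented for illustration and stand in for the Spark join used in the project.

```python
def left_anti_join(left_rows, right_rows, key):
    """Keep rows of left_rows whose key value does not appear in right_rows."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Toy data, invented for illustration.
triplets = [
    {"user_id": "u1", "song_id": "SO_A", "plays": 3},
    {"user_id": "u1", "song_id": "SO_B", "plays": 1},
    {"user_id": "u2", "song_id": "SO_C", "plays": 7},
]
mismatches = [{"song_id": "SO_B"}]

# Triplets whose song_id is listed as a mismatch are dropped.
clean_triplets = left_anti_join(triplets, mismatches, "song_id")
print(clean_triplets)
```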

Next, the attribute names and types were extracted from the audio attribute files to obtain the schema of each audio feature set. Each schema was then mapped onto the corresponding feature file. The schemas were collected in a dictionary so that they could be reused later.
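As a rough sketch of this step, the snippet below reads "name,type" lines into a schema, stores it in a dictionary, and uses it to cast one feature row. The type names in `TYPE_MAP`, the column names and the dictionary key are all assumptions made for the example, not values taken from the dataset.

```python
import csv
import io

# Assumed mapping from attribute type names to Python types.
TYPE_MAP = {"real": float, "numeric": float, "string": str}


def read_schema(attribute_csv_text):
    """Return a list of (column_name, python_type) from 'name,type' lines."""
    schema = []
    for row in csv.reader(io.StringIO(attribute_csv_text)):
        if not row:  # skip blank lines
            continue
        name, type_name = row
        schema.append((name.strip(), TYPE_MAP[type_name.strip().lower()]))
    return schema


def parse_row(line, schema):
    """Cast one comma-separated feature line according to the schema."""
    values = next(csv.reader([line]))
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


# Hypothetical attribute file content and feature row.
attributes_text = "feature_1,real\nfeature_2,real\ntrack_id,string\n"
schema = read_schema(attributes_text)
schemas = {"example_feature_set": schema}  # one entry per feature set
row = parse_row("0.5,1.25,TR0001", schemas["example_feature_set"])
print(row)
```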


The small dataset I chose is method of moments, as it has only 994623 records. Features with a correlation coefficient above 0.8 are treated as correlated in our case. method_of_moment_overall_standarddeviation2 and method_of_moment_overall_standarddeviation3 had a correlation of 0.86, method_of_moment_overall_standarddeviation3 and method_of_moment_overall_standarddeviation4 had a correlation of 0.85, and method_of_moment_overall_standarddeviation4 and method_of_moment_overall_standarddeviation5 had a correlation of 0.94. These features are renamed as standard_deviation2, standard_deviation3, standard_deviation4 and standard_deviation5.

The correlation between average_2 and average_3 is 0.90, and between average_4 and average_5 it is 0.98.

Average_1, average_2, average_3 and average_5 have positive values, while average_4 has a negative value.
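The correlation check can be sketched in plain Python as follows. The columns are toy values invented for the example, and comparing the absolute value of the Pearson coefficient against 0.8 is an assumption; the text only says "more than 0.8".

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs


# Toy columns, invented for illustration.
columns = {
    "std_2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "std_3": [1.1, 1.9, 3.2, 3.9, 5.1],
    "avg_1": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = correlated_pairs(columns)
print(pairs)
```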

To load MAGD, the schema is defined and the data is then loaded from the file msd-MAGD-genreAssignment.tsv.

For the visualisation, we look at the columns track_id and genre. Since both variables are qualitative, we used a bar plot of the track_id count grouped by genre.

A merge is done so that every song has a label. The genre and audio feature datasets are merged on the column track_id.
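A minimal sketch of the merge and of the per-genre counts behind the bar plot, in plain Python. The rows are toy data, and using an inner join (tracks without a genre label are dropped) is an assumption about how the merge was done.

```python
from collections import Counter


def inner_join_on_track_id(features, genres):
    """Attach a genre to every feature row that has a matching track_id."""
    genre_by_track = {row["track_id"]: row["genre"] for row in genres}
    return [
        {**row, "genre": genre_by_track[row["track_id"]]}
        for row in features
        if row["track_id"] in genre_by_track
    ]


# Toy data, invented for illustration.
features = [
    {"track_id": "TR1", "standard_deviation2": 0.4},
    {"track_id": "TR2", "standard_deviation2": 0.9},
    {"track_id": "TR3", "standard_deviation2": 0.1},
]
genres = [
    {"track_id": "TR1", "genre": "Rap"},
    {"track_id": "TR2", "genre": "Jazz"},
    {"track_id": "TR9", "genre": "Rap"},
]

labelled = inner_join_on_track_id(features, genres)
genre_counts = Counter(row["genre"] for row in labelled)  # heights of the bars
print(labelled, genre_counts)
```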

Three classification models are used:

1. Gradient boost

2. Decision trees

3. Random forest

These models are used because they have high predictive accuracy and are easy to iterate on. Moreover, the features are highly correlated, which tree-based models tolerate better than models that assume independent features.

Next, the genre column is converted into a binary label: the value 1 is given to a Rap track and the value 0 to a non-Rap track.

The class balance of the binary label comes out at 0.05228, so Rap is a small minority class.
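The binary label and its class balance can be sketched as below. The genre list is toy data, and reading "class balance" as the fraction of positive labels is an assumption; the text does not say whether the figure is a fraction or a ratio of positives to negatives.

```python
def to_binary_label(genre, positive="Rap"):
    """1 for the positive genre, 0 for everything else."""
    return 1 if genre == positive else 0


# Toy data: 20 tracks, one of which is Rap.
genres = ["Rap"] + ["Pop_Rock"] * 12 + ["Jazz"] * 7
labels = [to_binary_label(g) for g in genres]

# Assumed definition of class balance: share of positive labels.
positive_fraction = sum(labels) / len(labels)
print(positive_fraction)
```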

In Spark, the models are fitted on the training data and then used to predict the test data. Oversampling is done to improve performance, as it lets us increase the amount of training data for the minority class.

The F1 score and AUROC improved for random forest after oversampling.
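One simple form of oversampling is to resample the minority class with replacement until both classes are the same size. The sketch below shows that variant in plain Python; the exact oversampling scheme used in the project is not stated, so the 1:1 target and the helper name `oversample_minority` are assumptions. It would be applied to the training split only, so that the test data keeps its original class balance.

```python
import random


def oversample_minority(rows, label_key="label", seed=42):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:  # nothing to resample from
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy training data: 19 non-Rap rows and 1 Rap row.
train_rows = [{"track_id": f"TR{i}", "label": 0} for i in range(19)]
train_rows.append({"track_id": "TR_rap", "label": 1})

balanced_rows = oversample_minority(train_rows)
print(len(balanced_rows), sum(r["label"] for r in balanced_rows))
```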


At first, precision, recall and accuracy were used to evaluate performance. Since they showed little difference between the models, AUROC and F1 score were used instead.

Before cross-validation, the gradient boosting model performed best. After cross-validation, random forest performed best compared with gradient boosting and the decision tree.

One algorithm capable of both multiclass and binary classification is random forest: it deals well with class imbalance, is handy and is easy to understand.
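For reference, the threshold-based metrics can be computed from the confusion counts as in the sketch below. The example uses toy labels with a 5% positive class and a model that always predicts the negative class, to illustrate one general reason accuracy can be uninformative on imbalanced data; it is not a reconstruction of the project's actual results.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy labels: 5 positives out of 100, and a model that never predicts 1.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

Here accuracy is high even though no positive track is ever found, while F1 drops to zero.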
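Cross-validation splits the training rows into k folds and lets each fold serve once as the validation set. The index-level sketch below shows the splitting only; k = 5 is an assumption, since the number of folds used is not stated, and `k_fold_indices` is a hypothetical helper rather than a Spark API.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, validation_idx in folds:
    print(validation_idx, len(train_idx))
```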

The genre column is converted into an integer index with the help of StringIndexer, and the vector assembler is used to join all the features into a single column. After this, given the amount of data, a larger share of it is used as training data.

Performance finally improved after including these columns.

For random forest in particular, the performance increased.
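What these two steps do can be imitated in plain Python, as a sketch rather than the Spark implementation. The indexing below orders labels by descending frequency, which matches StringIndexer's default ordering; the alphabetical tie-break, the helper names and the toy rows are assumptions for the example.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def assemble(row, feature_names):
    """Collect the named feature values into a single list (one 'vector')."""
    return [row[name] for name in feature_names]


# Toy data, invented for illustration.
genres = ["Pop_Rock", "Rap", "Pop_Rock", "Jazz", "Rap", "Pop_Rock"]
genre_index = fit_string_index(genres)

row = {"standard_deviation2": 0.1, "standard_deviation3": 0.2, "genre": "Rap"}
features = assemble(row, ["standard_deviation2", "standard_deviation3"])
label = genre_index[row["genre"]]
print(genre_index, features, label)
```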


The number of unique songs in the dataset is 378310 while the number of unique users is 1019318.

To calculate the number of different songs that the most active user has played, I first found the most active user and then counted the distinct songs played by that user. This comes to approximately 5% of the unique songs in the dataset.
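These counts can be sketched on toy triplets as follows. Defining the "most active" user as the one with the highest total play count is an assumption, and the data is invented for illustration.

```python
from collections import defaultdict

# Toy (user_id, song_id, play_count) triplets.
triplets = [
    ("u1", "s1", 10), ("u1", "s2", 1),
    ("u2", "s1", 2),
    ("u3", "s3", 1), ("u3", "s4", 1),
]

songs_by_user = defaultdict(set)
plays_by_user = defaultdict(int)
for user, song, plays in triplets:
    songs_by_user[user].add(song)
    plays_by_user[user] += plays

unique_songs = {song for _, song, _ in triplets}
unique_users = set(songs_by_user)

# Assumed definition: the most active user has the most total plays.
most_active_user = max(plays_by_user, key=plays_by_user.get)
distinct_songs = len(songs_by_user[most_active_user])
share_of_catalogue = distinct_songs / len(unique_songs)
print(most_active_user, distinct_songs, share_of_catalogue)
```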


Since the tail of the distribution extends to the right, its shape is right-skewed.

The collaborative filtering, in this case, is restricted on both sides: user-based, i.e. ignoring users who rarely listen to songs, and song-based, i.e. ignoring songs that are rarely played. Songs that have been played fewer than N times and users who have listened to fewer than M songs in total are removed, based on percentile rank scores: everything below the 25th percentile is removed. After this removal, the number of songs left in the dataset is 744910.

Before creating the recommendation system, the updated dataset is joined with the tasteprofile dataset, giving 42293405 rows. This is almost 93% of the original data, so the recommendation system is built on it. A condition for building the recommendation system is that the test set should hold at least 25% of the data. I used a 70:30 ratio of training data to test data.

In order to apply ALS (alternating least squares), song_id and user_id, which are initially strings, are converted into numeric values. ALS is fitted on the training data and predictions are made on the test data.
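The data preparation before ALS can be sketched in plain Python: filter by the thresholds N and M, map the string ids to integers, and split 70:30. The thresholds are passed in directly rather than derived from percentiles, the order of the two filters is an assumption, and the ALS fit itself is not shown.

```python
import random
from collections import Counter


def filter_triplets(triplets, min_song_plays, min_user_songs):
    """Drop rarely played songs, then users with too few remaining songs."""
    song_plays = Counter()
    for _, song, plays in triplets:
        song_plays[song] += plays
    kept = [t for t in triplets if song_plays[t[1]] >= min_song_plays]
    # Each (user, song) pair appears once, so rows per user = songs per user.
    user_songs = Counter(user for user, _, _ in kept)
    return [t for t in kept if user_songs[t[0]] >= min_user_songs]


def index_ids(values):
    """Map each distinct string id to an integer, in sorted order."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (user_id, song_id, play_count) triplets.
triplets = [
    ("u1", "s1", 5), ("u1", "s2", 3),
    ("u2", "s1", 4), ("u2", "s2", 1),
    ("u3", "s3", 1), ("u3", "s1", 2),
]
filtered = filter_triplets(triplets, min_song_plays=3, min_user_songs=2)
user_index = index_ids(u for u, _, _ in filtered)
song_index = index_ids(s for _, s, _ in filtered)
numeric = [(user_index[u], song_index[s], plays) for u, s, plays in filtered]
train, test = train_test_split(numeric)
print(numeric, len(train), len(test))
```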

When the recommendations are compared with the songs users actually played, almost no matches are found, which means the model is performing poorly. Precision at 5 is 4.4%, NDCG at 10 is 4.5% and MAP (mean average precision) is 0.01.

Since precision at 5 is only 4.4%, the model is not able to predict songs that match user preferences.
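For one user, these ranking metrics can be computed as in the sketch below, using the common binary-relevance definitions (a recommended song is either in the user's played set or not). The song ids are toy values; MAP is the mean of the per-user average precision over all users.

```python
from math import log2


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain relative to an ideal ranking."""
    dcg = sum(1.0 / log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def average_precision(recommended, relevant):
    """Mean of the precision values at each rank where a hit occurs."""
    hits = 0
    total = 0.0
    for i, item in enumerate(recommended):
        if item in relevant:
            hits += 1
            total += hits / (i + 1)
    return total / len(relevant) if relevant else 0.0


# Toy example for a single user.
recommended = ["a", "b", "c", "d", "e"]  # ranked recommendations
relevant = {"a", "c", "x"}               # songs the user actually played

p5 = precision_at_k(recommended, relevant, 5)
ndcg5 = ndcg_at_k(recommended, relevant, 5)
ap = average_precision(recommended, relevant)
print(p5, ndcg5, ap)
```

All three depend on the order of the recommended list, so the same set of songs in a different order gives different NDCG and average precision.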

In general, a limitation is that MAP and NDCG depend on the ordering of the recommendations. A hybrid recommendation method can be used as an alternative: it helps with scaling and with sparse data, and it shows how features and attributes can be combined according to real-world industry requirements.



Aki Kapoor

Master's in Applied Data Science, University of Canterbury, New Zealand. Data scientist who loves to play with data and make sense of it.