
Machine Learning Challenge

Machine Learning course 2021-2022, Computer Science Department.
Master's Degree in Applied Mathematics, Sapienza University of Rome.
Exam date: June 13th 2022

Dataset and task description:

The provided dataset is a modified noisy version of the original dataset described in [1].
This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal of the task is to predict the number of shares in social networks (popularity).

Here is a brief description of the dataset's features:

| index | name | description |
|---|---|---|
| 0 | url | URL of the article |
| 1 | timedelta | Days between the article publication and the dataset acquisition |
| 2 | n_tokens_title | Number of words in the title |
| 3 | n_tokens_content | Number of words in the content |
| 4 | n_unique_tokens | Rate of unique words in the content |
| 5 | n_non_stop_words | Rate of non-stop words in the content |
| 6 | n_non_stop_unique_tokens | Rate of unique non-stop words in the content |
| 7 | num_hrefs | Number of links |
| 8 | num_self_hrefs | Number of links to other articles published by Mashable |
| 9 | num_imgs | Number of images |
| 10 | num_videos | Number of videos |
| 11 | average_token_length | Average length of the words in the content |
| 12 | num_keywords | Number of keywords in the metadata |
| 13 | data_channel_is_lifestyle | Is data channel 'Lifestyle'? |
| 14 | data_channel_is_entertainment | Is data channel 'Entertainment'? |
| 15 | data_channel_is_bus | Is data channel 'Business'? |
| 16 | data_channel_is_socmed | Is data channel 'Social Media'? |
| 17 | data_channel_is_tech | Is data channel 'Tech'? |
| 18 | data_channel_is_world | Is data channel 'World'? |
| 19 | kw_min_min | Worst keyword (min. shares) |
| 20 | kw_max_min | Worst keyword (max. shares) |
| 21 | kw_avg_min | Worst keyword (avg. shares) |
| 22 | kw_min_max | Best keyword (min. shares) |
| 23 | kw_max_max | Best keyword (max. shares) |
| 24 | kw_avg_max | Best keyword (avg. shares) |
| 25 | kw_min_avg | Avg. keyword (min. shares) |
| 26 | kw_max_avg | Avg. keyword (max. shares) |
| 27 | kw_avg_avg | Avg. keyword (avg. shares) |
| 28 | self_reference_min_shares | Min. shares of referenced articles in Mashable |
| 29 | self_reference_max_shares | Max. shares of referenced articles in Mashable |
| 30 | self_reference_avg_sharess | Avg. shares of referenced articles in Mashable |
| 31 | weekday_is_monday | Was the article published on a Monday? |
| 32 | weekday_is_tuesday | Was the article published on a Tuesday? |
| 33 | weekday_is_wednesday | Was the article published on a Wednesday? |
| 34 | weekday_is_thursday | Was the article published on a Thursday? |
| 35 | weekday_is_friday | Was the article published on a Friday? |
| 36 | weekday_is_saturday | Was the article published on a Saturday? |
| 37 | weekday_is_sunday | Was the article published on a Sunday? |
| 38 | is_weekend | Was the article published on the weekend? |
| 39 | LDA_00 | Closeness to LDA topic 0 |
| 40 | LDA_01 | Closeness to LDA topic 1 |
| 41 | LDA_02 | Closeness to LDA topic 2 |
| 42 | LDA_03 | Closeness to LDA topic 3 |
| 43 | LDA_04 | Closeness to LDA topic 4 |
| 44 | global_subjectivity | Text subjectivity |
| 45 | global_sentiment_polarity | Text sentiment polarity |
| 46 | global_rate_positive_words | Rate of positive words in the content |
| 47 | global_rate_negative_words | Rate of negative words in the content |
| 48 | rate_positive_words | Rate of positive words among non-neutral tokens |
| 49 | rate_negative_words | Rate of negative words among non-neutral tokens |
| 50 | avg_positive_polarity | Avg. polarity of positive words |
| 51 | min_positive_polarity | Min. polarity of positive words |
| 52 | max_positive_polarity | Max. polarity of positive words |
| 53 | avg_negative_polarity | Avg. polarity of negative words |
| 54 | min_negative_polarity | Min. polarity of negative words |
| 55 | max_negative_polarity | Max. polarity of negative words |
| 56 | title_subjectivity | Title subjectivity |
| 57 | title_sentiment_polarity | Title polarity |
| 58 | abs_title_subjectivity | Absolute subjectivity level |
| 59 | abs_title_sentiment_polarity | Absolute polarity level |
| 60 | shares | Number of shares (target) |

Pre-processing and dataset analysis

The dataset is loaded and cleaned (see the cleaned version). In this notebook, feature importance analysis is then performed and redundant features are removed. The data is afterwards scaled to perform model selection. For this phase the following steps are carried out:

  1. The target feature is discretized (the number of classes must be $\geq 5$).

  2. The dataset is split to perform cross-validation.

  3. The following models are trained:
     - Decision Trees
     - Support Vector Machines
     - Random Forests
     - Multilayer Perceptrons (MLPs)

  4. Hyperparameter tuning is performed and discussed.



The scaled dataset looks like this:

(Figure: preview of the scaled dataset)
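A minimal sketch of this preprocessing step is shown below. It is only illustrative: the file name, the dropped columns and the choice of StandardScaler are assumptions, not necessarily what the notebook does.

```python
# Illustrative sketch of the scaling step (file name and column choices are assumptions).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("online_news_cleaned.csv")   # hypothetical name of the cleaned dataset

X = df.drop(columns=["url", "shares"])        # keep only the numeric predictors
y = df["shares"]                              # target: number of shares

scaler = StandardScaler()                     # zero mean, unit variance per feature
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```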

Discretizing the target feature

Before performing feature importance analysis we must define the target-feature classes. Using KMeans we can visualize how the column "shares" could be partitioned into clusters; I chose to visualize 5 clusters.
If we divided the target column into 5 classes based on the KMeans algorithm, we would obtain highly unbalanced classes with cardinalities $37166$, $6$, $29$, $256$, $2183$.
We can visualize how the points are partitioned by KMeans:

(Figure: KMeans clustering of the target column)
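A minimal sketch of this clustering step, reusing the target column `y` from the previous snippet (the random seed is arbitrary):

```python
# Cluster the 1-D "shares" column with KMeans to inspect possible class boundaries.
import numpy as np
from sklearn.cluster import KMeans

shares = y.to_numpy().reshape(-1, 1)          # KMeans expects a 2-D array

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(shares)
labels = kmeans.labels_

print(np.bincount(labels))                    # cluster sizes (highly unbalanced here)
```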

The 5 classes found with KMeans are highly unbalanced. Increasing the number of clusters would not yield balanced classes, since only the points with the highest values would be split further. Oversampling the minority classes would require too many synthetic points, which would take us too far from the real data. The remaining option is to undersample the majority class and then create 5 classes using a quantile split.
The cardinalities of the new 5 classes are: $4482$, $4481$, $4114$, $4875$, $4522$.
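One possible way to implement this step is sketched below, reusing `df` and `labels` from the previous snippets; the undersampling fraction and the new column name `share_class` are assumptions.

```python
# Undersample the dominant KMeans cluster, then split "shares" into 5 quantile classes.
import numpy as np
import pandas as pd

majority = np.argmax(np.bincount(labels))            # index of the largest cluster
is_major = labels == majority

rng = np.random.default_rng(0)
keep = ~is_major | (rng.random(len(df)) < 0.5)       # keep roughly half of the majority cluster (assumed ratio)

df_bal = df.loc[keep].copy()
df_bal["share_class"] = pd.qcut(df_bal["shares"], q=5, labels=False)   # classes 0..4 of roughly equal size
print(df_bal["share_class"].value_counts().sort_index())
```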

Feature selection

First the data are split into a training set and a testing set, then a random forest is fitted in order to select features based on the classifier's feature importances.
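A sketch of this selection step, assuming the variables from the previous snippets; using the mean importance as the threshold is an assumption.

```python
# Split the data, fit a random forest and keep the most important features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_bal = X_scaled.loc[df_bal.index]
y_cls = df_bal["share_class"]

X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_cls, test_size=0.5, random_state=0, stratify=y_cls)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_bal.columns)
selected = importances[importances > importances.mean()].index   # assumed threshold: the mean importance
print(importances.sort_values(ascending=False).head(10))
```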

Model selection

For hyperparameter tuning I decided to use GridSearchCV. First the scaled dataset is split into a training set and a testing set. The best hyperparameters for a specific algorithm are searched using half of the data (the training set). The classifier with the best performance (highest accuracy) is then used on the whole dataset and evaluated using cross-validation.
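The protocol can be sketched as follows; the estimator and grid below are only placeholders, and the variables come from the previous snippets.

```python
# Search hyperparameters on half of the data, then cross-validate the best model on all of it.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

param_grid = {"min_samples_leaf": [2, 5, 10, 20, 50]}        # placeholder grid

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X_train[selected], y_train)                        # half of the data
print(search.best_params_, search.best_score_)

scores = cross_val_score(search.best_estimator_, X_bal[selected], y_cls,
                         cv=5, scoring="accuracy")
print(scores.mean())
```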

SVMs

For SVMs the following hyperparameter grids were searched:

hyperparameters_1 = [{'kernel': ['linear'], 'C': [1,10,100,1000],'max_iter':[1e6],'tol':[0.01]}]

hyperparameters_2 = [{'kernel': ['poly'], 'C': [1,10,100,1000],'degree':[10,12],'gamma':[0.1,0.05,0.01]}]

hyperparameters_3 = [{'kernel': ['sigmoid'],'C': [1,10,100,1000],'gamma':[0.1,0.001,0.0001]}]

hyperparameters_4 = [{'kernel': ['rbf'],'C': [1,10,100,1000],'gamma':[0.1,0.001,0.0001]}]
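GridSearchCV accepts a list of parameter grids, so the four grids above can be searched in a single call (or one call per kernel, as in the experiments). A sketch, reusing the variables from the previous snippets; note that recent scikit-learn versions require `max_iter` to be an integer.

```python
# Combine the four SVM grids and search them with GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Note: in recent scikit-learn, 'max_iter' in hyperparameters_1 should be an int (e.g. 1_000_000).
svm_grid = (hyperparameters_1 + hyperparameters_2 +
            hyperparameters_3 + hyperparameters_4)

svm_search = GridSearchCV(SVC(), svm_grid, scoring="accuracy", cv=5, n_jobs=-1)
svm_search.fit(X_train[selected], y_train)
print(svm_search.best_params_, svm_search.best_score_)
```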

SVMs with a linear kernel are the slowest to train, and their accuracy (with the best hyperparameter, $C=100$) is around 20%. I had to set a limit on the maximum number of iterations and a looser tolerance than the default in order to give the algorithm a stopping condition.
SVMs with a polynomial kernel are also very slow when the degree is low, since a low-degree polynomial kernel behaves similarly to the linear one. With higher degrees (10, 12, ...) the search converges but does not reach a high accuracy (20% in the best case).
SVMs with sigmoid and RBF kernels are much faster and reach a higher accuracy (around 23%). The best hyperparameters for SVMs are:
sigmoid kernel, gamma = 0.1, C = 1, for an accuracy of 23.79%.

Multilayer Perceptrons

The following hyperparameter grid was searched (using RandomizedSearchCV):

hyperparameters = [{'activation': ['logistic','relu','tanh'],'solver':['sgd','adam','lbfgs'],
                    'max_iter':[10000],'alpha':[1e-5,1e-4,1e-3,1e-2],'learning_rate_init':[0.01,0.005,0.001],
                   'hidden_layer_sizes':[(20,10),(15,7),(20),(15),(25,15,7),(10)],
                   'random_state':[1,2,3,4]}]
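A sketch of how this grid might be fed to RandomizedSearchCV; the number of sampled combinations (`n_iter`) is an assumption.

```python
# Sample a fixed number of hyperparameter combinations from the grid above.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

mlp_search = RandomizedSearchCV(MLPClassifier(), hyperparameters,
                                n_iter=50, scoring="accuracy",
                                cv=5, random_state=0, n_jobs=-1)
mlp_search.fit(X_train[selected], y_train)
print(mlp_search.best_params_, mlp_search.best_score_)
```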

Multilayer Perceptrons do not perform better than SVMs: their accuracy is still low. In this case, since the number of hyperparameter combinations is quite high, I chose to use RandomizedSearchCV.
The best hyperparameters are:
solver = adam, random_state = 4 (the 4th weight initialization), learning_rate_init = 0.001, 2 hidden layers with sizes 15 and 7, alpha = 1e-05, activation function = tanh.
The overall accuracy is about 22%.
I increased the maximum number of iterations to 10000 (the default value is 200), but lbfgs still did not reach convergence in most cases.

Decision Trees

The following hyperparameter grid was searched:

hyperparameters = [{'max_features': [None,'sqrt','log2'],'min_samples_leaf':[2,5,10,20,50],
                    'criterion':['gini','entropy'],'class_weight':['balanced',None]}]

Decision Trees perform much better than SVMs and MLPs: they are both faster and more accurate. Accuracy reaches 28.8% with min_samples_leaf = 50. This setting means that a split is considered only if it leaves at least 50 training samples in each of the left and right branches, which reduces the risk of overfitting.

Random Forests

The following hyperparameter grid was searched:

hyperparameters = [{'max_features': [None,'sqrt','log2'],'min_samples_leaf':[2,5,10,20,50],
                    'criterion':['gini','entropy'],'class_weight':['balanced',None],'n_estimators':[10,50,100]}]

Random Forests achieve the best performance: the best accuracy estimate is 33.15% (for an overall accuracy of 30%), while the worst case (worst accuracy estimate) is still better than every experiment with SVMs and MLPs (around 27% accuracy).
The best hyperparameters are:
no class weighting (class_weight = None), Gini splitting criterion, number of features considered at each split = sqrt(total number of features), min_samples_leaf = 20 and n_estimators = 50.
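A sketch of the final evaluation with these hyperparameters, reusing the variables from the previous snippets:

```python
# Refit the random forest with the reported best hyperparameters and cross-validate it.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

best_rf = RandomForestClassifier(class_weight=None, criterion="gini",
                                 max_features="sqrt", min_samples_leaf=20,
                                 n_estimators=50, random_state=0)

scores = cross_val_score(best_rf, X_bal[selected], y_cls, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```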


We can get a better idea of which hyperparameters matter for random forests on these data: fix a single hyperparameter value and look at the average accuracy obtained while all the other hyperparameters vary.
For each hyperparameter value (for example criterion = 'gini') we compute the mean score and standard deviation over all the experiments where that value was used, to understand how it affected the estimated accuracy (the accuracy computed on half of the data).
For each hyperparameter value we can then plot an error bar centered at the mean:

(Figure: error bars of mean accuracy per hyperparameter value)
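This analysis can be reproduced from GridSearchCV's cv_results_. The sketch below fits a grid search on the random-forest grid above (the name `rf_search` is hypothetical) and shows the computation for n_estimators:

```python
# For one hyperparameter, average the CV scores over all other hyperparameter combinations
# and plot mean ± standard deviation as an error bar.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(RandomForestClassifier(random_state=0), hyperparameters,
                         scoring="accuracy", cv=5, n_jobs=-1)
rf_search.fit(X_train[selected], y_train)

results = pd.DataFrame(rf_search.cv_results_)
grouped = results.groupby("param_n_estimators")["mean_test_score"].agg(["mean", "std"])

plt.errorbar(grouped.index.astype(str), grouped["mean"], yerr=grouped["std"],
             fmt="o", capsize=4)
plt.xlabel("n_estimators")
plt.ylabel("mean CV accuracy")
plt.show()
```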

As reported previously, the best hyperparameters found for random forests are:
class_weight = None, criterion = 'gini', max_features = 'sqrt', min_samples_leaf = 20, n_estimators = 50

The previous plot shows that random forests with 100 trees may perform better than random forests with 50 trees, while random forests with only 10 trees perform worse. Generally speaking, random forests with few trees tend to overfit, so a higher number of estimators is preferred.
The same argument holds for the hyperparameter min_samples_leaf: larger leaves reduce the risk of overfitting. Although the best single configuration used min_samples_leaf = 20, the plot shows that on average min_samples_leaf = 50 can lead to better performance.

In both cases (min_samples_leaf = 50 and n_estimators = 100) the standard deviation is smaller than for the other hyperparameter values. This means that for these two settings the obtained accuracy is more likely to be close to the mean shown in the plot: the chances of poor performance are lower.

All other hyperparameters (class_weight, criterion and max_features) do not seem to be very decisive.


[1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. In Proceedings of the 17th Portuguese Conference on Artificial Intelligence (EPIA 2015), Coimbra, Portugal, September 2015.
