Machine Learning course 2021-2022, Computer Science Department.
Master's Degree in Applied Mathematics,
Sapienza University of Rome.
Exam date: June 13th 2022
The provided dataset is a modified, noisy version of the original dataset described in [1].
This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal of the task is to predict the number of shares in social networks (popularity).
Here is a brief description of the dataset's features:
| index | name | description |
|---|---|---|
| 0 | url | URL of the article |
| 1 | timedelta | Days between the article publication and the dataset acquisition |
| 2 | n_tokens_title | Number of words in the title |
| 3 | n_tokens_content | Number of words in the content |
| 4 | n_unique_tokens | Rate of unique words in the content |
| 5 | n_non_stop_words | Rate of non-stop words in the content |
| 6 | n_non_stop_unique_tokens | Rate of unique non-stop words in content |
| 7 | num_hrefs | Number of links |
| 8 | num_self_hrefs | Number of links to other articles published by Mashable |
| 9 | num_imgs | Number of images |
| 10 | num_videos | Number of videos |
| 11 | average_token_length | Average length of the words in the content |
| 12 | num_keywords | Number of keywords in the metadata |
| 13 | data_channel_is_lifestyle | Is data channel 'Lifestyle'? |
| 14 | data_channel_is_entertainment | Is data channel 'Entertainment'? |
| 15 | data_channel_is_bus | Is data channel 'Business'? |
| 16 | data_channel_is_socmed | Is data channel 'Social Media'? |
| 17 | data_channel_is_tech | Is data channel 'Tech'? |
| 18 | data_channel_is_world | Is data channel 'World'? |
| 19 | kw_min_min | Worst keyword (min. shares) |
| 20 | kw_max_min | Worst keyword (max. shares) |
| 21 | kw_avg_min | Worst keyword (avg. shares) |
| 22 | kw_min_max | Best keyword (min. shares) |
| 23 | kw_max_max | Best keyword (max. shares) |
| 24 | kw_avg_max | Best keyword (avg. shares) |
| 25 | kw_min_avg | Avg. keyword (min. shares) |
| 26 | kw_max_avg | Avg. keyword (max. shares) |
| 27 | kw_avg_avg | Avg. keyword (avg. shares) |
| 28 | self_reference_min_shares | Min. shares of referenced articles in Mashable |
| 29 | self_reference_max_shares | Max. shares of referenced articles in Mashable |
| 30 | self_reference_avg_sharess | Avg. shares of referenced articles in Mashable |
| 31 | weekday_is_monday | Was the article published on a Monday? |
| 32 | weekday_is_tuesday | Was the article published on a Tuesday? |
| 33 | weekday_is_wednesday | Was the article published on a Wednesday? |
| 34 | weekday_is_thursday | Was the article published on a Thursday? |
| 35 | weekday_is_friday | Was the article published on a Friday? |
| 36 | weekday_is_saturday | Was the article published on a Saturday? |
| 37 | weekday_is_sunday | Was the article published on a Sunday? |
| 38 | is_weekend | Was the article published on the weekend? |
| 39 | LDA_00 | Closeness to LDA topic 0 |
| 40 | LDA_01 | Closeness to LDA topic 1 |
| 41 | LDA_02 | Closeness to LDA topic 2 |
| 42 | LDA_03 | Closeness to LDA topic 3 |
| 43 | LDA_04 | Closeness to LDA topic 4 |
| 44 | global_subjectivity | Text subjectivity |
| 45 | global_sentiment_polarity | Text sentiment polarity |
| 46 | global_rate_positive_words | Rate of positive words in the content |
| 47 | global_rate_negative_words | Rate of negative words in the content |
| 48 | rate_positive_words | Rate of positive words among non-neutral tokens |
| 49 | rate_negative_words | Rate of negative words among non-neutral tokens |
| 50 | avg_positive_polarity | Avg. polarity of positive words |
| 51 | min_positive_polarity | Min. polarity of positive words |
| 52 | max_positive_polarity | Max. polarity of positive words |
| 53 | avg_negative_polarity | Avg. polarity of negative words |
| 54 | min_negative_polarity | Min. polarity of negative words |
| 55 | max_negative_polarity | Max. polarity of negative words |
| 56 | title_subjectivity | Title subjectivity |
| 57 | title_sentiment_polarity | Title polarity |
| 58 | abs_title_subjectivity | Absolute subjectivity level |
| 59 | abs_title_sentiment_polarity | Absolute polarity level |
| 60 | shares | Number of shares (target) |
The dataset is loaded and cleaned (see the cleaned version). In this notebook a feature importance analysis is then performed to remove redundant features, and the data is afterwards scaled to perform model selection (a sketch of the scaling step follows the list below). This phase consists of the following steps:
- The target feature is discretized (the number of classes must be $\geq$ 5).
- The dataset is split to perform cross validation.
- The following models are trained:
  - Decision Trees
  - Support Vector Machines
  - Random Forests
  - MLPs (Multilayer Perceptrons)
- Hyper-parameter tuning is performed and discussed.
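A minimal sketch of the scaling step, assuming the cleaned data is in a pandas DataFrame `df` (the variable names are assumptions, not the notebook's actual ones, and non-numeric columns such as `url` are assumed to have been dropped during cleaning):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 'shares' (the target) is separated from the predictive features.
X = df.drop(columns=['shares'])
y = df['shares']

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```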
The scaled dataset looks like this:
Before performing feature importance analysis we must define what the target-feature classes are. Using KMeans we can visualize how the column "shares" could be partitioned into clusters. I chose to visualize 5 clusters.
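A minimal sketch of how this clustering might be computed, assuming `y` holds the "shares" column from the scaling step above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the 1-D target column into 5 groups.
shares = y.to_numpy().reshape(-1, 1)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(shares)

# Cardinality of each cluster (expected to be highly unbalanced).
labels, counts = np.unique(kmeans.labels_, return_counts=True)
print(dict(zip(labels, counts)))
```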
If we divided the target column into 5 classes based on the KMeans algorithm, we would obtain highly unbalanced classes with the following cardinalities:
We can visualize how the points are partitioned by KMeans:
The 5 classes found with KMeans are highly unbalanced. Increasing the number of clusters would not lead to balanced classes, since the points with the highest values would just be split further. Oversampling the minority classes would generate too many synthetic points, taking us too far from the real data. The only viable option is to undersample the majority class and then create 5 classes using quantile splitting.
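One possible reading of this procedure as a sketch (the sampling fraction is an illustrative assumption, not the notebook's actual choice):

```python
import numpy as np
import pandas as pd

# Undersample the KMeans majority class, then re-discretize 'shares'.
majority_label = np.bincount(kmeans.labels_).argmax()
mask = kmeans.labels_ == majority_label
reduced = pd.concat([df[mask].sample(frac=0.3, random_state=0), df[~mask]])

# Quantile splitting: 5 classes of (roughly) equal cardinality.
reduced['class'] = pd.qcut(reduced['shares'], q=5, labels=False)
print(reduced['class'].value_counts())
```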
The cardinalities of the new 5 classes are:
First the data are split into a training set and a testing set; then the random forest algorithm is applied to select features based on the classifier's feature importances.
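A sketch of this selection step, assuming `y_class` holds the discretized 5-class target (the variable names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Hold out half of the data, matching the tuning procedure described below.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_class, test_size=0.5, random_state=0)

# Keep only the features whose random-forest importance is above the mean.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(rf).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_scaled.columns[selector.get_support()])
```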
For hyperparameter tuning I decided to use GridSearchCV. First the scaled dataset is split into a training set and a testing set. The best hyperparameters for a given algorithm are searched using half of the data (the training set). The classifier with the best performance (highest accuracy) is then applied to the whole dataset and evaluated using cross validation.
For SVMs the following hyperparameter grids have been searched:
```python
hyperparameters_1 = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000],
                      'max_iter': [1_000_000], 'tol': [0.01]}]
hyperparameters_2 = [{'kernel': ['poly'], 'C': [1, 10, 100, 1000],
                      'degree': [10, 12], 'gamma': [0.1, 0.05, 0.01]}]
hyperparameters_3 = [{'kernel': ['sigmoid'], 'C': [1, 10, 100, 1000],
                      'gamma': [0.1, 0.001, 0.0001]}]
hyperparameters_4 = [{'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
                      'gamma': [0.1, 0.001, 0.0001]}]
```
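A sketch of how these grids might be searched on the training half (assuming `X_train_sel` and `y_train` from the feature-selection step above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Run one grid search per kernel and keep the combination with the
# highest cross-validated accuracy.
best = None
for grid in (hyperparameters_1, hyperparameters_2,
             hyperparameters_3, hyperparameters_4):
    search = GridSearchCV(SVC(), grid, scoring='accuracy', cv=5)
    search.fit(X_train_sel, y_train)
    if best is None or search.best_score_ > best.best_score_:
        best = search
print(best.best_params_, best.best_score_)
```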
The SVM with a linear kernel seems to be the slowest algorithm, and its accuracy (with the best hyperparameter, that is
SVMs with a polynomial kernel are also very slow when the degree is low, as they behave similarly to SVMs with a linear kernel. With higher degrees (10, 12, ...) they converge but do not reach a high accuracy (
SVMs with sigmoid and RBF (exponential) kernels are much faster and reach a higher accuracy (around
The best combination found is: sigmoid kernel, gamma = 0.1, C = 1, for an accuracy of
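Following the procedure described earlier, the best configuration can then be evaluated on the whole dataset with cross validation; a sketch, where `X_all_sel` is an assumed name for the feature-selected design matrix:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Re-fit the best configuration found above and estimate its accuracy
# over the whole (feature-selected) dataset.
X_all_sel = selector.transform(X_scaled)
best_svm = SVC(kernel='sigmoid', gamma=0.1, C=1)
scores = cross_val_score(best_svm, X_all_sel, y_class, cv=5,
                         scoring='accuracy')
print(scores.mean(), '+/-', scores.std())
```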
For MLPs the following hyperparameters have been searched (using RandomizedSearchCV):
```python
hyperparameters = [{'activation': ['logistic', 'relu', 'tanh'],
                    'solver': ['sgd', 'adam', 'lbfgs'],
                    'max_iter': [10000],
                    'alpha': [1e-5, 1e-4, 1e-3, 1e-2],
                    'learning_rate_init': [0.01, 0.005, 0.001],
                    'hidden_layer_sizes': [(20, 10), (15, 7), (20,),
                                           (15,), (25, 15, 7), (10,)],
                    'random_state': [1, 2, 3, 4]}]
```
Multilayer Perceptrons do not perform better than SVMs: their accuracy is still low. In this case, since the number of hyperparameter combinations is quite high, I chose to use RandomizedSearchCV instead of an exhaustive grid search.
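A sketch of the randomized search, where `n_iter = 50` is an illustrative assumption:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Sample a fixed number of combinations instead of trying them all.
search = RandomizedSearchCV(MLPClassifier(), hyperparameters, n_iter=50,
                            scoring='accuracy', cv=5, random_state=0)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)
```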
The best hyperparameters are:
solver = adam, random_state = 4 (the 4th weight initialization), learning_rate_init = 0.001, 2 hidden layers with sizes 15 and 7, alpha = 1e-05, activation function = tanh.
The overall accuracy is about
I increased the maximum number of iterations to 100000 (200 was the default value), but lbfgs still did not reach convergence in most cases.
For Decision Trees the following hyperparameter grid has been searched:

```python
hyperparameters = [{'max_features': [None, 'sqrt', 'log2'],
                    'min_samples_leaf': [2, 5, 10, 20, 50],
                    'criterion': ['gini', 'entropy'],
                    'class_weight': ['balanced', None]}]
```
Decision Trees perform much better than SVMs and MLPs as they are faster and more accurate. Accuracy reaches
For Random Forests the same grid is extended with the number of estimators:

```python
hyperparameters = [{'max_features': [None, 'sqrt', 'log2'],
                    'min_samples_leaf': [2, 5, 10, 20, 50],
                    'criterion': ['gini', 'entropy'],
                    'class_weight': ['balanced', None],
                    'n_estimators': [10, 50, 100]}]
```
Random Forests achieve the best performance: the best estimated accuracy reached is
The best hyperparameters are:
no class balancing (class_weight = None), 'gini' splitting criterion, max_features = sqrt(total number of features), min_samples_leaf = 20, and n_estimators = 50.
We can get a better idea of what the best hyperparameters for random forests are, given these data: we fix one hyperparameter value and look at the average accuracy obtained with that value while all the other hyperparameters vary.
For each hyperparameter value (for example 'criterion' = 'gini') we compute the mean score and standard deviation over all the experiments where that value was set, to understand how it affected the estimated accuracy (the accuracy computed on half of the data).
For each hyperparameter value we can plot an error bar centred at the mean:
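A sketch of how such error bars could be produced from the grid-search results (`search_rf`, the fitted GridSearchCV object for the random forest, is an assumed name, and `n_estimators` is used as the example hyperparameter):

```python
import matplotlib.pyplot as plt
import numpy as np

# Average the cross-validated accuracy over all grid points sharing each
# value of one hyperparameter, then plot mean +/- standard deviation.
results = search_rf.cv_results_
param, scores = results['param_n_estimators'], results['mean_test_score']
values = sorted(set(param))
means = [np.mean(scores[param == v]) for v in values]
stds = [np.std(scores[param == v]) for v in values]

plt.errorbar(range(len(values)), means, yerr=stds, fmt='o', capsize=4)
plt.xticks(range(len(values)), [str(v) for v in values])
plt.xlabel('n_estimators')
plt.ylabel('mean CV accuracy')
plt.show()
```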
As reported previously, the best hyperparameters found for random forests are:
'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 20, 'n_estimators': 50
The previous plot shows that random forests with 100 trees may perform better than random forests with 50 trees, while random forests with only 10 trees perform worse. Generally speaking, random forests with few trees tend to overfit, so a higher number of estimators is preferred.
The same argument holds for the hyperparameter min_samples_leaf: larger leaves reduce the risk of overfitting. Although the best accuracy was obtained with min_samples_leaf = 20, the following plot shows that on average min_samples_leaf = 50 can lead to better performance.
In both cases (min_samples_leaf = 50 and n_estimators = 100) the standard deviation takes smaller values compared to all the other hyperparameter settings. This means that for these two values it is more likely to obtain accuracies close to the mean shown in the plot: the chances of poor performance are lower.
All the other hyperparameters (class_weight, criterion and max_features) do not seem to have a decisive effect.
[1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. In Proceedings of the 17th Portuguese Conference on Artificial Intelligence (EPIA 2015), Coimbra, Portugal, September 2015.


