Machine Learning course 2021-2022, Computer Science Department.
Master's Degree in Applied Mathematics,
Sapienza University of Rome.
Exam date: June 13th 2022
The provided dataset is a modified, noisy version of the original dataset described in [1].
This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal of the task is to predict the number of shares in social networks (popularity).
Here is a brief description of the dataset's features:
| index | name | description |
|---|---|---|
| 0 | url | URL of the article |
| 1 | timedelta | Days between the article publication and the dataset acquisition |
| 2 | n_tokens_title | Number of words in the title |
| 3 | n_tokens_content | Number of words in the content |
| 4 | n_unique_tokens | Rate of unique words in the content |
| 5 | n_non_stop_words | Rate of non-stop words in the content |
| 6 | n_non_stop_unique_tokens | Rate of unique non-stop words in content |
| 7 | num_hrefs | Number of links |
| 8 | num_self_hrefs | Number of links to other articles published by Mashable |
| 9 | num_imgs | Number of images |
| 10 | num_videos | Number of videos |
| 11 | average_token_length | Average length of the words in the content |
| 12 | num_keywords | Number of keywords in the metadata |
| 13 | data_channel_is_lifestyle | Is data channel 'Lifestyle'? |
| 14 | data_channel_is_entertainment | Is data channel 'Entertainment'? |
| 15 | data_channel_is_bus | Is data channel 'Business'? |
| 16 | data_channel_is_socmed | Is data channel 'Social Media'? |
| 17 | data_channel_is_tech | Is data channel 'Tech'? |
| 18 | data_channel_is_world | Is data channel 'World'? |
| 19 | kw_min_min | Worst keyword (min. shares) |
| 20 | kw_max_min | Worst keyword (max. shares) |
| 21 | kw_avg_min | Worst keyword (avg. shares) |
| 22 | kw_min_max | Best keyword (min. shares) |
| 23 | kw_max_max | Best keyword (max. shares) |
| 24 | kw_avg_max | Best keyword (avg. shares) |
| 25 | kw_min_avg | Avg. keyword (min. shares) |
| 26 | kw_max_avg | Avg. keyword (max. shares) |
| 27 | kw_avg_avg | Avg. keyword (avg. shares) |
| 28 | self_reference_min_shares | Min. shares of referenced articles in Mashable |
| 29 | self_reference_max_shares | Max. shares of referenced articles in Mashable |
| 30 | self_reference_avg_sharess | Avg. shares of referenced articles in Mashable |
| 31 | weekday_is_monday | Was the article published on a Monday? |
| 32 | weekday_is_tuesday | Was the article published on a Tuesday? |
| 33 | weekday_is_wednesday | Was the article published on a Wednesday? |
| 34 | weekday_is_thursday | Was the article published on a Thursday? |
| 35 | weekday_is_friday | Was the article published on a Friday? |
| 36 | weekday_is_saturday | Was the article published on a Saturday? |
| 37 | weekday_is_sunday | Was the article published on a Sunday? |
| 38 | is_weekend | Was the article published on the weekend? |
| 39 | LDA_00 | Closeness to LDA topic 0 |
| 40 | LDA_01 | Closeness to LDA topic 1 |
| 41 | LDA_02 | Closeness to LDA topic 2 |
| 42 | LDA_03 | Closeness to LDA topic 3 |
| 43 | LDA_04 | Closeness to LDA topic 4 |
| 44 | global_subjectivity | Text subjectivity |
| 45 | global_sentiment_polarity | Text sentiment polarity |
| 46 | global_rate_positive_words | Rate of positive words in the content |
| 47 | global_rate_negative_words | Rate of negative words in the content |
| 48 | rate_positive_words | Rate of positive words among non-neutral tokens |
| 49 | rate_negative_words | Rate of negative words among non-neutral tokens |
| 50 | avg_positive_polarity | Avg. polarity of positive words |
| 51 | min_positive_polarity | Min. polarity of positive words |
| 52 | max_positive_polarity | Max. polarity of positive words |
| 53 | avg_negative_polarity | Avg. polarity of negative words |
| 54 | min_negative_polarity | Min. polarity of negative words |
| 55 | max_negative_polarity | Max. polarity of negative words |
| 56 | title_subjectivity | Title subjectivity |
| 57 | title_sentiment_polarity | Title polarity |
| 58 | abs_title_subjectivity | Absolute subjectivity level |
| 59 | abs_title_sentiment_polarity | Absolute polarity level |
| 60 | shares | Number of shares (target) |
The dataset is loaded and cleaned (see the cleaned version). In this notebook a feature importance analysis is then performed to remove redundant features, and the data is afterwards scaled to perform model selection (a sketch of the scaling step follows the list below). This phase consists of the following steps:
- The target feature is discretized (the number of classes must be $\geq$ 5).
- The dataset is split to perform cross validation.
- The following models are trained:
  - Decision Trees
  - Support Vector Machines
  - Random Forests
  - MLPs (Multilayer Perceptrons)
- Hyper-parameter tuning is performed and discussed.
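A minimal sketch of the scaling step, assuming the cleaned data is in a pandas DataFrame `df` (the variable names are assumptions, not the notebook's actual ones, and non-numeric columns such as `url` are assumed to have been dropped during cleaning):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 'shares' (the target) is separated from the predictive features.
X = df.drop(columns=['shares'])
y = df['shares']

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```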
The scaled dataset looks like this:
Before performing feature importance analysis we must define what the target-feature classes are. Using KMeans we can visualize how the column "shares" could be partitioned into clusters. I chose to visualize 5 clusters.
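A minimal sketch of how this clustering might be computed, assuming `y` holds the "shares" column from the scaling step above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the 1-D target column into 5 groups.
shares = y.to_numpy().reshape(-1, 1)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(shares)

# Cardinality of each cluster (expected to be highly unbalanced).
labels, counts = np.unique(kmeans.labels_, return_counts=True)
print(dict(zip(labels, counts)))
```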
If we divided the target column into 5 classes based on the KMeans algorithm, we would obtain highly unbalanced classes with the following cardinalities:
We can visualize how the points are partitioned by KMeans:
The 5 classes found with KMeans are highly unbalanced. Increasing the number of clusters would not lead to balanced classes, since the points with the highest values would just be split further. Oversampling the minority classes would generate too many synthetic points, taking us too far from the real data. The only viable option is to undersample the majority class and then create 5 classes using quantile splitting.
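One possible reading of this procedure as a sketch (the sampling fraction is an illustrative assumption, not the notebook's actual choice):

```python
import numpy as np
import pandas as pd

# Undersample the KMeans majority class, then re-discretize 'shares'.
majority_label = np.bincount(kmeans.labels_).argmax()
mask = kmeans.labels_ == majority_label
reduced = pd.concat([df[mask].sample(frac=0.3, random_state=0), df[~mask]])

# Quantile splitting: 5 classes of (roughly) equal cardinality.
reduced['class'] = pd.qcut(reduced['shares'], q=5, labels=False)
print(reduced['class'].value_counts())
```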
The cardinalities of the new 5 classes are:
First the data are split into a training set and a testing set; then the random forest algorithm is applied to select features based on the classifier's feature importances.
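A sketch of this selection step, assuming `y_class` holds the discretized 5-class target (the variable names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Hold out half of the data, matching the tuning procedure described below.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_class, test_size=0.5, random_state=0)

# Keep only the features whose random-forest importance is above the mean.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(rf).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_scaled.columns[selector.get_support()])
```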
For hyperparameter tuning I decided to use GridSearchCV. First the scaled dataset is split into a training set and a testing set. The best hyperparameters for a given algorithm are searched using half of the data (the training set). The classifier with the best performance (highest accuracy) is then applied to the whole dataset and evaluated using cross validation.
For SVMs the following hyperparameter grids have been searched:
```python
hyperparameters_1 = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000],
                      'max_iter': [1_000_000], 'tol': [0.01]}]
hyperparameters_2 = [{'kernel': ['poly'], 'C': [1, 10, 100, 1000],
                      'degree': [10, 12], 'gamma': [0.1, 0.05, 0.01]}]
hyperparameters_3 = [{'kernel': ['sigmoid'], 'C': [1, 10, 100, 1000],
                      'gamma': [0.1, 0.001, 0.0001]}]
hyperparameters_4 = [{'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
                      'gamma': [0.1, 0.001, 0.0001]}]
```
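A sketch of how these grids might be searched on the training half (assuming `X_train_sel` and `y_train` from the feature-selection step above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Run one grid search per kernel and keep the combination with the
# highest cross-validated accuracy.
best = None
for grid in (hyperparameters_1, hyperparameters_2,
             hyperparameters_3, hyperparameters_4):
    search = GridSearchCV(SVC(), grid, scoring='accuracy', cv=5)
    search.fit(X_train_sel, y_train)
    if best is None or search.best_score_ > best.best_score_:
        best = search
print(best.best_params_, best.best_score_)
```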
The SVM with a linear kernel seems to be the slowest algorithm, and its accuracy (with the best hyperparameter, that is
SVMs with a polynomial kernel are also very slow when the degree is low, as they behave similarly to SVMs with a linear kernel. With higher degrees (10, 12, ...) they converge but do not reach a high accuracy (
SVMs with sigmoid and RBF (exponential) kernels are much faster and reach a higher accuracy (around
The best combination found is: sigmoid kernel, gamma = 0.1, C = 1, for an accuracy of
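Following the procedure described earlier, the best configuration can then be evaluated on the whole dataset with cross validation; a sketch, where `X_all_sel` is an assumed name for the feature-selected design matrix:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Re-fit the best configuration found above and estimate its accuracy
# over the whole (feature-selected) dataset.
X_all_sel = selector.transform(X_scaled)
best_svm = SVC(kernel='sigmoid', gamma=0.1, C=1)
scores = cross_val_score(best_svm, X_all_sel, y_class, cv=5,
                         scoring='accuracy')
print(scores.mean(), '+/-', scores.std())
```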
For MLPs the following hyperparameters have been searched (using RandomizedSearchCV):
```python
hyperparameters = [{'activation': ['logistic', 'relu', 'tanh'],
                    'solver': ['sgd', 'adam', 'lbfgs'],
                    'max_iter': [10000],
                    'alpha': [1e-5, 1e-4, 1e-3, 1e-2],
                    'learning_rate_init': [0.01, 0.005, 0.001],
                    'hidden_layer_sizes': [(20, 10), (15, 7), (20,),
                                           (15,), (25, 15, 7), (10,)],
                    'random_state': [1, 2, 3, 4]}]
```
Multilayer Perceptrons do not perform better than SVMs: their accuracy is still low. In this case, since the number of hyperparameter combinations is quite high, I chose to use RandomizedSearchCV instead of an exhaustive grid search.
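A sketch of the randomized search, where `n_iter = 50` is an illustrative assumption:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Sample a fixed number of combinations instead of trying them all.
search = RandomizedSearchCV(MLPClassifier(), hyperparameters, n_iter=50,
                            scoring='accuracy', cv=5, random_state=0)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)
```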
The best hyperparameters are:
solver = adam, random_state = 4 (the 4th weight initialization), learning_rate_init = 0.001, 2 hidden layers with sizes 15 and 7, alpha = 1e-05, activation function = tanh.
The overall accuracy is about
I increased the maximum number of iterations to 100000 (200 was the default value), but lbfgs still did not reach convergence in most cases.
For Decision Trees the following hyperparameter grid has been searched:

```python
hyperparameters = [{'max_features': [None, 'sqrt', 'log2'],
                    'min_samples_leaf': [2, 5, 10, 20, 50],
                    'criterion': ['gini', 'entropy'],
                    'class_weight': ['balanced', None]}]
```
Decision Trees perform much better than SVMs and MLPs as they are faster and more accurate. Accuracy reaches
For Random Forests the same grid is extended with the number of estimators:

```python
hyperparameters = [{'max_features': [None, 'sqrt', 'log2'],
                    'min_samples_leaf': [2, 5, 10, 20, 50],
                    'criterion': ['gini', 'entropy'],
                    'class_weight': ['balanced', None],
                    'n_estimators': [10, 50, 100]}]
```
Random Forests achieve the best performance: the best estimated accuracy reached is
The best hyperparameters are:
no class balancing (class_weight = None), 'gini' splitting criterion, max_features = sqrt(total number of features), min_samples_leaf = 20, and n_estimators = 50.
We can get a better idea of what the best hyperparameters for random forests are, given these data: we fix one hyperparameter value and look at the average accuracy obtained with that value while all the other hyperparameters vary.
For each hyperparameter value (for example 'criterion' = 'gini') we compute the mean score and standard deviation over all the experiments where that value was set, to understand how it affected the estimated accuracy (the accuracy computed on half of the data).
For each hyperparameter value we can plot an error bar centred at the mean:
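A sketch of how such error bars could be produced from the grid-search results (`search_rf`, the fitted GridSearchCV object for the random forest, is an assumed name, and `n_estimators` is used as the example hyperparameter):

```python
import matplotlib.pyplot as plt
import numpy as np

# Average the cross-validated accuracy over all grid points sharing each
# value of one hyperparameter, then plot mean +/- standard deviation.
results = search_rf.cv_results_
param, scores = results['param_n_estimators'], results['mean_test_score']
values = sorted(set(param))
means = [np.mean(scores[param == v]) for v in values]
stds = [np.std(scores[param == v]) for v in values]

plt.errorbar(range(len(values)), means, yerr=stds, fmt='o', capsize=4)
plt.xticks(range(len(values)), [str(v) for v in values])
plt.xlabel('n_estimators')
plt.ylabel('mean CV accuracy')
plt.show()
```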
As reported previously, the best hyperparameters found for random forests are:
'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 20, 'n_estimators': 50
The previous plot shows that random forests with 100 trees may perform better than random forests with 50 trees, while random forests with only 10 trees perform worse. Generally speaking, random forests with few trees tend to overfit, so a higher number of estimators is preferred.
The same argument holds for the hyperparameter min_samples_leaf: larger leaves reduce the risk of overfitting. Although the best accuracy was obtained with min_samples_leaf = 20, the following plot shows that on average min_samples_leaf = 50 can lead to better performance.
In both cases (min_samples_leaf = 50 and n_estimators = 100) the standard deviation takes smaller values compared to all the other hyperparameter settings. This means that for these two values it is more likely to obtain accuracies close to the mean shown in the plot: the chances of poor performance are lower.
All the other hyperparameters (class_weight, criterion and max_features) do not seem to have a decisive effect.
[1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. In Proceedings of the 17th Portuguese Conference on Artificial Intelligence (EPIA 2015), Coimbra, Portugal, September 2015.


