Attention:
- Do not edit this file in text editors like Word. Use a plain text editor only. In case of doubt, you can use Spyder as a text editor.
- Do not change the structure of this file. Just fill in your answers in the places provided (After the R#: tag).
- You can add lines in the spaces for your answers but your answers should be brief and straight to the point.
QUESTIONS:
Q1: Considering the data provided, explain the need to standardize the attribute values.
R1: We need to rescale the data to avoid numerical problems.
The attributes provided have quite different ranges of values,
which can cause numerical instability and hurt the reproducibility of the approach.
Furthermore, the attribute values can be negative, so we rescaled with standardization
so that each attribute has mean 0 and standard deviation 1.
Q2: Explain how you calculated the parameters for standardization and how you used them in the test set.
R2: The standardization parameters were computed per feature (i.e., per column) of the training set:
the mean and the standard deviation of each feature.
The test set was then standardized by subtracting, from each feature, the mean of that feature
in the training set and dividing the result by the standard deviation of that feature in the training set.
train_means = np.mean(xs_train_features,axis=0)
train_stdevs = np.std(xs_train_features,axis=0)
xs_test_features_std = ( ( xs_test_features - train_means ) / train_stdevs )
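For completeness, a minimal self-contained sketch of this step with toy data
(the variable names match the fragments above; the values are illustrative):
import numpy as np
xs_train_features = np.array([[2.0, 10.0], [4.0, 20.0], [6.0, 30.0]])   # toy training features
xs_test_features = np.array([[3.0, 15.0], [5.0, 25.0]])                 # toy test features
train_means = np.mean(xs_train_features, axis=0)     # per-feature (column) means of the training set
train_stdevs = np.std(xs_train_features, axis=0)     # per-feature (column) standard deviations
xs_train_features_std = ( ( xs_train_features - train_means ) / train_stdevs )   # mean 0, std 1 per feature
xs_test_features_std = ( ( xs_test_features - train_means ) / train_stdevs )     # test set uses the training-set parameters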
Q3: Explain how you calculated the prior probability of an example belonging to a class (the probability before taking into account the attribute values ​​of the example) in your Naïve Bayes classifier implementation. You may include a relevant piece of your code if this helps you explain.
R3: To calculate the prior probability of an example belonging to a class,
we iterate through each possible class
(2 classes in this case, i.e., a banknote being real or fake),
count the number of training samples that belong to that class and divide it by
the total number of samples in the training set.
nb_prior_prob_occurrences_for_current_class_train = ( len(xs_train[ys_train == current_class]) / num_samples_xs_train )
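A minimal self-contained sketch of this computation (class labels and values are illustrative):
import numpy as np
ys_train = np.array([0, 1, 1, 0, 1])      # toy class labels (e.g., 0 = fake, 1 = real)
num_samples_xs_train = len(ys_train)
num_classes = 2
nb_prior_prob_classes = np.zeros(num_classes)
for current_class in range(num_classes):
    # fraction of training samples belonging to the current class
    nb_prior_prob_classes[current_class] = np.sum(ys_train == current_class) / num_samples_xs_train
# nb_prior_prob_classes -> array([0.4, 0.6])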
Q4: Explain how your Naïve Bayes classifier predicts the class to which a test example belongs. You may include a relevant piece of your code if this helps you explain.
R4: To predict the class of a test example, we compute, for each class, the sum of the
log densities of every feature, using the score_samples method of the Kernel Density
Estimator fitted for that (class, feature) pair, which results in an array of
log densities per class.
nb_log_densities_per_class_test[:, current_class] += kde.score_samples(xs_test[:, [current_feature]])
We then add the logarithm of the prior probability of each class's occurrence
in the training set.
nb_log_densities_per_class_test[:, current_class] += nb_logs_prior_prob_classes_occurrences_test_list[current_class]
Finally, for each sample of the test set we take the argmax, i.e., the class with the
highest total log probability, which is the classifier's prediction.
for current_sample_x_test in range(num_samples_xs_test):
    nb_predict_classes_xs_test[current_sample_x_test] = np.argmax( nb_log_densities_per_class_test[current_sample_x_test] )
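Putting these steps together, a hedged sketch of the whole prediction step
(assuming nb_kdes[class][feature] holds the KernelDensity fitted on the training samples
of that class and feature, and nb_logs_prior_prob_classes holds the log priors; names are illustrative):
import numpy as np
def naive_bayes_kde_predict(xs_test, nb_kdes, nb_logs_prior_prob_classes, num_classes, num_features):
    num_samples_xs_test = xs_test.shape[0]
    nb_log_densities_per_class_test = np.zeros((num_samples_xs_test, num_classes))
    for current_class in range(num_classes):
        # log prior of the class ...
        nb_log_densities_per_class_test[:, current_class] += nb_logs_prior_prob_classes[current_class]
        for current_feature in range(num_features):
            # ... plus the sum of the per-feature log densities
            kde = nb_kdes[current_class][current_feature]
            nb_log_densities_per_class_test[:, current_class] += kde.score_samples(xs_test[:, [current_feature]])
    # predicted class = the one with the highest total log probability for each sample
    return np.argmax(nb_log_densities_per_class_test, axis=1)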
Q5: Explain the effect of the bandwidth parameter on your classifier.
R5: The choice of bandwidth influences the classification accuracy.
A small bandwidth gives an overly detailed density curve and hence leads to
an estimate with small bias but large variance (the classifier tends to overfit).
A large bandwidth leads to low variance, at the expense of increased bias
(the density is oversmoothed and the classifier tends to underfit).
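As an illustration, a small sketch (with toy 1-D data) of how the bandwidth smooths the estimated density:
import numpy as np
from sklearn.neighbors import KernelDensity
x = np.array([[0.0], [0.1], [0.2], [2.0], [2.1]])    # toy 1-D sample
grid = np.linspace(-1.0, 3.0, 9).reshape(-1, 1)
for bandwidth in (0.05, 1.0):
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(x)
    # a small bandwidth gives a spiky, very detailed density (low bias, high variance);
    # a large bandwidth gives a smooth density (high bias, low variance)
    print(bandwidth, np.exp(kde.score_samples(grid)).round(3))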
Q6: Explain what effect the C parameter has on the Logistic Regression classifier.
R6: The C parameter of the Logistic Regression classifier is the inverse of the
regularization strength: smaller values of C impose stronger regularization, so the
model's predictions change less when the input variables change than they would
without regularization.
Regularization can improve the generalization performance, or in other words,
it is used to avoid overfitting.
In this case, since the data does not lead to much overfitting,
beyond a certain value of C the accuracy of the results no longer improves.
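As an illustration, a minimal sketch (with toy data) of how C is passed to the scikit-learn
classifier and how it shrinks the coefficients:
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.RandomState(0)
xs_toy = rng.randn(40, 2)                                # toy standardized features
ys_toy = (xs_toy[:, 0] + xs_toy[:, 1] > 0).astype(int)   # toy labels
# smaller C -> stronger regularization (coefficients shrink towards zero);
# larger C -> weaker regularization (the model follows the training data more closely)
for c_value in (0.01, 1.0, 100.0):
    logReg = LogisticRegression(C=c_value).fit(xs_toy, ys_toy)
    print(c_value, logReg.coef_)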
Q7: Explain how you determined the best bandwidth and C parameters for your classifier and the Logistic Regression classifier. You may include a relevant piece of your code if this helps you explain.
R7: For the Logistic Regression classifier, we set the current C parameter to
(1e-2 * 10**(current_exp_factor)), where current_exp_factor is an exponent that
changes at each iteration.
For each candidate C we compute the training and validation errors and
sum them over the K folds of cross-validation.
Afterwards, we average both sums by dividing them by the number of folds;
if the average validation error is lower than the best one found so far,
we update the best validation error and store the current C as the best C value.
current_c_param_value = ( initial_c_param_value * 10**(current_exp_factor) )
for train_idx, valid_idx in k_folds.split(ys_train_classes, ys_train_classes):
    logReg_train_error, logReg_valid_error = compute_logReg_errors(xs_train_features_std, ys_train_classes, train_idx, valid_idx, current_c_param_value, num_features, 'brier_score')
    logReg_train_error_sum += logReg_train_error
    logReg_valid_error_sum += logReg_valid_error
logReg_train_error_avg_folds = ( logReg_train_error_sum / NUM_FOLDS )
logReg_valid_error_avg_folds = ( logReg_valid_error_sum / NUM_FOLDS )
if(logReg_best_valid_error_avg_folds > logReg_valid_error_avg_folds):
    logReg_best_valid_error_avg_folds = logReg_valid_error_avg_folds
    logReg_best_c = current_c_param_value
For our classifier (i.e., Naïve Bayes with custom KDEs), we initialize current_bandwidth,
change it at every iteration and proceed in basically the same way.
In each iteration we compute the training and validation errors and
sum them over the K folds.
Afterwards, we average both sums, dividing them once more by the number of folds;
if that result is lower than the best one found so far, we update it and
store the current bandwidth as the best bandwidth value.
for current_bandwidth in np.arange(initial_bandwidth, ( final_bandwidth + bandwidth_step ), bandwidth_step):
    nb_train_error_sum = 0
    nb_valid_error_sum = 0
    for train_idx, valid_idx in k_folds.split(ys_train_classes, ys_train_classes):
        nb_train_error, nb_valid_error = compute_naive_bayes_errors(xs_train_features_std, ys_train_classes, train_idx, valid_idx, current_bandwidth, num_classes, num_features)
        nb_train_error_sum += nb_train_error
        nb_valid_error_sum += nb_valid_error
    nb_train_error_avg_folds = ( nb_train_error_sum / NUM_FOLDS )
    nb_valid_error_avg_folds = ( nb_valid_error_sum / NUM_FOLDS )
    if(nb_best_valid_error_avg_folds > nb_valid_error_avg_folds):
        nb_best_valid_error_avg_folds = nb_valid_error_avg_folds
        nb_best_bandwidth = current_bandwidth
Q8: Explain how you obtained the best hypothesis for each classifier after optimizing all parameters.
R8: For Logistic Regression, after optimizing the C parameter,
we initialize the scikit-learn class LogisticRegression with
that value and fit the model on the whole training set
(together with the validation data).
For Naïve Bayes with custom KDEs, after optimizing the bandwidth parameter,
we initialize the scikit-learn class KernelDensity with
the optimized bandwidth value and, for every (class, feature) pair,
we fit the corresponding estimator on the whole training set (together with
the validation data) restricted to that class and feature.
For Gaussian Naïve Bayes there is no parameter to optimize:
we simply initialize the scikit-learn class GaussianNB and
fit the model on the whole training set (together with the validation data).
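A hedged sketch of how the final hypotheses would be fitted (assuming logReg_best_c and
nb_best_bandwidth are the optimized values, and xs_train_features_std / ys_train_classes
contain the whole training set, validation data included):
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KernelDensity
# Logistic Regression with the best C, fitted on the whole training set
logReg_classifier = LogisticRegression(C=logReg_best_c)
logReg_classifier.fit(xs_train_features_std, ys_train_classes)
# Naive Bayes with custom KDEs: one KernelDensity per (class, feature) pair, with the best bandwidth
nb_kdes = []
for current_class in range(num_classes):
    kdes_for_class = []
    xs_current_class = xs_train_features_std[ys_train_classes == current_class]
    for current_feature in range(num_features):
        kde = KernelDensity(kernel='gaussian', bandwidth=nb_best_bandwidth)
        kde.fit(xs_current_class[:, [current_feature]])
        kdes_for_class.append(kde)
    nb_kdes.append(kdes_for_class)
# Gaussian Naive Bayes has no parameter to optimize
gnb_classifier = GaussianNB()
gnb_classifier.fit(xs_train_features_std, ys_train_classes)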
Q9: Show the best parameters, the estimate of the true error for each hypothesis you obtained (your classifier and the two provided by the library), the ranges in the expected number of errors given by the approximate normal test, the McNemar test values, and discuss what you can conclude from this.
R9:
Logistic Regression Classifier:
- Best Value for Regularization C parameter: 100.0
- Estimated True Error: 0.0755657212884699
- Approximate Normal Test, with Confidence Level of 95% = [ 134 - 21.442168949301717 ; 134 + 21.442168949301717 ]
- Approximate Normal Test Interval = [ 112.55783105069828 ; 155.44216894930173 ]
Naïve Bayes Classifier:
- Best Value for the Bandwidth parameter: 0.12
- Estimated True Error: 0.11323763955342903
- Approximate Normal Test, with Confidence Level of 95% = [ 142 - 21.993982184056613 ; 142 + 21.993982184056613 ]
- Approximate Normal Test Interval = [ 120.00601781594338 ; 163.99398218405662 ]
Gaussian Naïve Bayes Classifier:
- Estimated True Error: 0.12998405103668265
- Approximate Normal Test, with Confidence Level of 95% = [ 163 - 23.34067871623722 ; 163 + 23.34067871623722 ]
- Approximate Normal Test Interval = [ 139.65932128376278 ; 186.34067871623722 ]
McNemar Test Results:
- McNemar Test #1: Logistic Regression Classifier vs. Naïve Bayes Classifier: 0.6282051282051282
- McNemar Test #2: Logistic Regression Classifier vs. Gaussian Naïve Bayes Classifier: 12.444444444444445
- McNemar Test #3: Naïve Bayes Classifier vs. Gaussian Naïve Bayes Classifier: 6.557377049180328
Based on the estimated true/test errors (the expected error over all possible examples),
we can observe that the values are quite similar and low,
so we can conclude that our classifiers perform well.
In other words, with the optimization of the regularization parameter and the hyperparameters,
they avoid overfitting, which means that
their performance outside the training set should also be good.
We can also see, based on the approximate normal test
(which assumes that, as the number of trials grows to infinity,
the distribution of the number of errors we observe when testing a classifier
tends to a normal distribution around its true error),
that, purely based on the number of errors made,
the Logistic Regression classifier is the best one, followed by
Naïve Bayes and, in last place, Gaussian Naïve Bayes.
Furthermore, the number of errors made is approximately of the order of 1/10
of the number of samples, which means that roughly 1 in every 10 examples is misclassified.
Observing McNemar's test
(which tells us whether two classifiers make statistically different mistakes,
in other words whether they are behaving the same or not; at a 95% confidence level,
a statistic above 3.84 indicates a significant difference),
we can conclude, with 95% confidence, that:
- Logistic Regression is likely better than Gaussian Naïve Bayes;
- Naïve Bayes is likely better than Gaussian Naïve Bayes;
- Logistic Regression and Naïve Bayes are making statistically the same mistakes/errors.
With this, we can clearly verify that Gaussian Naïve Bayes is, by far, the worst of the three,
and we can also conclude that Logistic Regression and Naïve Bayes are very close (similar).
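For reference, a hedged sketch of how the approximate normal test interval and the McNemar
statistic reported above are typically computed (function and variable names are illustrative):
import numpy as np
def approximate_normal_test_interval(num_errors, num_test_samples, z=1.96):
    # 95% interval for the expected number of errors:
    # X +/- 1.96 * sqrt(N * p * (1 - p)), with p = X / N
    p = num_errors / num_test_samples
    sigma = np.sqrt(num_test_samples * p * (1.0 - p))
    return (num_errors - z * sigma, num_errors + z * sigma)
def mcnemar_test_statistic(errors_only_first, errors_only_second):
    # e01 = examples misclassified only by the first classifier,
    # e10 = examples misclassified only by the second;
    # values above 3.84 indicate a statistically significant difference at 95% confidence
    return ((abs(errors_only_first - errors_only_second) - 1.0) ** 2) / (errors_only_first + errors_only_second)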