title: IST 707 HW4: Regression and Artificial Neural Network/Deep Learning
author: Sanjeev Ramasamy Seenivasagamani
date: April 27, 2019
Load the required libraries:
require(foreign)
require(ggplot2)
require(MASS)
require(Hmisc)
require(reshape2)
library(rms)
library(data.table)
library(tidyverse)
library(tidytext)
library(caret)
library(rpart)
library("rpart.plot")
library(recipes)
library(randomForest)
library(keras)
This is an educational data set collected from a learning management system (LMS) called Kalboard 360. Kalboard 360 is a multi-agent LMS designed to facilitate learning through the use of leading-edge technology. The system provides users with synchronous access to educational resources from any device with an Internet connection.
The dataset consists of 305 males and 175 females. The students come from different origins: 179 are from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 each from the USA, Iran, and Libya, 4 from Morocco, and one from Venezuela.
The dataset also includes a new category of features relating to parent participation in the educational process. The parent participation feature has two sub-features: Parent Answering Survey and Parent School Satisfaction. 270 of the parents answered the survey and 210 did not; 292 of the parents are satisfied with the school and 188 are not.
Students are classified into three levels based on their total grade/mark:
Low-Level: interval includes values from 0 to 69,
Middle-Level: interval includes values from 70 to 89,
High-Level: interval includes values from 90 to 100.
data <- read.csv("Students Academic Performance.csv", header = TRUE)
str(data)
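Before plotting, it is worth checking how the three achievement levels are distributed (a quick sketch; Class is the target column in the data just loaded):
table(data$Class) # counts of H, L and M students
round(prop.table(table(data$Class)), 2) # the same counts as proportions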
Let us have a look at the data distribution through various plots.
ggplot(data = data, aes(x = PlaceofBirth)) + geom_bar(aes(fill = NationalITy)) +
labs(x = "Birth Place", y = "Student Count") + coord_flip()
ggplot(data = data, aes(x = Class, fill = gender)) + geom_bar() +
labs(x = "Class", y = "Student Count") + coord_flip()
ggplot(data = data, aes(x = GradeID, fill = gender)) + geom_bar() +
labs(x = "Grade ID", y = "Student Count") + coord_flip()
ggplot(data = data, aes(x = SectionID, fill = Topic)) + geom_bar() +
labs(x = "Section ID", y = "Student Count") +
coord_flip()
ggplot(data = data, aes(x = Class, fill = Topic)) + geom_bar() +
labs(x = "Class", y = "Student Count") +
coord_flip()
ggplot(data = data, aes(x = Topic, fill = gender)) + geom_bar() +
labs(x = "Topic", y = "Student Count") +
scale_y_continuous(breaks = seq(0,100,4)) + coord_flip()
ggplot(data = data, aes(x = Topic, fill = StageID)) + geom_bar() +
labs(x = "Topic", y = "Student Count") + coord_flip() +
scale_y_continuous(breaks = seq(0, 100, 4))
ggplot(data = data, aes(x = Topic, fill = SectionID)) + geom_bar() +
labs(x = "Topic", y = "Student Count") + coord_flip() +
scale_y_continuous(breaks = seq(0, 100, 4))
ggplot(data = data, aes(x = Topic, fill = Class)) + geom_bar(position = "fill") +
labs(x = "Topic", y = "Proportion of Students") + coord_flip() +
scale_y_continuous(breaks = seq(0, 1, 0.1))
ggplot(data = data, aes( x = VisITedResources, y = AnnouncementsView)) + geom_point() +
geom_smooth(method = "lm")
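The strength of this linear trend can also be quantified with a simple correlation coefficient (a quick sketch using the same two columns):
cor(data$VisITedResources, data$AnnouncementsView) # Pearson correlation between the two activity counts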
As you can see, some of the categorical variables have more than nine levels. This can cause issues while training the ordinal regression model, so let us convert the ordinal data to numeric codes.
#Converting the variables to character prior to ranking them
data$NationalITy<- as.character(data$NationalITy)
data$PlaceofBirth<- as.character(data$PlaceofBirth)
data$GradeID<- as.character(data$GradeID)
data$Topic<- as.character(data$Topic)
#Nationality
data$NationalITy[data$NationalITy == "Egypt"] <- 1
data$NationalITy[data$NationalITy == "Tunis"] <- 1
data$NationalITy[data$NationalITy == "Lybia"] <- 1
data$NationalITy[data$NationalITy == "Iran"] <- 2
data$NationalITy[data$NationalITy == "Iraq"] <- 2
data$NationalITy[data$NationalITy == "Jordan"] <- 2
data$NationalITy[data$NationalITy == "KW"] <- 2
data$NationalITy[data$NationalITy == "lebanon"] <- 2
data$NationalITy[data$NationalITy == "SaudiArabia"] <- 2
data$NationalITy[data$NationalITy == "Syria"] <- 2
data$NationalITy[data$NationalITy == "Palestine"] <- 2
data$NationalITy[data$NationalITy == "Morocco"] <- 3
data$NationalITy[data$NationalITy == "venzuela"] <- 4
data$NationalITy[data$NationalITy == "USA"] <- 5
#Place of Birth
data$PlaceofBirth[data$PlaceofBirth == "Egypt"] <- 1
data$PlaceofBirth[data$PlaceofBirth == "Tunis"] <- 1
data$PlaceofBirth[data$PlaceofBirth == "Lybia"] <- 1
data$PlaceofBirth[data$PlaceofBirth == "Iran"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "Iraq"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "Jordan"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "KuwaIT"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "lebanon"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "SaudiArabia"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "Syria"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "Palestine"] <- 2
data$PlaceofBirth[data$PlaceofBirth == "Morocco"] <- 3
data$PlaceofBirth[data$PlaceofBirth == "venzuela"] <- 4
data$PlaceofBirth[data$PlaceofBirth == "USA"] <- 5
#Grade ID
data$GradeID[data$GradeID == "G-02"] <- 2
data$GradeID[data$GradeID == "G-04"] <- 4
data$GradeID[data$GradeID == "G-05"] <- 5
data$GradeID[data$GradeID == "G-06"] <- 6
data$GradeID[data$GradeID == "G-07"] <- 7
data$GradeID[data$GradeID == "G-08"] <- 8
data$GradeID[data$GradeID == "G-09"] <- 9
data$GradeID[data$GradeID == "G-10"] <- 10
data$GradeID[data$GradeID == "G-11"] <- 11
data$GradeID[data$GradeID == "G-12"] <- 12
#Subject Topic
data$Topic[data$Topic == "Arabic"] <- 1
data$Topic[data$Topic == "Biology"] <- 2
data$Topic[data$Topic == "Chemistry"] <- 3
data$Topic[data$Topic == "English"] <- 4
data$Topic[data$Topic == "French"] <- 5
data$Topic[data$Topic == "Geology"] <- 6
data$Topic[data$Topic == "History"] <- 7
data$Topic[data$Topic == "IT"] <- 8
data$Topic[data$Topic == "Math"] <- 9
data$Topic[data$Topic == "Quran"] <- 10
data$Topic[data$Topic == "Science"] <- 11
data$Topic[data$Topic == "Spanish"] <- 12
#Converting the recoded character variables to integer
data$NationalITy<- as.integer(data$NationalITy)
data$PlaceofBirth<- as.integer(data$PlaceofBirth)
data$GradeID<- as.integer(data$GradeID)
data$Topic<- as.integer(data$Topic)
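To verify the recoding, we can inspect the converted columns (a quick check):
str(data[, c("NationalITy", "PlaceofBirth", "GradeID", "Topic")]) # all four should now be integer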
Now let us split the data into training (60%) and testing (40%) sets and fit an ordinal logistic regression model for the Class variable.
train_index <- createDataPartition(data$Class, p = 0.6, list = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]
m <- polr(Class ~., data = train, Hess=TRUE)
summary(m)
We see the usual regression output: the coefficient table with the value of each coefficient, its standard error and t value, estimates for the two intercepts, the residual deviance, and the AIC. AIC is the Akaike Information Criterion; the lower, the better.
Now we’ll calculate some essential metrics such as p-Value, CI, Odds ratio:
ctable <- coef(summary(m))
Calculating the p-value from the t-value:
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = p)
ctable
Confidence Intervals:
ci <- confint(m)
exp(coef(m))
## OR and CI
exp(cbind(OR = coef(m), ci))
Since we have many independent variables, we will look at just three of them for interpretation purposes (a short numerical check follows the list below).
- Being Male multiplies the odds of the student being in the Low, Middle, or High grade levels combined by 1.326 compared to being Female.
- When the number of raised hands goes up by one unit, the odds of moving from the Low level to the Middle or High levels are multiplied by 0.987.
- For every additional announcement view, the odds of the student moving from the Low scoring bracket to the Middle or High level are multiplied by 1.008.
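As a quick numerical illustration of how such odds ratios arise from the model, recall that an odds ratio is simply the exponential of the corresponding coefficient (the coefficient values below are illustrative, chosen to reproduce the multipliers quoted above):
exp(0.282) # ~1.326, the multiplier quoted for gender
exp(-0.013) # ~0.987, the multiplier quoted for raised hands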
Let us now go ahead and predict the Class variable using the above model.
predictedClass <- predict(m, test) # predict the classes directly
head(predictedClass)
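Instead of hard class labels, polr can also return the predicted probability of each level (a brief sketch):
head(predict(m, test, type = "probs")) # per-class probabilities for the first few test rows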
Confusion Matrix:
table(test$Class, predictedClass)
Misclassification Error:
mean(as.character(test$Class) != as.character(predictedClass))
As we manipulated the data for the ordinal logistic regression, let us now read the data again and perform a fresh train-test split for the Multilayer Perceptron modelling.
data <- read.csv("Students Academic Performance.csv", header = TRUE)
train_index <- createDataPartition(data$Class, p = 0.6, list = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]
The predictors are pre-processed with a recipe: nominal variables are dummy-encoded, and all predictors are centered and scaled.
rec_obj <- recipe(Class ~ ., data = train) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes()) %>%
prep(data = train)
rec_obj
Separating the dependent and independent variables in the training and testing datasets:
x_train_tbl <- as.matrix(bake(rec_obj, new_data = train) %>% select(-Class))
x_test_tbl <- as.matrix(bake(rec_obj, new_data = test) %>% select(-Class))
y_train_vec <- ifelse(pull(train, Class) == "H", 2,ifelse(pull(train, Class) == "M", 1,0))
y_test_vec <- ifelse(pull(test, Class) == "H", 2,ifelse(pull(test, Class) == "M", 1,0))
str(x_train_tbl)
dim(x_train_tbl)
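A quick check that the numeric encoding of the response (H = 2, M = 1, L = 0) looks sensible:
table(y_train_vec) # counts of 0 (L), 1 (M) and 2 (H) in the training labels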
library(keras)
reticulate::use_python("/usr/local/bin/python3") # point reticulate at the local Python 3
model_keras <- keras_model_sequential()
model_keras %>%
layer_flatten(input_shape = ncol(x_train_tbl)) %>% # one input unit per predictor column
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 3, activation = 'softmax') # three output classes: L, M, H
model_keras %>% compile(
optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = c('accuracy')
)
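Before training, we can inspect the network architecture and parameter counts (a quick sketch):
summary(model_keras) # layer-by-layer output shapes and parameter counts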
set.seed(100)
fit <- model_keras %>% fit(x_train_tbl, y_train_vec, epochs = 11)
Plot the training accuracy and loss over the epochs:
plot(fit)
We can see from the above plot that the accuracy saturates around 84% as the loss continues to drop.
Predicting the Class:
class_pred <- model_keras %>% predict_classes(x_test_tbl)
class_pred[1:20]
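To make the comparison with the other classifiers concrete, we can tabulate these predictions against the true test labels and compute the test accuracy (a sketch using the objects created above):
table(Predicted = class_pred, Actual = y_test_vec) # confusion matrix on the encoded classes
mean(class_pred == y_test_vec) # overall test accuracy of the MLP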
Since this is a classification problem, let us try a few other algorithms straight out of the box, starting with Naive Bayes.
model_nb <- train(Class ~., data = train, method = "nb",
trControl = trainControl(method = "repeatedcv",
number = 6, repeats = 5),
tuneGrid = expand.grid(fL = 1, usekernel = T, adjust = 1))
model_nb
As we can see from the summary, the model has an accuracy of just 45% with a kappa score of 0.0284.
Let us now go ahead and predict the Class variable using the above model and evaluate it through confusion matrix.
predict_nb <- predict(model_nb, newdata = test, type = "raw")
head(predict_nb)
# Confusion Matrix
confusionMatrix(predict_nb, test$Class)
Let us now implement the decision tree algorithm on the training dataset and see if it outperforms Naive Bayes.
model_dt <- train(Class ~ ., data = train, method = "rpart",
metric = "Accuracy",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))
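Before predicting, we can check which complexity parameter caret selected during cross-validation (a quick sketch):
model_dt$bestTune # the cp value chosen by repeated cross-validation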
Similar to the previous steps, we will go ahead and predict the Class variable using the decision tree model.
predict_dt <- predict(model_dt, newdata = test, type = "raw")
head(predict_dt)
# Confusion Matrix
confusionMatrix(predict_dt, test$Class)
As we can see, the Decision Tree algorithm performs considerably better than Naive Bayes, reaching an accuracy of 69.47%.
Let us plot the above model so that we can get a better idea of the conditions used for partitioning the data.
library(rattle)
fancyRpartPlot(model_dt$finalModel)
Finally, let us fit a Random Forest model with 2,000 trees and evaluate it on the test set.
set.seed(10)
rf.model <- randomForest(Class ~ ., data = train, importance = TRUE,
ntree = 2000, nodesize = 20)
rf.predict <- predict(rf.model, test)
confusionMatrix(test$Class, rf.predict)
varImpPlot(rf.model)
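The importance scores behind the plot can also be listed directly (a sketch using the fitted forest):
head(importance(rf.model)) # mean decrease in accuracy and Gini per predictor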
With an accuracy of 76.32%, Random Forest still lags behind the Multilayer Perceptron model.
Based on the accuracy measures, the Multilayer Perceptron outperforms all the other models with an accuracy over 80%.