Using cluster analysis and a decision tree algorithm to solve a historical mystery: who wrote the disputed essays, Hamilton or Madison?
library(factoextra)
library(stringr)
library(tidyr)
library(gridExtra)
library(FunCluster)
library(rpart)
library(caret)
library(rattle)
#Loading the data
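# stringsAsFactors = TRUE keeps author as a factor (R >= 4.0 reads strings as character by default)
data<-read.csv("Disputed_Essay_data.csv", stringsAsFactors = TRUE)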
#str(data)
#Summary of the authors
summary(data$author)
# Map each author label to a short code used in the row names
data$owner <- ifelse(data$author == 'HM', 'HM',
              ifelse(data$author == 'Jay', 'J',
              ifelse(data$author == 'Madison', 'M',
              ifelse(data$author == 'dispt', 'D',
              ifelse(data$author == 'Hamilton', 'H', NA)))))
data<-extract(data, filename, into = c("Name", "Num"), "([^(]+)\\s*[^0-9]+([0-9].).")
data$file<-paste(data$owner,"-",data$Num)
rownames(data)<-data$file
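# Drop the helper columns: owner (second to last), then file (last), then Name and Num (columns 2 and 3)
data<-data[c(-(ncol(data)-1))]
data<-data[c(-(ncol(data)))]
data<-data[c(-2,-3)]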
We are only concerned with the authorship of the disputed articles, and only with Hamilton and Madison as candidate authors, so we can go ahead and remove the 'Jay' and 'HM' essays.
d <- data[data$author!="Jay",]
data <- d[d$author!="HM",]
data<-droplevels(data)
As we have made a few changes to the data, let us have a look at it.
head(data, 5)
The Euclidean distance measures the distance between two vectors; here we use it to measure the similarity between the files. In the plot below, each cell compares one pair of files: cells shaded blue (low distance) mark files that are very similar, and cells shaded red (high distance) mark files that are not.
# Exclude the author column so the distance uses only the numeric word frequencies
distance<-get_dist(data[c(-1)])
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
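As a small illustration of what the distance matrix contains, the sketch below computes the Euclidean distance between two rows by hand and compares it with `dist()`. The two vectors and their values are made up for the example and are not taken from the essay data.

```r
# Two hypothetical essays described by three made-up word frequencies.
essay_a <- c(upon = 0.030, the = 0.060, to = 0.040)
essay_b <- c(upon = 0.000, the = 0.064, to = 0.043)

# Euclidean distance computed directly from its definition.
manual_distance <- sqrt(sum((essay_a - essay_b)^2))

# The same distance via stats::dist(), which compares the rows of a matrix.
dist_distance <- as.numeric(dist(rbind(essay_a, essay_b), method = "euclidean"))

print(manual_distance)
print(dist_distance)
```

Both values should agree, which is a quick way to check that the heat map is showing row-to-row distances.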
Clustering is an unsupervised learning technique. It is the task of grouping together a set of objects in a way that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is an amount that reflects the strength of relationship between two data objects. Clustering is mainly used for exploratory data mining. It is used in many fields such as machine learning, pattern recognition, image analysis, information retrieval, bio-informatics, data compression, and computer graphics.
set.seed(42)
def <- kmeans(data[c(-1)], centers = 5)
t(table(data[,1],def$cluster))
From the above result we can see that the disputed articles are spread across several clusters. The reason is that we used too many clusters, so we need to find the optimal number of clusters to get a clearer answer. Let us have a look at the clusters that we have so far.
fviz_cluster(def, data = data[c(-1)])
set.seed(123)
wss <- function(k){
return(kmeans(data[c(-1)], k, nstart = 25)$tot.withinss)
}
k_values <- 1:10
wss_values <- purrr::map_dbl(k_values, wss)
plot(x = k_values, y = wss_values,
type = "b", frame = F,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of square")
From the elbow in the above graph, where adding more clusters stops reducing the within-cluster sum of squares by much, 4 looks like a reasonable number of clusters for this dataset.
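To make the quantity on the y-axis concrete, here is a minimal sketch that recomputes the total within-cluster sum of squares by hand and compares it with the `tot.withinss` value returned by `kmeans()`. The data are synthetic and the object names (`toy`, `fit`) are placeholders, not part of the analysis above.

```r
set.seed(1)
# Synthetic data: two well-separated groups of 20 points in two dimensions.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Look up, for every point, the centre of the cluster it was assigned to.
assigned_centres <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared distances from each point to its own cluster centre.
manual_wss <- sum((toy - assigned_centres)^2)

print(c(manual = manual_wss, kmeans = fit$tot.withinss))
```

The elbow plot simply repeats this calculation for each candidate k and looks for the point where the curve flattens.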
set.seed(48)
def <- kmeans(data[c(-1)], centers = 4, nstart = 15, iter.max = 100)
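# Use a descriptive name rather than t, which would shadow the base transpose function
cluster_table <- t(table(data[,1],def$cluster))
cluster_table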
As the above result shows, the disputed articles fall into the same clusters as Madison's essays, which suggests that they were authored by Madison.
fviz_cluster(def, data = data[c(-1)])
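One way to read a cluster-by-author table is to convert the counts into shares per author. The sketch below does this with `prop.table()`; the cluster assignments are invented for illustration and do not reproduce the k-means result above.

```r
# Hypothetical cluster assignments for ten essays; labels mirror the author column.
author  <- factor(c("Hamilton", "Hamilton", "Hamilton", "Madison", "Madison",
                    "Madison", "dispt", "dispt", "dispt", "dispt"))
cluster <- c(1, 1, 1, 2, 2, 1, 2, 2, 2, 1)

counts <- table(cluster, author)

# Share of each author's essays that falls in each cluster (every column sums to 1).
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))
```

If the disputed column puts most of its weight in the cluster that also holds most of one author's essays, that is the pattern the clustering argument relies on.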
Let us have another look at how the cluster formation varies as the number of clusters gradually increases.
k2 <- kmeans(data[c(-1)], centers = 2, nstart = 25)
k3 <- kmeans(data[c(-1)], centers = 3, nstart = 25)
k4 <- kmeans(data[c(-1)], centers = 4, nstart = 25)
k5 <- kmeans(data[c(-1)], centers = 5, nstart = 25)
k6 <- kmeans(data[c(-1)], centers = 6, nstart = 25)
k7 <- kmeans(data[c(-1)], centers = 7, nstart = 25)
p2 <- fviz_cluster(k2, geom = "point", data = data[c(-1)]) + ggtitle("k = 2")
p3 <- fviz_cluster(k3, geom = "point", data = data[c(-1)]) + ggtitle("k = 3")
p4 <- fviz_cluster(k4, geom = "point", data = data[c(-1)]) + ggtitle("k = 4")
p5 <- fviz_cluster(k5, geom = "point", data = data[c(-1)]) + ggtitle("k = 5")
p6 <- fviz_cluster(k6, geom = "point", data = data[c(-1)]) + ggtitle("k = 6")
p7 <- fviz_cluster(k7, geom = "point", data = data[c(-1)]) + ggtitle("k = 7")
grid.arrange(p2, p3, p4, p5, p6, p7, nrow = 3)
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
hac_output <- hclust(dist(data[c(-1)], method = "euclidean"), method = "ward.D2")
plot.new()
plot(hac_output,main="Dendrogram using HAC algorithm",xlab = "Author", ylab = "Euclidean Distance", cex = 0.6, hang = -1)
rect.hclust(hac_output, k=4)
Even here, we can clearly see that the disputed articles have been clustered together with the articles authored by Madison.
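The dendrogram can also be checked numerically. As a rough sketch, `cutree()` returns the cluster label of every row for a chosen number of clusters, which can then be tabulated against a known grouping. The data below are synthetic, and the row names are placeholders chosen to mimic the "owner - number" naming used earlier.

```r
set.seed(2)
# Synthetic data: two groups of five points, standing in for word-frequency rows.
toy <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 4, sd = 0.3), ncol = 2))
rownames(toy) <- paste0(rep(c("A", "B"), each = 5), "-", 1:5)

tree <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# Cut the tree into two clusters and recover one label per row.
groups <- cutree(tree, k = 2)

# Cross-tabulate the cluster labels against the first letter of each row name.
print(table(group = groups, source = substr(names(groups), 1, 1)))
```

Applied to the essay data, the same table would show how many disputed essays share a branch with each author.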
The decision tree algorithm belongs to the family of supervised learning algorithms. It can be used for both regression and classification problems.
The general motive for using a decision tree is to create a model that can predict the class or value of a target variable by learning decision rules inferred from prior data (the training data).
We split the data into training and test sets based on the author label: the disputed essays form the test set and the essays with a known author form the training set.
test <- data[data$author=="dispt",]
train <- data[data$author!="dispt",]
train<-droplevels(train)
test<-droplevels(test)
Let us now fit a decision tree on this training data. In the prediction step we set 'type' to "prob", so the output is the predicted probability of each author rather than a single label.
dt_model <- train(author ~ ., data = train, metric = "Accuracy", method = "rpart")
dt_predict <- predict(dt_model, newdata = test, na.action = na.omit, type = "prob")
head(dt_predict, 11)
Thus, with 93.75% probability, the disputed articles belong to Madison.
print(dt_model)
fancyRpartPlot(dt_model$finalModel)
dt_predict2 <- predict(dt_model, newdata = test, type = "raw")
print(dt_predict2)
The prediction of type 'raw' reconfirms that the disputed articles were authored by Madison.
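The relationship between probability and label predictions can be seen on a toy tree fitted directly with `rpart`, where `type = "class"` plays the role that `type = "raw"` plays in caret. This is a minimal sketch on synthetic data; the single feature and its value ranges are assumptions made for the example.

```r
library(rpart)

set.seed(3)
# Synthetic training data: one made-up feature that separates two classes.
toy_train <- data.frame(
  author = factor(rep(c("Hamilton", "Madison"), each = 30)),
  upon   = c(runif(30, 0.03, 0.06), runif(30, 0.00, 0.01))
)
toy_test <- data.frame(upon = c(0.002, 0.050))

fit <- rpart(author ~ upon, data = toy_train, method = "class")

# type = "prob" gives one column of class probabilities per author.
probs <- predict(fit, newdata = toy_test, type = "prob")

# type = "class" reports the author with the highest probability in each row.
predicted_class <- predict(fit, newdata = toy_test, type = "class")

print(probs)
print(predicted_class)
```

The label in each row is the column of `probs` with the largest value, so the two prediction types should never disagree.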
dt_model_preprune <- train(author ~ ., data = train, method = "rpart",
metric = "Accuracy",
tuneLength = 8,
control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 6))
print(dt_model_preprune$finalModel)
fancyRpartPlot(dt_model_preprune$finalModel)
In both of the models above, we can clearly see that the word 'upon' plays a significant role. The frequency of this word alone seems to determine the authorship of the whole file (surprisingly!). Tuning and pruning raised the split threshold on that frequency from 0.019 to 0.024. If the frequency is greater than the threshold, the file is attributed to Hamilton; otherwise, it is attributed to Madison.
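The single-split rule described above can be written as a one-line function. This is only a sketch: `classify_by_upon` is a hypothetical helper, the default threshold is the 0.024 quoted for the tuned tree, and the treatment of a frequency exactly equal to the threshold is an assumption.

```r
# Hypothetical helper: apply the single-split rule on the frequency of 'upon'.
# Frequencies above the threshold are attributed to Hamilton, the rest to Madison.
classify_by_upon <- function(upon_frequency, threshold = 0.024) {
  ifelse(upon_frequency > threshold, "Hamilton", "Madison")
}

# Made-up frequencies, chosen only to exercise both branches of the rule.
example_predictions <- classify_by_upon(c(0.005, 0.030))
print(example_predictions)
```

Passing `threshold = 0.019` instead gives the rule from the untuned tree.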
Cross-validation is a statistical method used to estimate the skill of machine learning models.
It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
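To show what a k-fold split does mechanically, the sketch below assigns rows to folds by hand and reports the size of each training and validation subset. The row count and object names are hypothetical; caret handles this step internally.

```r
set.seed(4)
n_rows <- 12   # hypothetical number of training rows
k <- 3         # number of folds

# Assign every row to one of k folds at random, keeping fold sizes balanced.
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)   # rows used for validation in this round
  kept     <- which(fold_id != fold)   # rows used for fitting in this round
  cat("Fold", fold, ": train on", length(kept),
      "rows, validate on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, and the reported accuracy is the average over the k rounds.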
tr_control <- trainControl(method = "cv", number = 3)
dt_model_cv <- train(author ~ ., data = train, method = "rpart",
metric = "Accuracy",
tuneLength = 8,
# trControl belongs to train(), not rpart.control(); cp is tuned by caret via tuneLength
trControl = tr_control,
control = rpart.control(minsplit = 30, minbucket = 10, maxdepth = 5))
print(dt_model_cv$finalModel)
dt_predict3 <- predict(dt_model_cv, newdata = test, type = "raw")
print(dt_predict3)
We can therefore conclude that the disputed articles were authored by Madison.