MNIST Handwritten Digit Classifier

Dataset: Popular machine learning dataset from http://yann.lecun.com/exdb/mnist/. Images are 28×28 pixels, giving 784 explanatory variables, each a grayscale pixel intensity. The label/response variable is a digit from 0-9. The training set has 7000 observations while the test set has around 2000.

Goal: Use training/test set method in conjunction with ML algorithms to classify handwritten digits correctly.
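
The snippets below assume the data has already been read into two data frames, dat.train and dat.test, each with a label column plus 784 pixel columns. A minimal loading sketch, assuming CSV copies of the data (the file names here are placeholders):

# Placeholder file names; any CSV export of MNIST with a 'label' column works
dat.train <- read.csv('mnist_train.csv')
dat.test  <- read.csv('mnist_test.csv')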

We can try a decision tree first; it's the simplest option and can sometimes give good results.

library(rpart)
# Fit a single classification tree on the training set
dtree <- rpart(as.factor(label) ~ ., data = dat.train, method = 'class')
# Plot the fitted tree and label its splits
plot(dtree, uniform = FALSE, margin = 0.1)
text(dtree, use.n = FALSE)

Each node splits the data based on a threshold on one of the pixel variables, and each split is chosen so that the resulting child nodes are as homogeneous as possible.
I personally like to imagine a decision tree as one of those carnival games where you drop a marble into a box of pegs, but in this case the pegs are nodes deciding where the marble will fall.
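
To make "homogeneous" concrete: rpart's default split criterion for classification is the Gini impurity, which is 0 for a pure node and largest when the classes are evenly mixed. A quick sketch with made-up toy labels:

# Gini impurity of a set of labels: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c(7, 7, 7, 7))  # 0   -> perfectly homogeneous node
gini(c(7, 9, 7, 9))  # 0.5 -> evenly mixed node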

With classifiers, MSPE is not an appropriate error measure. Instead we compare our predictions against the test set labels: the success rate is the number of correctly classified digits divided by the total number of rows in the test data.

# Predict classes on the test set and compute the success rate
dtree.pr <- predict(dtree, newdata = dat.test, type = 'class')
dtree.pr.acc <- 1 - sum(as.numeric(levels(dtree.pr))[dtree.pr] != as.numeric(dat.test$label)) / nrow(dat.test)
dtree.pr.acc

#0.6165769 – so a roughly 62% success rate, which is not great. This is to be expected: decision trees are known to be unstable, and a small change in the data could result in a COMPLETELY different tree.
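
To see the instability for ourselves, a quick (purely illustrative) check is to refit the tree after dropping a handful of rows and compare the root split:

# Refit on a slightly smaller training set and compare the first split variable
dtree2 <- rpart(as.factor(label) ~ ., data = dat.train[-(1:50), ], method = 'class')
dtree$frame$var[1]   # root split of the original tree
dtree2$frame$var[1]  # root split of the refit tree (may differ)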

So let’s try something else.

Bagging / Bootstrapping

Now imagine I went up to 20 different magical talking decision trees and asked them what digit I have written on a piece of paper.
Eight of them say the number 7 and the other twelve say it's the number 9. We take the majority as our prediction (for regression we could take the average response, but this is classification!).
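
In R, that vote amounts to tabulating the 20 predictions and keeping the most frequent label; a toy version of the example above:

# 8 trees vote for 7, 12 trees vote for 9
votes <- c(rep(7, 8), rep(9, 12))
table(votes)                    # 7: 8 votes, 9: 12 votes
names(which.max(table(votes)))  # "9" -> the majority prediction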

We don't have 20 different magical talking trees, so we need to make them. This is where bootstrapping comes in: randomly sample with replacement from our current training set to create 20 new pseudo-training sets, and fit a tree to each one. This can be computationally intense and could take a while.

# Grow deep trees: a tiny cp and minsplit mean essentially no early stopping
my.c <- rpart.control(minsplit = 3, cp = 1e-6, xval = 10)
ensemble <- 20
ts <- vector('list', ensemble)
n <- nrow(dat.train)
for (j in 1:ensemble) {
  # Bootstrap sample: draw n row indices with replacement
  ii <- sample(1:n, replace = TRUE)
  ts[[j]] <- rpart(as.factor(label) ~ ., data = dat.train[ii, ], method = 'class',
                   parms = list(split = 'information'), control = my.c)
}

Fitting a tree to each of the 20 pseudo-training sets gives us 20 different decision trees, and aggregating their predictions is what reduces the variance.

It's important to acknowledge that these trees are overfit, and we need to resist the urge to prune, which is the process of simplifying a tree by removing leaves/branches with low predictive power. In the case of bagging, an ensemble of overfit trees gives us a better majority vote.
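
For reference, pruning a single tree would look something like this with rpart's prune(), cutting back to the complexity parameter with the lowest cross-validated error; for bagging we deliberately leave the trees fully grown:

# Prune the first tree back to its best cross-validated cp (not used for bagging)
best.cp <- ts[[1]]$cptable[which.min(ts[[1]]$cptable[, 'xerror']), 'CP']
pruned  <- prune(ts[[1]], cp = best.cp)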

Let’s calculate our prediction error now:

# Collect each tree's predictions on the test set, one column per tree
prs <- list()
for (j in 1:ensemble) {
  pred <- predict(ts[[j]], newdata = dat.test, type = 'class')
  prs[[j]] <- as.matrix(pred)
}
prs.mx <- do.call(cbind, prs)

# Majority vote for each test observation. Note that R's mode() returns the
# storage mode of an object, not the statistical mode, so we tabulate the
# votes and keep the most frequent label instead.
prs.bagg <- c()
for (i in 1:nrow(prs.mx)) {
  votes <- table(prs.mx[i, ])
  prs.bagg <- c(prs.bagg, names(which.max(votes)))
}

bagg.pr.acc <- 1 - sum(as.numeric(prs.bagg) != as.numeric(dat.test$label)) / nrow(dat.test)

#0.8445829 – around 84% success rate! Pretty good but we still have one more thing to try…

Random Forest

Random forests, in my experience, are usually the best option. Similar to bagging, we build an ensemble, but a random forest also attempts to decorrelate the trees by randomly limiting which features can be used at each split. One drawback is that we need a sufficiently large ensemble so that every feature gets a chance to be considered.

library(randomForest)
# randomForest's default mtry for classification is floor(sqrt(p)) features per split
randomf <- randomForest(as.factor(label) ~ ., data = dat.train, ntree = 50)
randomf.pr <- predict(randomf, newdata = dat.test, type = 'class')
randomf.pr.acc <- 1 - sum(as.numeric(levels(randomf.pr))[randomf.pr] != as.numeric(dat.test$label)) / nrow(dat.test)
randomf.pr.acc

#0.93465612 – 93% success rate!! Pretty good, I think we can stop there.
