Julio C. Gonzalez
Dec 16, 2017

With so many statistical methods and new tweaks coming out periodically, it can feel a little overwhelming when choosing one to conduct your analysis.

Random forests are considered a state-of-the-art statistical method that, despite being fairly new, is already quite popular. I still remember Dr. Ko, my stats professor in grad school, saying that this was his favorite machine learning method, and now I can see why. While there isn’t a single method that will dominate all others, random forests are a great place to start. My goal is to make you, the reader, a “random forest tree-hugger” too.


Trees

In order to explain how a whole random forest works, we must first look into how a single decision tree is created.

For both regression and classification problems, there are two basic steps:

  1. we divide the predictor space (i.e. all the possible values for \(X_1\), \(X_2\),…., \(X_p\)) into J distinct non-overlapping regions, \(R_1\), \(R_2\),…., \(R_J\). Don’t be thrown off when you see someone say predictor or feature; these are just other names for variables.

  2. for every observation that falls into the region \(R_j\), we make the same prediction: the majority class for a classification tree (if two classes are tied in region \(R_j\), one is selected at random) or the average of the responses for a regression tree.

I included the underlying math for the more mathematically inclined people like us who like to be assured that someone didn’t just pull this from thin air.

This leads to the natural question: which variable \(X_j\) do we split on, and at what cutpoint s, to create the regions? The answer is that we choose the split that results in the lowest Residual Sum of Squares (RSS) on the training data. More formally stated, we choose j and s defining \[R_1(j,s) = \{ X | X_j < s\}, \quad R_2(j,s) = \{ X | X_j \geq s\}\] so as to minimize the following equation: \[\sum_{i:\, x_i \in R_1(j,s)}(y_i - \hat{y}_{R_1} )^2 \quad+ \sum_{i:\, x_i \in R_2(j,s)}(y_i - \hat{y}_{R_2} )^2\]

All this boils down to an optimization problem that, with the help of a computer, can be solved without you breaking a sweat. For any economics fans out there, solving constrained optimization problems should hit home, since the discipline has been doing it for years.

Don’t let all this math mumbo jumbo scare you. Once we go through this together, you’ll see this big bad equation turns into a piece of cake (literally!). The notation \(\{ X | X_j < s\}\) is read as the region of predictor space in which \(X_j\) takes on a value less than s; we assign this region to be \(R_1\). Basically, imagine you have a cake that represents all the possible values for two variables: the width is one variable while the height is another. When you cut the cake anywhere along its length, you inherently end up with two pieces of cake. Same thing here: you cut the predictor space at s, and all the possible values of \(X_j\) that are less than s are now classified as \(R_1\). You do the same thing for \(\{ X | X_j \geq s\}\), only now we want all the values greater than or equal to s. How we determine s comes from the equation we are trying to minimize. The first part of this equation (i.e. \(\sum_{i:\, x_i \in R_1(j,s)}(y_i - \hat{y}_{R_1} )^2\)) sums the squared differences between the observed \(y_i\) and the predicted \(\hat{y}_{R_1}\) for the observations found in region \(R_1\). The difference between these two values is known as a residual, which is why this quantity is called the Residual Sum of Squares: we are adding up all the squared residuals in the first region. We want to minimize this value, since a low RSS indicates the model performs well while a high RSS implies a poor fit. Simultaneously, we look for a low RSS in \(R_2\) too, and the split that minimizes the combined total is the one we keep.
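To make this concrete, here is a minimal sketch in R of the split search for a single numeric predictor. The x and y vectors are made up purely for illustration, and a real tree algorithm would repeat this search over every predictor and then recursively within each resulting region.

# Find the split point s for one predictor x that minimizes RSS(R1) + RSS(R2).
# The x and y values below are simulated just for this example.
set.seed(1)
x <- runif(50, 0, 10)
y <- ifelse(x < 4, 5, 12) + rnorm(50)

rss_for_split <- function(s) {
  left  <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# candidate cutpoints: midpoints between consecutive sorted x values
xs <- sort(unique(x))
candidates <- head(xs, -1) + diff(xs) / 2
rss <- sapply(candidates, rss_for_split)
candidates[which.min(rss)]   # best s, which should land near the true break at x = 4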


Super Mario Example

Remember back to when you were a kid (or adult, we don’t judge) and you would compete with your friends to see who could get the highest score on Super Mario Bros. Well, your score at the end of the game is determined by a number of variables, like how fast you beat the level, how many enemies you beat or how many powerups you got. Say we are able to go back to those years and record all the times you played Mario. However, we were only able to capture the number of coins you collected, the number of enemies you beat and the score you got for that particular level. With this data at hand, fast forward to present day, where you want to predict your score given only these two variables.

Good ol’ Mario

One way to do this is by making a decision tree. We grab this data, put it through the algorithm and it outputs the following image.

Super Mario Tree

The nice thing about trees is that they have an intuitive visual representation and they closely follow how the human mind makes decisions (personally, I would’ve called them roots instead of trees, because the diagrams look more like real roots than an upside-down tree, but that’s neither here nor there).

The way you interpret a tree is by starting at the top branch where the first split occurs. If the condition given is true, you follow the left branch; otherwise you go right. Going back to the Mario example, the top split shows that, without knowing anything else, when you beat fewer than 8 enemies in the level your predicted score is 4983. However, when you beat 8 or more bad guys, your predicted score now depends on the number of coins you accumulated during the level. If you beat 8 or more enemies and got fewer than 23 coins, the algorithm predicts your score to be 5729, which is the left branch of the lower split. Finally, if you beat 8 or more enemies and get 23 or more coins, then your score is predicted to be 6250.

Predictor Space Divided by Tree

Another way to see how this tree works is by looking at the predictor space. If we apply our cake explanation from earlier, this is how it would look. This is how we choose which observations go into their respective regions, or more formally written: \(R_1 = \{ X | Enemies < 8\}\), \(R_2 = \{ X | Enemies \geq 8, Coins < 23\}\), \(R_3 = \{ X | Enemies \geq 8, Coins \geq 23\}\).
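If you want to grow a tree like this yourself, here is a hedged sketch using the rpart package. The Mario data set from the story isn’t actually available, so the Enemies, Coins and Score values below are simulated to mimic the structure in the figure.

# Fit a regression tree to simulated "Mario" data (assumes rpart is installed).
library(rpart)

set.seed(42)
n <- 200
mario <- data.frame(Enemies = sample(0:15, n, replace = TRUE),
                    Coins   = sample(0:40, n, replace = TRUE))
mario$Score <- with(mario, ifelse(Enemies < 8, 4983,
                           ifelse(Coins < 23, 5729, 6250))) + rnorm(n, sd = 100)

mario_tree <- rpart(Score ~ Enemies + Coins, data = mario)
mario_tree                     # printed splits should roughly recover Enemies < 8 and Coins < 23
plot(mario_tree); text(mario_tree)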


Classification Tree

Another reason why I love decision trees is that they can handle categorical variables. These are also sometimes called qualitative variables, and there are a ton of other names floating around: in mathematics we call them discrete values, and in computer science they’re called enumerated types or factors. They all refer to the same thing. Usual examples include male vs female, up vs down or Democrat vs Republican, and each of these categories is called a level or class. Classification trees work similarly to regression trees but use a slightly different formula under the hood. In practice, two different measures are used. The first is called the Gini index and it is defined by: \[G = \sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk}), \quad \mathrm{where} \quad \hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m}I(y_i = k).\]

It measures the total variance across the K classes. You must be wondering what on God’s green earth \(\hat{p}_{mk}\) means. In plain English, it is the proportion of observations in the mth region that belong to the kth class. The capital \(I\) is called an indicator function: if the condition inside is met, it outputs a 1; if not, it returns a 0. The \(\sum_{x_i \in R_m}\) in front of the indicator function goes over all the observations in a particular region and adds up the 1s, and we then divide by the total number of data points in the region \((\frac{1}{N_m})\), which is how we commonly compute proportions in the real world. If 100% of the observations in region m belong to class k, then the mth node is “pure”, i.e. \(\hat{p}_{mk}\) is equal to 1. When that happens, the corresponding term goes to zero, since \(\hat{p}_{mk}(1-\hat{p}_{mk})\) reduces to 1(1-1) = 1(0) = 0; the same thing happens when \(\hat{p}_{mk}\) equals 0. I know it can get a little confusing, but try to remember that our main goal is for \(\hat{p}_{mk}(1-\hat{p}_{mk})\) to be as small as possible, and that happens when \(\hat{p}_{mk}\) is as close to 0 or 1 as possible, i.e. when the node is nearly pure.

The second measure commonly used is entropy, given by \[D = -\sum_{k=1}^{K}\hat{p}_{mk}\,\mathrm{log}\,(\hat{p}_{mk}).\] Similarly to the Gini index, we want the entropy to be small, and that happens when \(\hat{p}_{mk}\) takes a value close to 1 or 0. For those of you who have taken calculus, it is fairly obvious that \(\hat{p}_{mk}\,\mathrm{log}\,(\hat{p}_{mk})\) will be zero when \(\hat{p}_{mk}\) equals 0, since 0 log(0) is taken to be 0 because \(\lim_{x\to0^+}x\,\mathrm{log}(x) = 0\). By the same token, when \(\hat{p}_{mk}\) is equal to 1, \(\hat{p}_{mk}\,\mathrm{log}\,(\hat{p}_{mk})\) turns to 0 as 1 log(1) = 1(0) = 0. Just as with the Gini index, you sum these terms over the K classes for each region.
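As a quick sanity check on both formulas, here is a short R snippet for a two-class node that traces how the Gini index and entropy behave as the class proportion p moves from fully mixed to pure.

# Impurity of a two-class node as a function of the proportion p of class 1.
p <- seq(0.01, 0.99, by = 0.01)

gini    <- p * (1 - p) + (1 - p) * p              # sum of p_k(1 - p_k) over both classes
entropy <- -(p * log(p) + (1 - p) * log(1 - p))   # -sum of p_k log(p_k)

# Both peak at p = 0.5 (a fully mixed node) and shrink toward 0 as p nears 0 or 1 (a pure node).
plot(p, entropy, type = "l", ylab = "impurity")
lines(p, gini, lty = 2)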

I know you must be thinking which one is better; it turns out that entropy and the Gini index are quite similar numerically.

If you’re still a little skeptical to jump on the trees bandwagon, let’s see what the pros have to say about trees and their place in the Major Leagues of statistical methods. Let’s take a quick look at a table from the distinguished and renowned stats book “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani and Jerome Friedman, world-class professors of statistics at Stanford University.

We can see that trees dominate in the majority of categories when compared to other statistical methods.

In addition, here is a paper written at Cornell (Caruana and Niculescu-Mizil, 2006) that does an empirical comparison of supervised learning algorithms across different datasets and finds random forests and boosting (another tree-based method) to have the best performance overall.


Variance vs Bias

Decision trees sound like the best thing since well… Super Mario Bros. Unfortunately, the downside of using a single tree as a predictive model is that trees suffer from high variance while also being susceptible to bias. When I first started studying stats, I would hear people throw the words variance and bias around a lot, but I didn’t fully understand what they meant, even though I was nodding my head calmly while looking dead center at the professor’s eyes and actually thinking about what I would be having for lunch. If that sounds familiar, well, it’s a good thing you have the Stats Whisperer in your corner to help out.

A model that has high variance means that small changes in the data result in very different model outputs. What does this mean in the context of trees? Say I am using a decision tree for a data set that has many variables. I run the algorithm and it outputs a tree with 7 branches. Now suppose I remove one observation from the data and rerun the algorithm, only to get 3 branches this time. That’s high variance. The models produced vary too much to give robust, reliable predictions.
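Here is a small, hedged illustration of that sensitivity using simulated data and the rpart package; the exact splits you get will depend on the seed, but the point is that a single tree can change noticeably when just one observation is dropped.

# Refit a single tree after removing one observation and compare the splits.
library(rpart)

set.seed(7)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- d$x1 + d$x2 + rnorm(60)

tree_full  <- rpart(y ~ ., data = d)
tree_drop1 <- rpart(y ~ ., data = d[-1, ])   # same data with the first row removed

tree_full     # compare the printed split structure of the two trees
tree_drop1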

Bias refers to the error you get when you try to approximate the true underlying relationship with a model that is too simple to capture it. Generally, more complicated models trade high variance for low bias, while simpler models tend to have low variance but suffer from high bias. Our goal as analysts is to keep both variance and bias as low as possible.

Another term related to variance is something called “overfitting”. You can make a model that is super complicated just so you can get high prediction rates on the data you currently have to train said model, but that doesn’t mean it’ll perform well when you introduce new data. To combat this issue, it is common practice to divide your data into a training set which, as the name suggests, is used to train your model, and a test set on which your model is validated.

There is, however, a dark side to trees. The prediction accuracy of a single decision tree is not on par with other statistical methods, and prediction accuracy matters. However, not all hope is lost. The words of the young and great American poet Mattie Stepanek, “unity is strength… when there is teamwork and collaboration, wonderful things can be achieved”, inspired me to write my own quote:

“If a tree falls in a forest, I don’t know if it makes a sound or not, but if you want your tree back, just grow a new one from a bootstrap sample.” - The Stats Whisperer


Bootstrap

In order to reduce variance, random forests utilize something called bootstrap samples to build their decision trees. Bootstrapping is a very powerful tool in statistics with a very simple idea. A bootstrap sample is simply a sample of the same size as the original data, drawn at random with replacement, meaning that each observation selected is put back into the original pool and can therefore appear multiple times in a single sample. Basically, bootstrapping is like getting new data from the population, only you use the sample you currently have as if it were that population. The advantage is that when you average all the trees built from these resampled data sets, you reduce the high variance individual trees suffer from without increasing bias. This is called bootstrap aggregation (which is a mouthful, so instead we just call it bagging).

3 Bootstrap Samples

This figure shows how the process of bootstrapping works. From the original data A, we create three different bootstrap samples. Notice how the bottom bootstrap sample \(\mathrm{A^{*3}}\) contains the observation numbered 1 twice because of the sampling with replacement. In contrast, bootstrap sample \(\mathrm{A^{*2}}\) does not contain any duplicate observations.
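Drawing a bootstrap sample is a one-liner in R; this tiny sketch shows the duplicates and the left-out (“out-of-bag”) observations the figure is describing.

# One bootstrap sample of a data set with 10 observations.
set.seed(1)
original <- 1:10
boot_idx <- sample(original, size = length(original), replace = TRUE)

sort(boot_idx)                  # some observations appear more than once
setdiff(original, boot_idx)     # observations left out of this bootstrap sample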


Random Forests

The beauty of random forests comes from a small but genius tweak to bagging: when building trees of all shapes and sizes from the bootstrap samples, the algorithm considers only a randomly selected subset of the available variables (hence the name random forests). Forcing the model to choose from only a subset of size \(m\) of the original variables “decorrelates” the trees, reducing the model’s variance even further. A natural question is how to determine the size of this subset. The truth is you can try different sizes, but the industry has found that setting \(m\) to roughly the square root of the total number of variables (or a slight variation of that) usually tends to give the best results.

Keeping with the Mario example, imagine Mario going to every tree in the forest and recording its predicted score. Then he averages those scores to get a final prediction. That is exactly what the random forest algorithm does for a regression problem (for classification, the trees vote and the majority wins).
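For the skeptics, you can peek at the individual tree predictions and verify the averaging yourself. This sketch reuses the simulated mario data frame from the earlier rpart example and assumes the randomForest package is installed; predict() with predict.all = TRUE returns the per-tree predictions alongside the aggregated one.

# The forest's regression prediction is just the mean of its trees' predictions.
library(randomForest)

rf_mario <- randomForest(Score ~ Enemies + Coins, data = mario, ntree = 100)

pred <- predict(rf_mario, newdata = mario[1, ], predict.all = TRUE)
pred$aggregate          # the forest's prediction for the first observation
mean(pred$individual)   # average of the 100 per-tree predictions (matches the aggregate)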

You might be thinking, “well, cool story bro, but how many trees do I need to grow to make a random forest?” Here’s the kicker: you can grow trees to your heart’s content and it will never hurt you. What I mean is that you can’t overfit the data by growing too many trees. You’ll simply get to a point where the variance won’t decrease any further.


OOB Error Rate

Another big advantage of random forests is the fact that you can compute something called the “out-of-bag” (OOB) error rate. Recall we used bootstrap samples to build the trees that led to the inception of a forest; think of the bootstrap samples as the soil where the forest is grown. Each bootstrap sample leaves roughly 1/3 (more precisely, about \(\frac{1}{e}\)) of the original observations out of the bag. What is neat about this is that you can “recycle” those unused observations to test the performance of the model, just as if the OOB observations were your test set. At first glance you won’t see the fun in that, but you’ll come to appreciate it when you come across large data sets.
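If you want to convince yourself of that 1/3 figure, the probability that a given observation is left out of a single bootstrap sample is \((1 - 1/n)^n\), which heads toward \(1/e \approx 0.368\) as the sample size grows. A quick check:

# Probability that a given observation never gets picked in a bootstrap sample of size n.
n <- c(10, 100, 1000, 10000)
cbind(n, prob_out_of_bag = (1 - 1/n)^n, one_over_e = exp(-1))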


SMOTE

When dealing with categorical response variables, it is all too often the case that you have an unequal number of observations in each class. The performance of statistical methods is compromised when the classes are unequally represented. To remedy this, I would like to introduce the Synthetic Minority Over-sampling Technique (Chawla et al., 2002), also known as SMOTE, so that our algorithm can learn in a less biased way. While SMOTE doesn’t fall under the random forests family per se, it can improve prediction accuracy when deploying random forests or any other supervised machine learning method.

What SMOTE does is create “synthetic” observations from the minority class (the group with the fewest observations) to over-sample it. Each synthetic observation is generated by taking the difference between a minority-class feature vector and one of its nearest neighbors, multiplying this difference by a random number between 0 and 1, and adding the result back to the original feature vector. The creation of these new “synthetic” observations causes the classifier to form larger and less specific decision regions, which are now learned from the minority-class samples and help random forests generalize better.
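Here is a bare-bones sketch of how a single SMOTE-style synthetic observation is formed (the real algorithm from Chawla et al. repeats this with k nearest neighbors across the whole minority class); the two feature vectors below are made up for illustration.

# Interpolate one synthetic point between a minority observation and a neighbor.
set.seed(3)
minority_obs <- c(Speed = 80, Defense = 100)   # a made-up minority-class observation
neighbor     <- c(Speed = 95, Defense = 120)   # one of its nearest minority neighbors

gap       <- runif(1)                              # random number between 0 and 1
synthetic <- minority_obs + gap * (neighbor - minority_obs)
synthetic                                          # new point on the segment between the two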


Applied

My personal preference for statistical analysis is R, but you can do random forests in a ton of other statistical software as well. R has its pros (this web page was built using R Markdown!) and cons, but I really enjoy using it. A wise Columbia professor once said: “the best thing about R is that it was written by statisticians, the worst thing about R is that it was written by statisticians” (you might’ve guessed by now that I’m a big quotes guy).

I’ve gone through a whole lot of stats books to look at their examples for random forests and SMOTE but found most of them to be unmemorable and mundane. In order to spice things up, I found a data set on Kaggle (here) that has information on the attributes of 800 pokemon (I don’t know how much spice Pokemon adds, but it’s worth the good ol’ try). For those of you who are not familiar with the Pokemon world, there are these fictitious creatures with special powers out in the wild, and your goal is to catch them, train them and battle them against other trainers. I know it sounds very similar to animal cruelty, but it’s more like adopting a stray dog and training it to get to the World Cup of dog competitions while also becoming best friends in the process. These pokemon have attributes such as speed, health, type and defense, as shown below. There are a few pokemon that are considered “legendary”. While the exact criteria for a pokemon to qualify as legendary are vague, they are very rare and unique, few people have seen them, and they have been around since ancient times (think the Loch Ness monster). Our goal today is to see if we can predict whether a pokemon is legendary given only its attributes.

# packages assumed here: randomForest, DMwR (for SMOTE), ggplot2 and ggthemes (for theme_hc)
library(randomForest); library(DMwR); library(ggplot2); library(ggthemes)
pokemon <- read.csv("~/Documents/website/pokemon.csv")
pokemon <- pokemon[-c(1,2)]                           # drop the pokemon number and name columns
pokemon$Generation = as.factor(pokemon$Generation)    # treat Generation as categorical

Before we do anything, we want to import our data and set it up for analysis. We remove the first column with the pokemon number (X.) and the pokemon name column, since these variables have no usefulness in terms of prediction. We also convert the generation variable to a factor so the algorithms treat it as a category rather than a number, since there is no such thing as a pokemon from generation 1.5.

set.seed(123)
train <- sample(1:nrow(pokemon), nrow(pokemon)*(2/3), rep=FALSE)
test <- -train

training_data = pokemon[train,]
testing_data = pokemon[test,]

Next we must create our training and test sets. Here we randomly assign 2/3 of the data to the training set and the rest to the test set.

RF Test

nvar <- dim(pokemon)[2] - 1  #number of predictors = 9

# Grow forests of 1 to 100 trees and record the test classification rate
# for legendary pokemon at each forest size.
class_rate = NULL
for(i in 1:100){
  set.seed(14)
  rf.pokemon=randomForest(Legendary~., data=training_data,
                          mtry=sqrt(nvar), ntree = i)
  
  yhat.rf=predict(rf.pokemon, newdata=testing_data,type="class")
  pokemon.test= testing_data$Legendary
  
  mean(pokemon.test!=yhat.rf)       # overall test error rate (not stored)
  table(pokemon.test, yhat.rf)      # confusion matrix (not stored)
  
  #Legendary Classification Test Rate (True Positives / actual legendaries)
  class_rate[i]=table(pokemon.test,yhat.rf)[4]/table(yhat.rf=pokemon.test)[2]
}
max_class_rate = max(class_rate)
print(max_class_rate)
## [1] 0.5833333

Remember that we can make our random forest contain as many trees as we want, but in order to find the number of trees that gives the highest classification rate for legendary pokemon, we loop from 1 tree all the way through 100 trees and record the classification rate for each forest.

Now the moment of truth is here. After going through 100 forests and keeping the one with the highest classification rate, we want to test how well it classifies data it has not seen. Remember, we want to predict whether a pokemon is legendary given all of its attributes; in certain circles this is called the true positive rate. We found our random forest to predict (drum roll please…….) a meager 58.3 percent. What gives? That’s barely better than a coin toss. After all that reading on how random forests are so awesome and great, they failed to step up to the plate. Was the stage too big and the lights too bright for random forests? By now, you should know that the Stats Whisperer won’t throw you to the wolves and leave you out in the cold. First, let’s take a closer look at the data.

summary(pokemon$Legendary)
## False  True 
##   735    65
prop.table(table(pokemon$Legendary))
## 
##   False    True 
## 0.91875 0.08125
summary(training_data$Legendary)
## False  True 
##   492    41
prop.table(table(training_data$Legendary))
## 
##      False       True 
## 0.92307692 0.07692308
summary(testing_data$Legendary)
## False  True 
##   243    24
prop.table(table(testing_data$Legendary))
## 
##      False       True 
## 0.91011236 0.08988764

In the entire data set, we only find 65 pokemon to be legendary, meaning that they make up only about 8.1% of the 800 pokemon. If we dig deeper, we see that the training set also only has 7.6% (41 out of 533) legendary pokemon. The test set suffers the same fate, with only 8.9% (24 out of 267). While random forests are powerful, the number of legendary pokemon is too small for the algorithm to develop good prediction accuracy. Since there aren’t any more legendary pokemon we can feed into the algorithm, we’ll just have to intelligently create them ourselves (no pokemon were harmed in the making of new data points).

RF Test w/SMOTE

set.seed(123)
trainSMOTE <- SMOTE(Legendary~ ., training_data, perc.over = 100, perc.under=200)
summary(trainSMOTE$Legendary)
## False  True 
##    82    82
prop.table(table(trainSMOTE$Legendary))
## 
## False  True 
##   0.5   0.5

Remember our old pal SMOTE? We leave the test set as is but put the training set through the SMOTE algorithm to create an equally balanced data set between legendary and non-legendary pokemon. SMOTE synthetically created 41 additional pokemon from the original 41 for a total of 82 legendary observations while randomly selecting 82 non-legendary pokemon from the already available pokemon.

class_rate.SMOTE = NULL
for(i in 1:100){
  set.seed(14)
  rf.pokemon.SMOTE = randomForest(Legendary~., data=trainSMOTE,
                                  mtry=sqrt(nvar), ntree = i)
  
  yhat.rf.SMOTE = predict(rf.pokemon.SMOTE, newdata=testing_data,type="class")
  pokemon.test = testing_data$Legendary
  
  mean(pokemon.test!=yhat.rf.SMOTE)
  table(pokemon.test, yhat.rf.SMOTE)
  
  #Legendary Classification Rate (True Positives)
  class_rate.SMOTE[i]=table(pokemon.test, yhat.rf.SMOTE)[4]/table(yhat.rf.SMOTE=pokemon.test)[2]
}
max_class_rate.SMOTE = max(class_rate.SMOTE)
print(max_class_rate.SMOTE)
## [1] 0.9166667

In order to do an apples-to-apples comparison, we repeat the same loop over forest sizes, training on the SMOTE data while validating on the exact same test set used before. We arrive at the moment of truth where we finally get to see whether these new synthetic observations are the real deal or just a lousy scam. The random forest built with the SMOTE data set has a legendary pokemon classification test rate of (……..this is the final drumroll, I promise) 91.6%! SMOTE came to the rescue: we increased our classification rate by a staggering 33.3 percentage points (a relative improvement of 57.1%)!

x = data.frame(class_rate.SMOTE)
x$NumTrees = seq.int(nrow(x))

# color is set outside aes() so the lines and points are actually drawn in red
ggplot(x,aes(NumTrees,class_rate.SMOTE)) + 
  geom_line(color = "red") + 
  geom_point(color = "red") + 
  xlab("Number of Trees") + 
  ylab("Legendary Pokemon Classification Rate") + 
  theme_hc(bgcolor = "darkunica") + 
  theme(axis.text.x = element_text(size=10,color="white"),
        axis.text.y = element_text(size=10,color="white"),
        axis.title.x = element_text(size=15,color="white"),
        axis.title.y = element_text(size=15,color="white"),
        legend.position = "none")

An important selling point highlighted earlier was that growing more and more trees won’t hurt you. Here we plot the legendary classification rate against the number of trees in each forest to support this claim. We can see the classification rate levels out after about 20 trees and does not decrease as more trees are added.


Whether you’re a die-hard data science/machine learning/statistics fan or just a big nerd like me who’ll read anything that has Pokemon or Super Mario in it, I hope you found this helpful while enjoying the ride. Thanks for reading.


References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  • Caruana, R., & Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), 161–168. https://doi.org/10.1145/1143844.1143865
  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY: Springer.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (Vol. 103). New York, NY: Springer. ISBN: 978-1-4614-7137-0
