Venturing into the Random Forest

Before we venture into random forests, we must first understand what decision trees are.

A decision tree is a machine learning algorithm that is, in essence, a series of decisions, and it can be used for both regression and classification tasks. (A regression task predicts a continuous numerical value, such as a house price, whereas a classification task assigns an input to one of several predefined categories, for example whether an email is spam or not.)

How decision trees work for a classification task

Now you might already be questioning the machine-learning aspect of decision trees. After all, a decision tree is practically a lot of nested if statements. But what if we didn't know what the conditions in the if statements were, and we wanted the computer to work out the best decisions to make in order to accurately classify new data points? What we can do is train the algorithm to recursively split the data into two groups, choosing the split that leaves each group as pure as possible. The way we quantify impurity here is entropy. Entropy measures the surprise (uncertainty) associated with a random outcome, and we can use it to calculate the information gain of a split: the entropy of the data before the split minus the weighted average entropy of the two groups after the split. After trying every possible split, we pick the one that gives the biggest information gain. Note that the data often needs to be split several times before it is correctly classified.
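To make this concrete, here is a minimal Python sketch of entropy and information gain. The labels below are a made-up toy example, not data from this article.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (expected surprise) of a set of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy of the two groups after it."""
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after

# Toy labels: 1 = belongs to the class, 0 = does not
parent = np.array([1, 1, 1, 0, 0, 0])
left, right = np.array([1, 1, 1]), np.array([0, 0, 0])  # a perfect split
print(information_gain(parent, left, right))  # 1.0 bit gained
```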

As you can see, if we want to classify a red point, we can't draw just one line to correctly separate the two types of points. In this example, at least 4 lines are required to enclose the region where all the red points lie. You can think of these 4 lines as the branches of the decision tree, or as a nested if statement.
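The nested-if view of those four lines looks something like the sketch below. The thresholds here are invented for illustration, since they depend on where the points sit on the plot.

```python
def classify_point(x, y):
    """Hypothetical decision tree for the plot: each comparison is one of the four lines."""
    if 2 < x < 6:          # two vertical boundary lines
        if 1 < y < 5:      # two horizontal boundary lines
            return "red"
    return "blue"

print(classify_point(4, 3))  # inside the boxed region -> "red"
print(classify_point(7, 3))  # outside it -> "blue"
```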

Here is the decision tree for these data points, and you can start to see that the weakness of decision trees lies in their simplicity. As soon as we change a couple of data points, we can end up with a completely different tree. The tree is highly sensitive to changes in the training data, which means that when we test the algorithm it may fail to generalize: decision trees are not flexible, which can give us inaccurate results.

How a random forest works for a classification task

This is where the random forest comes in. Random forests keep the simplicity of the decision tree but add more flexibility, so the model is not so easily thrown off by changes in the training data. First, let's break down what is meant by the two words "Random Forest". In a real data set, we would have hundreds of records, each with several features. The word "random" refers to randomly selecting records, with replacement, to create a bootstrapped dataset. Here is an example dataset:

In this data set, we want to predict, given a new sample, whether a person completed the task or not. If we were to create a bootstrapped data set from this, it would look something like the table below:

Now that we have a bootstrapped data set, we want to build a decision tree from it. To do this, we randomly select a subset of the features and check which of them splits the data best. We don't consider all the features, because we want the decision trees we create to be sufficiently different from each other. Research has shown that picking roughly the square root of the total number of attributes gives the best predictions. So let's take 2 attributes (since the square root of 4 is 2) - hours spent on the task and income. Now let's create a decision tree based on this, trying to classify who completed the task and who did not.
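Here is a rough sketch of those two steps (bootstrapping the records, then picking a random square-root-sized subset of features) in Python. The data is randomly generated stand-in data, not the table above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 8 records, 4 features (e.g. hours on task, income, age, experience)
X = rng.normal(size=(8, 4))
y = rng.integers(0, 2, size=8)               # 1 = completed the task, 0 = did not

# Bootstrapping: sample records with replacement until we have as many as the original set
boot_idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[boot_idx], y[boot_idx]

# Random feature selection: sqrt(4) = 2 features for this tree to split on
n_feat = int(np.sqrt(X.shape[1]))
feat_idx = rng.choice(X.shape[1], size=n_feat, replace=False)
X_subset = X_boot[:, feat_idx]
print(feat_idx, X_subset.shape)              # the chosen columns and (8, 2)
```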

The word “forest” refers to creating multiple decision trees. So, we repeat the process above n times, building a new bootstrapped dataset and a new tree each time, until we have as many trees as we want.
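Put together, training the forest is just a loop. The sketch below uses scikit-learn's DecisionTreeClassifier for the individual trees and randomly generated stand-in data, so treat it as an outline rather than a full implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                # stand-in features
y = rng.integers(0, 2, size=100)             # stand-in labels: completed the task or not

n_trees = 50
forest = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))          # a fresh bootstrapped dataset
    tree = DecisionTreeClassifier(max_features="sqrt")  # random sqrt-sized feature subset per split
    tree.fit(X[idx], y[idx])
    forest.append(tree)
```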

 Please note that these decision trees are purely for illustration purposes and are not the best splits.

Now that we have multiple decision trees, we take a new record for which we don't know whether the person completed the task or not.

Using the decision trees, we feed in the new record's values and look at what each tree outputs. The final output is the mode of these outputs: if the trees output 1 more often than they output 0, the final answer is 1, and vice versa.
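As a quick sketch, the majority vote can be written like this (the five votes below are made up):

```python
import numpy as np

def aggregate(votes):
    """Return the majority vote (the mode) of the individual trees' outputs."""
    votes = np.asarray(votes)
    return np.bincount(votes).argmax()

print(aggregate([1, 0, 1, 1, 0]))  # -> 1, because 1 was predicted more often than 0
```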

This process is called aggregation, and the combination of bootstrapping the data and then aggregating the outputs to reach a final result is called “bagging”. Now we have created the random forest algorithm! We can use the same method for a regression task by taking the mean of the trees' outputs rather than the mode.
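In practice, libraries bundle all of this up for us. For instance, scikit-learn's ready-made estimators behave roughly as described above (majority vote for classification, mean for regression); the data here is again randomly generated stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
new_record = rng.normal(size=(1, 4))

# Classification: the forest aggregates the trees' votes by majority (mode)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
clf.fit(X, rng.integers(0, 2, size=100))
print(clf.predict(new_record))

# Regression: the forest averages the trees' numeric predictions (mean)
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X, rng.normal(size=100))
print(reg.predict(new_record))
```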

A potential downside of random feature selection is that some trees may end up splitting on features that are not very important to the final result, giving those trees inaccurate predictions. However, because we build many trees on many different feature subsets, the weaker trees tend to be outvoted during aggregation, so in general this does not spoil the final output.

Evaluating our algorithm

Before we release our algorithm into the wild, we need a way to measure its accuracy, to see whether it is a worthwhile tool to implement or whether we should use a different algorithm. To do this, we take the records that were never selected for a given bootstrapped dataset (typically about 1/3 of the original dataset, known as the out-of-bag records) and run each one through the trees that did not use it during training, aggregating their outputs to get a final answer. The accuracy is then the proportion of out-of-bag records that are classified correctly. We can try to optimize the algorithm by changing some of its settings, for example selecting a different number of features, or increasing the number of bootstrapped datasets we create, and seeing how that affects the accuracy.
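scikit-learn can compute this out-of-bag score for us. The sketch below again uses randomly generated stand-in data, so the printed score itself is meaningless; the point is the oob_score option.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # stand-in features
y = rng.integers(0, 2, size=200)     # stand-in labels

# oob_score=True scores each record only with the trees that never saw it
# in their bootstrapped dataset, so no separate test set is needed.
forest = RandomForestClassifier(n_estimators=200, oob_score=True)
forest.fit(X, y)
print(forest.oob_score_)             # fraction of out-of-bag records classified correctly
```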

We have now made it out of the random forest!

In conclusion, random forests are one of the most powerful machine learning algorithms for classification tasks. They are less widely used for regression tasks, because working with continuous targets makes evaluating the algorithm harder: we have to allow a margin of error around the mean produced during aggregation. I also skimmed over the maths of building a decision tree, but I can go over that in a future article. I hope you enjoyed the journey…