Random Forest is an ensemble learning algorithm that extends bagging to improve the performance of decision trees. It builds multiple decision trees on bootstrapped samples of the training data, with the added step of random feature selection at each split. The trees' predictions are then aggregated (majority vote for classification, averaging for regression) into a final decision, which improves accuracy and reduces overfitting compared to a single decision tree.
EXAMPLE
Dataset:
| Study Hours | Attendance (Days) | Target (Pass/Fail) |
|---|---|---|
| 3 | 15 | Pass |
| 5 | 20 | Pass |
| 2 | 10 | Fail |
| 4 | 18 | Pass |
Step 1: Bootstrapping
Bootstrapping is a technique where we sample data with replacement: we randomly draw data points until the sample is the same size as the original dataset, so some points may appear multiple times while others are not selected at all.
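As a minimal sketch in Python (the tuple layout and the `bootstrap_sample` helper are illustrative, not from any library), bootstrapping takes only a few lines:

```python
import random

# Toy dataset from the example: (study_hours, attendance, label)
data = [(3, 15, "Pass"), (5, 20, "Pass"), (2, 10, "Fail"), (4, 18, "Pass")]

def bootstrap_sample(dataset):
    """Draw len(dataset) points with replacement (a bootstrap sample)."""
    return [random.choice(dataset) for _ in range(len(dataset))]

sample_1 = bootstrap_sample(data)
sample_2 = bootstrap_sample(data)
print(sample_1)  # e.g. [(5, 20, 'Pass'), (3, 15, 'Pass'), ...]
```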
Bootstrap Sample 1:
We create the first bootstrapped sample by randomly selecting data points (with replacement):
| Study Hours | Attendance (Days) | Target (Pass/Fail) |
|---|---|---|
| 5 | 20 | Pass |
| 3 | 15 | Pass |
| 4 | 18 | Pass |
| 3 | 15 | Pass |
Bootstrap Sample 2:
We create the second bootstrapped sample the same way, by randomly selecting data points again:
| Study Hours | Attendance (Days) | Target (Pass/Fail) |
|---|---|---|
| 2 | 10 | Fail |
| 5 | 20 | Pass |
| 4 | 18 | Pass |
| 2 | 10 | Fail |
Step 2: Building Decision Trees
Now that we have two bootstrapped samples, we’ll build a simple decision tree for each one. For clarity, both trees below split only on Study Hours; in a full random forest, each split would consider a random subset of the features (for example, choosing at random between Study Hours and Attendance).
Decision Tree for Bootstrap Sample 1:
In Bootstrap Sample 1, we have the following data:
| Study Hours | Attendance (Days) | Target (Pass/Fail) |
|---|---|---|
| 5 | 20 | Pass |
| 3 | 15 | Pass |
| 4 | 18 | Pass |
| 3 | 15 | Pass |
Let’s try splitting the data based on Study Hours first:
- If Study Hours ≤ 3:
  - Data: [3, 15], [3, 15] → Pass
  - Prediction: Pass
- If Study Hours > 3:
  - Data: [5, 20], [4, 18] → Pass
  - Prediction: Pass
Since every point in Bootstrap Sample 1 is Pass, both branches predict Pass, so this tree predicts Pass for every input.
Decision Tree for Bootstrap Sample 2:
In Bootstrap Sample 2, we have the following data:
| Study Hours | Attendance (Days) | Target (Pass/Fail) |
|---|---|---|
| 2 | 10 | Fail |
| 5 | 20 | Pass |
| 4 | 18 | Pass |
| 2 | 10 | Fail |
Let’s try splitting the data based on Study Hours first:
- If Study Hours ≤ 3:
  - Data: [2, 10], [2, 10] → Fail
  - Prediction: Fail
- If Study Hours > 3:
  - Data: [5, 20], [4, 18] → Pass
  - Prediction: Pass
So this decision tree splits at Study Hours = 3: it predicts Fail when Study Hours ≤ 3 and Pass when Study Hours > 3.
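Because each tree uses a single split, both reduce to decision stumps. A minimal sketch of the two trees in Python (function names are illustrative):

```python
def tree_1(study_hours, attendance):
    """Stump from Bootstrap Sample 1: every training point was Pass."""
    return "Pass"

def tree_2(study_hours, attendance):
    """Stump from Bootstrap Sample 2: split at Study Hours = 3."""
    return "Fail" if study_hours <= 3 else "Pass"
```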
Step 3: Making Predictions with Each Tree
Now, let’s make predictions for a new data point: Study Hours = 4, Attendance = 18.
- Prediction using Decision Tree 1 (from Bootstrap Sample 1):
  - Study Hours = 4 falls in the Study Hours > 3 branch.
  - Prediction: Pass
- Prediction using Decision Tree 2 (from Bootstrap Sample 2):
  - Study Hours = 4 falls in the Study Hours > 3 branch.
  - Prediction: Pass
Step 4: Aggregating Predictions (Majority Voting)
Finally, we aggregate the predictions from both decision trees using majority voting:
- Decision Tree 1 predicts Pass.
- Decision Tree 2 predicts Pass.
Since both decision trees predict Pass, the final prediction is Pass.
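Continuing the illustrative sketch from above, majority voting over the two stumps can be written as:

```python
from collections import Counter

def forest_predict(trees, study_hours, attendance):
    """Collect each tree's vote; the most common label wins."""
    votes = [tree(study_hours, attendance) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict([tree_1, tree_2], 4, 18))  # -> Pass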
Summary of Steps:
- Bootstrapping: We created two bootstrapped samples by randomly selecting data points with replacement.
- Decision Tree Creation: We built simple decision trees based on Study Hours for both bootstrapped samples.
- Prediction: We made predictions using the decision trees.
- Aggregation: We aggregated the predictions using majority voting to get the final result.
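For comparison, here is the same toy problem using scikit-learn's RandomForestClassifier; the hyperparameter values are illustrative (`max_features=1` forces each split to consider one randomly chosen feature, mirroring the random feature selection described above):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[3, 15], [5, 20], [2, 10], [4, 18]]  # Study Hours, Attendance
y = ["Pass", "Pass", "Fail", "Pass"]

# n_estimators = number of trees; each is trained on its own
# bootstrap sample, and predictions are aggregated by voting.
clf = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)
clf.fit(X, y)
print(clf.predict([[4, 18]]))  # expected: ['Pass']
```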