Random Forest
- A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
- Random forests are used to avoid overfitting. Because the classifications of many trees are aggregated, a few overfitted trees have much less impact on the final prediction.
- Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as bagging.
- When creating a tree in a random forest, a randomly selected subset of features is considered as candidates for the best splitting feature. If your dataset has n features, it is common practice to randomly select the square root of n features.
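To make both sources of randomness concrete, here is a minimal sketch using scikit-learn's RandomForestClassifier; the synthetic dataset and the parameter values are illustrative assumptions, not something from the original post.

```python
# Minimal random-forest sketch: each tree is trained on a bootstrap sample
# (rows drawn with replacement) and, at every split, only a random subset of
# features (sqrt of the total) is considered as candidates.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees to aggregate
    max_features="sqrt",  # consider sqrt(n_features) candidates at each split
    bootstrap=True,       # sample training rows with replacement (bagging)
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```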
Boosting Steps:
- Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1
- Draw a second random training subset d2 without replacement from the training set, add 50 percent of the samples that were previously misclassified by C1, and train a weak learner C2
- Find the training samples d3 in the training set D on which C1 and C2 disagree to train a third weak learner C3
- Combine all the weak learners via majority voting.
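A rough sketch of this three-learner procedure in Python, assuming a synthetic dataset, decision stumps as the weak learners, and illustrative subset sizes:

```python
# Sketch of the boosting procedure described above, using decision stumps
# as weak learners. Sample sizes and the handling of edge cases are
# simplified, illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

def stump():
    return DecisionTreeClassifier(max_depth=1, random_state=0)

# Step 1: train C1 on a random subset d1 drawn without replacement.
idx1 = rng.choice(len(X), size=300, replace=False)
c1 = stump().fit(X[idx1], y[idx1])

# Step 2: draw a second subset d2, add half of the samples C1 misclassified,
# and train C2 on the result.
idx2 = rng.choice(len(X), size=300, replace=False)
mis = np.where(c1.predict(X) != y)[0]
half_mis = rng.choice(mis, size=len(mis) // 2, replace=False)
c2 = stump().fit(X[np.concatenate([idx2, half_mis])],
                 y[np.concatenate([idx2, half_mis])])

# Step 3: train C3 on the samples where C1 and C2 disagree.
disagree = np.where(c1.predict(X) != c2.predict(X))[0]
if len(disagree) == 0:   # degenerate case in this toy sketch
    disagree = idx1
c3 = stump().fit(X[disagree], y[disagree])

# Step 4: combine the three weak learners by majority vote (labels are 0/1).
votes = np.stack([c.predict(X) for c in (c1, c2, c3)])
ensemble_pred = np.round(votes.mean(axis=0)).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())
```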
Bagging:
Before understanding bagging, let's understand the concept of the bootstrap, which is nothing but choosing a random sample with replacement. As the name suggests, bagging is nothing but Bootstrap AGGregatING:
- Generate n different bootstrap training samples
- Train the algorithm on each bootstrapped sample separately
- Average the predictions at the end
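A minimal sketch of these three steps, assuming a synthetic dataset and decision trees as the base algorithm:

```python
# Bagging sketch: bootstrap n training samples, train a tree on each, then
# aggregate the predictions by majority vote. Dataset, number of bootstrap
# samples, and base learner are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_estimators = 25
models = []
for _ in range(n_estimators):
    # 1. Bootstrap: sample row indices with replacement.
    idx = rng.choice(len(X), size=len(X), replace=True)
    # 2. Train the algorithm on each bootstrapped sample separately.
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# 3. Aggregate the predictions at the end (majority vote for 0/1 labels).
preds = np.stack([m.predict(X) for m in models])
bagged_pred = np.round(preds.mean(axis=0)).astype(int)
print("bagged training accuracy:", (bagged_pred == y).mean())
```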
One of the key differences is how each training set is sampled: bagging samples with replacement (the bootstrap), whereas the boosting procedure above samples without replacement.
In theory, bagging is good for reducing variance (overfitting), whereas boosting helps to reduce both bias and variance, as per the Boosting vs. Bagging comparison; in practice, however, boosting (Adaptive Boosting) is known to have high variance because it can overfit.