Random Forest

Bagging
Decision tree

Random Forest is an ensemble learning method that improves upon bagging by reducing the correlation between decision trees. It grows many decorrelated trees and combines their predictions by averaging (regression) or by majority vote (classification).

Bagging trees reduces variance but not bias. The idea in random forest is to improve on the variance reduction of bagging by reducing the correlation between trees. This is achieved through random selection of the input variables/features:

1) For each tree: pick a bootstrap sample of the data.
2) For each split: pick a random sample of the features.

The result is a large collection of de-correlated trees whose predictions are averaged.
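
The link between tree correlation and variance can be made concrete with a standard identity. The symbols are introduced here for illustration and are assumptions, not part of the notes: B identically distributed trees T_b, each with variance \sigma^2 and positive pairwise correlation \rho.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term vanishes but the first does not. Under these assumptions, adding more bagged trees alone cannot push the variance below \rho\sigma^2, which is why random forest targets \rho through random feature selection.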

Key Characteristics

Creates a "forest" of multiple decision trees

Built on bagging principles

Generates predictions by averaging the outputs of the decorrelated trees (regression) or taking their majority vote (classification)

Often offers performance comparable to boosting while being simpler to train and tune

When growing each tree, every node considers only a random subset of the features as split candidates, which keeps the trees in the forest diverse.
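
The characteristics above can be sketched in a few lines. This is a minimal illustration, not how any library implements it: the helper names `fit_forest` and `predict_forest`, the synthetic dataset, and the choice of 25 trees are all assumptions made for the example. It relies on scikit-learn's `DecisionTreeClassifier`, whose `max_features` parameter restricts each split to a random subset of features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Hypothetical helper: grow n_trees trees on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n_samples row indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features="sqrt": each split looks at a random subset of features.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Hypothetical helper: average per-tree class probabilities."""
    # Assumes every bootstrap sample contained all classes, so the
    # probability arrays share one shape (true for this balanced toy data).
    proba = np.mean([tree.predict_proba(X) for tree in trees], axis=0)
    return trees[0].classes_[np.argmax(proba, axis=1)]


# Synthetic data, chosen only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
trees = fit_forest(X, y)
train_accuracy = float(np.mean(predict_forest(trees, X) == y))
print(f"training accuracy: {train_accuracy:.3f}")
```

In practice `sklearn.ensemble.RandomForestClassifier` bundles both steps. The manual loop is only meant to make the bootstrap and the per-split feature sampling visible.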

Impurity and Information Gain

The splitting criterion at each node aims to minimize impurity (or maximize homogeneity) in the resulting subsets. When impurity is measured by entropy, the reduction in impurity from the parent node to its children is known in information theory as Information Gain.
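
As a rough illustration of the paragraph above, the sketch below computes two common impurity measures (Gini and entropy) and the impurity decrease of a split. The function names and toy labels are assumptions made for this example, and label lists are assumed to be non-empty.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children.

    With impurity=entropy this quantity is the information gain.
    """
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5
# A split that separates the classes completely.
perfect = impurity_decrease(parent, ["a"] * 5, ["b"] * 5)
# A split that barely changes the class mix.
poor = impurity_decrease(
    parent, ["a", "a", "a", "b", "b"], ["a", "a", "b", "b", "b"]
)
print(perfect, poor)
```

A tree learner would evaluate this quantity for each candidate split and keep the one with the largest decrease.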

Extra-Trees

Extra-Trees (Extremely Randomized Trees) takes randomization a step further. Instead of searching for the optimal threshold for each candidate feature, it draws a random threshold per candidate feature and selects the best among these random splits. This typically makes training faster than a standard Random Forest, which searches for optimal thresholds at each node.
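
The difference in how a threshold is chosen can be sketched as follows. This is a simplified, pure-Python illustration under stated assumptions: the function names and toy data are hypothetical, Gini impurity is used as the score, each feature is assumed to have at least two distinct values, and the per-node sampling of candidate features is left out.

```python
import random


def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_score(values, labels, threshold):
    """Size-weighted Gini impurity of the two children; lower is better."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return float("inf")  # degenerate split, never preferred
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def exhaustive_threshold(values, labels):
    """Random-Forest-style: score every midpoint between distinct values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_score(values, labels, t))


def extra_trees_split(columns, labels, rng):
    """Extra-Trees-style: one uniform random threshold per feature,
    then keep the (feature index, threshold) pair with the best score."""
    draws = [(j, rng.uniform(min(col), max(col))) for j, col in enumerate(columns)]
    return min(draws, key=lambda jt: split_score(columns[jt[0]], labels, jt[1]))


# Toy data: feature 0 separates the classes, feature 1 does not.
columns = [[1, 2, 3, 4, 5, 6], [5, 1, 4, 2, 6, 3]]
labels = [0, 0, 0, 1, 1, 1]

best_t = exhaustive_threshold(columns[0], labels)
feature, random_t = extra_trees_split(columns, labels, random.Random(0))
print(best_t, feature, random_t)
```

The exhaustive search scores every candidate threshold, while the randomized version scores only one per feature, which is where the time saving comes from. In scikit-learn the corresponding estimators are `ExtraTreesClassifier` and `ExtraTreesRegressor`.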

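Feature Importance

A key advantage of Random Forest is its ability to measure feature importance. In Scikit-Learn, this is calculated by measuring how much, on average across all trees in the forest, the nodes that split on a feature reduce impurity. The scores are normalized to sum to 1 and exposed through the fitted model's feature_importances_ attribute.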
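
A short sketch of reading these importances in Scikit-Learn. The iris dataset and the setting of 200 trees are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(iris.data, iris.target)

# feature_importances_ holds one impurity-based score per input feature.
importances = dict(zip(iris.feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the scores are derived from impurity reductions seen during training, they describe how the fitted forest used each feature rather than a causal effect.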