Decision Tree

Decision Trees

Decision trees are supervised machine learning models that predict a label by asking a sequence of questions about the features of a data point.

Decision Tree Construction

Decision trees are usually constructed from top to bottom. At each node, the feature that best splits the training set labels is selected as the “question” asked at that node. Two different criteria are commonly used to choose a split: Gini impurity and Information Gain. Which one works better depends on the problem.

Decision Trees in scikit-learn


The sklearn.tree module contains the DecisionTreeClassifier class.
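
A minimal usage sketch (the dataset and train/test split are just for illustration; criterion can be set to "gini" or "entropy", matching the split criteria discussed below):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# criterion="gini" is the default; criterion="entropy" uses information gain.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data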



Information Gain in decision trees

When making decision trees, two different methods are commonly used to find the best feature to split a dataset on: Gini impurity and Information Gain. Information Gain is the reduction in entropy obtained by splitting the data on a feature; an intuitive interpretation is that it measures how much information an individual feature provides about the different classes.
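
As a rough illustration, here is a minimal sketch of how entropy and information gain could be computed for a categorical feature (the function names and the toy data are made up for this example):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    # Entropy of the parent set minus the weighted entropy of the
    # subsets obtained by splitting on each distinct feature value.
    total = len(labels)
    weighted_children = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        weighted_children += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted_children

colors = ["red", "red", "blue", "blue", "blue"]
labels = ["spam", "spam", "ham", "ham", "spam"]
print(information_gain(colors, labels))  # higher means a more informative split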

Gini impurity

When making decision trees, calculating the Gini impurity of a set of data helps determine which feature best splits the data. If a set of data has all of the same labels, the Gini impurity of that set is 0 and the set is considered pure. Gini impurity is a statistical measure: it estimates how often a randomly chosen element from the set would be mislabeled if it were labeled at random according to the distribution of actual labels in that subset.
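
A minimal sketch of the calculation (the formula is Gini = 1 - sum of p_i squared, where p_i is the fraction of samples with label i; the function name is made up for this example):

from collections import Counter

def gini_impurity(labels):
    # Gini impurity: 1 - sum(p_i ** 2) over all labels i in the set.
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["A", "A", "A"]))       # 0.0 -> pure set
print(gini_impurity(["A", "A", "B", "B"]))  # 0.5 -> maximally mixed for two classes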

Decision tree leaf creation

When making a decision tree, a leaf node is created when no feature produces any information gain, i.e. when splitting further would not make the subsets any purer.
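
A minimal sketch of that stopping rule, assuming the information gain of every candidate feature has already been computed for the samples at the node (the function name and toy values are made up for illustration):

from collections import Counter

def best_split_or_leaf(gains_by_feature, labels):
    # gains_by_feature: dict mapping each candidate feature to the
    # information gain its split would produce at this node.
    if not gains_by_feature or max(gains_by_feature.values()) <= 0:
        # No feature improves purity: create a leaf that predicts
        # the majority label of the samples at this node.
        return ("leaf", Counter(labels).most_common(1)[0][0])
    best_feature = max(gains_by_feature, key=gains_by_feature.get)
    return ("split", best_feature)

print(best_split_or_leaf({"color": 0.0, "size": 0.0}, ["A", "A", "B"]))  # ('leaf', 'A')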


Decision Tree Representation

In a decision tree, leaves represent class labels, each internal node tests a single feature, and the edges leaving a node represent the possible values (or value ranges) of that feature.
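
In scikit-learn this structure can be inspected with export_text, which prints the feature test at each internal node and the predicted class at each leaf; a rough sketch on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
# Internal nodes test a feature against a threshold; leaves print a class.
print(export_text(clf, feature_names=iris.feature_names))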


Decision tree pruning

Decision trees can become overly complex, which can result in overfitting. A technique called pruning can be used to decrease the size of the tree so that it generalizes better and achieves higher accuracy on a test set. Pruning is not an exact method, since the ideal size of the tree is not known in advance. The technique can be applied bottom-up (starting at the leaves) or top-down (starting at the root).
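
In scikit-learn, one common way to do this is minimal cost-complexity pruning, controlled by the ccp_alpha parameter; a rough sketch (larger alphas prune more aggressively, so the exact numbers depend on the data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# The pruning path gives the sequence of effective alphas for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  test accuracy={clf.score(X_test, y_test):.3f}")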

Decision Tree Limitations

Our current strategy for creating trees is greedy. We assume that the best way to create a tree is to find the feature that will result in the largest information gain right now and split on that feature. A split that looks best locally, however, does not necessarily lead to the best tree overall.

Another problem with our trees is that they can potentially overfit the data. This means that the structure of the tree is too dependent on the training data and doesn’t accurately represent what real-world data looks like.

One way to mitigate this problem is to prune the tree. The goal of pruning is to shrink the size of the tree so that it generalizes better.
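
As a rough illustration of the effect (the dataset here is synthetic and the exact numbers will vary), a fully grown tree typically scores much better on its training data than on held-out data, while a smaller tree narrows that gap:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for depth in (None, 3):  # None grows the tree fully; 3 keeps it small
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}:  train={clf.score(X_train, y_train):.3f}  "
          f"test={clf.score(X_test, y_test):.3f}  leaves={clf.get_n_leaves()}")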
