Decision Tree
Decision Trees
Decision trees are machine learning models that try to find patterns in the features of data points.
Decision Trees Construction
Decision trees are usually constructed from top to bottom. At each level of the tree, the feature that best splits the training set labels is selected as the “question” of that level. Two different criteria are available to split a node, Gini Index and Information Gain; which one works better depends on the problem.
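As a rough sketch of one such level, the snippet below picks the feature that best separates a small categorical dataset, using Gini impurity as the criterion. The helper names (gini, best_feature) and the toy data are illustrative assumptions, not part of any library.

from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_feature(rows, labels):
    """Return the index of the feature whose split yields the lowest
    weighted impurity, i.e. the best "question" for this level."""
    best_idx, best_score = None, float("inf")
    for idx in range(len(rows[0])):
        # Group the labels by the value this feature takes.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[idx], []).append(label)
        # Weighted average impurity of the resulting child nodes.
        score = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx

# Toy data: each row is (outlook, wind); "wind" separates the labels perfectly.
rows = [("sunny", "calm"), ("sunny", "calm"), ("rainy", "windy"), ("rainy", "calm")]
labels = ["play", "play", "stay home", "play"]
print(best_feature(rows, labels))  # 1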
Decision Trees in scikit-learn
The sklearn.tree module contains the DecisionTreeClassifier class.
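A minimal usage sketch, assuming scikit-learn is installed and borrowing its bundled iris dataset purely for illustration. The criterion argument chooses between the two splitting criteria discussed here: "gini" (the default) and "entropy", which corresponds to Information Gain.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" is the default; criterion="entropy" uses Information Gain.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test split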
Information Gain in decision trees
When making decision trees, two different methods are used to find the best feature to split a dataset on: Gini impurity and Information Gain. An intuitive interpretation of Information Gain is that it measures how much information an individual feature gives us about the different classes.
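A common way to make this concrete is to define Information Gain as the entropy of the parent node minus the weighted entropy of the child nodes produced by a split. The small helpers below are an illustrative sketch of that definition, not taken from any particular library.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the weighted entropy of its children."""
    total = len(parent_labels)
    children = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - children

# A split that separates the classes perfectly recovers all the information.
print(information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))  # 1.0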
Gini impurity
When making decision trees, calculating the Gini impurity of a set of data helps determine which feature best splits the data. If a set of data has all of the same labels, the Gini impurity of that set is 0; the set is considered pure. Gini impurity is a statistical measure: the idea behind its definition is to calculate how accurate it would be to assign labels at random, considering the distribution of actual labels in that subset.
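As a quick numerical sketch of that idea (the function below is illustrative, not library code): the Gini impurity of a set is the probability of mislabelling an element if labels were assigned at random according to the set's own label distribution, so a pure set scores 0.

from collections import Counter

def gini_impurity(labels):
    """Chance of mislabelling a random element when labels are drawn
    from the set's own label distribution."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(["cat", "cat", "cat"]))          # 0.0   (pure set)
print(gini_impurity(["cat", "dog", "dog", "bird"]))  # 0.625 (mixed set)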
Decision trees leaf creation
When making a decision tree, a leaf node is created when no features result in any information gain.
Decision Tree Representation
In a decision tree, leaves represent class labels, internal nodes represent a single feature, and the edges of the tree represent possible values of those features.
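scikit-learn can print this structure as text, which makes the feature test at each internal node and the class label at each leaf visible. A short sketch, again using the bundled iris dataset purely for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes show a feature test, branches are its outcomes,
# and each leaf line shows the predicted class.
print(export_text(clf, feature_names=list(iris.feature_names)))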
Decision trees pruning
Decision trees can be overly complex, which can result in overfitting. A technique called pruning can be used to decrease the size of the tree in order to generalize it and increase accuracy on a test set. Pruning is not an exact method, as it is not clear what the ideal size of the tree should be. This technique can be applied bottom-up (starting at the leaves) or top-down (starting at the root).
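In recent scikit-learn versions, one concrete bottom-up variant is cost-complexity pruning: larger values of the ccp_alpha parameter of DecisionTreeClassifier remove more of the tree. The sketch below, using the bundled breast-cancer dataset purely for illustration, compares an unpruned and a pruned tree.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

# The pruned tree has far fewer nodes and often scores better on the test set.
print(full.tree_.node_count, full.score(X_test, y_test))
print(pruned.tree_.node_count, pruned.score(X_test, y_test))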
Decision Tree Limitations
Our current strategy of creating trees is greedy. We assume that the best way to create a tree is to find the feature that will result in the largest information gain right now and split on that feature.
Another problem with our trees is that they can potentially overfit the data. This means that the structure of the tree is too dependent on the training data and doesn't accurately represent what real-world data looks like.
One way to solve this problem is to prune the tree. The goal of pruning is to shrink the size of the tree.
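To see the symptom concretely, an unconstrained tree typically reaches near-perfect accuracy on its own training data while doing noticeably worse on held-out data; limiting max_depth is the simplest way to shrink the tree. The dataset choice below is again only for illustration.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Print train and test accuracy for both trees: a large gap between the two
# numbers on the same line is the usual sign of overfitting.
print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))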