@ibelyakov

The issue occurs when the tree contains a “pure” node (impurity* = 0). We calculate impurity only for child nodes, not for the current node, and we do not check whether the node is already “pure”, that is, contains just one label. As a result, the “bestSplit” calculation runs on an already “pure” node and decides that all items should go to the left child and none to the right (leaf) child, which produces 2 “pure” child nodes. Because impurity is not calculated for the current (parent) node, the check parentNode.getImpurity() - split.get().getImpurity() > minImpurityDelta is always true, and we keep splitting the already “pure” node until the maximum tree depth is reached.
The following changes were made to resolve the issue:

  1. A gain** calculation and a gain check for the split were added.
  2. A node impurity check was added: once the impurity reaches 0, the node is “pure” and we don’t need to calculate a split for it.
  3. The Gini impurity calculation was changed to (1 - sum(p^2)) so that it yields correct values in the range from 0 to 0.5, as required for the Gini index with two labels.
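
To illustrate items 2 and 3, here is a minimal sketch of the (1 - sum(p^2)) formula and the purity check. The class `GiniSketch` and the methods `gini` and `isPure` are hypothetical names chosen for this example; they are not the classes changed in this pull request, and treating an empty node as pure is an assumption of the sketch.

```java
import java.util.Arrays;

/** Illustrative helper only; names are hypothetical and not taken from the PR. */
class GiniSketch {
    /** Gini impurity 1 - sum(p_i^2), computed from the label counts of a node. */
    static double gini(long[] labelCounts) {
        long total = Arrays.stream(labelCounts).sum();
        if (total == 0)
            return 0.0; // Assumption: an empty node is treated as pure.

        double sumOfSquares = 0.0;
        for (long count : labelCounts) {
            double p = (double) count / total; // Share of this label in the node.
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** A node with zero impurity holds a single label, so no split is needed. */
    static boolean isPure(long[] labelCounts) {
        return gini(labelCounts) <= 1e-12;
    }

    public static void main(String[] args) {
        System.out.println(gini(new long[] {5, 5}));    // 0.5 (1:1 ratio, worst case for two labels)
        System.out.println(gini(new long[] {10, 0}));   // 0.0 (single label)
        System.out.println(isPure(new long[] {10, 0})); // true
    }
}
```

With a check like `isPure` applied before the split search, an already “pure” node becomes a leaf immediately instead of being passed to the “bestSplit” calculation.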

* Impurity is a value from 0 to 0.5 that shows whether the node is “pure” (impurity = 0, just 1 label) or “impure”; impurity = 0.5 is the worst case, where the label ratio is 1:1.
** Gain is the difference between the parent node’s impurity and the weighted impurity of its child nodes. The split that provides the maximum gain is considered the best. See https://www.learndatasci.com/glossary/gini-impurity/
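
The gain definition above can be sketched as follows. This is an assumption-laden illustration, not the code from this pull request: `GainSketch`, `gain`, `shouldSplit`, and the label-count array representation are hypothetical, and only `minImpurityDelta` is a name taken from the description.

```java
import java.util.Arrays;

/** Illustrative helper only; names are hypothetical and not taken from the PR. */
class GainSketch {
    /** Number of items in a node, given its label counts. */
    static long size(long[] labelCounts) {
        return Arrays.stream(labelCounts).sum();
    }

    /** Gini impurity 1 - sum(p_i^2); an empty node is treated as pure (assumption). */
    static double gini(long[] labelCounts) {
        long total = size(labelCounts);
        if (total == 0)
            return 0.0;
        double sumOfSquares = 0.0;
        for (long count : labelCounts) {
            double p = (double) count / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** Gain = parent impurity minus the size-weighted impurity of the two children. */
    static double gain(long[] parent, long[] left, long[] right) {
        long n = size(left) + size(right);
        if (n == 0)
            return 0.0;
        double weightedChildren = (size(left) * gini(left) + size(right) * gini(right)) / n;
        return gini(parent) - weightedChildren;
    }

    /** Split only an impure parent, and only if the gain exceeds minImpurityDelta. */
    static boolean shouldSplit(long[] parent, long[] left, long[] right, double minImpurityDelta) {
        return gini(parent) > 0.0 && gain(parent, left, right) > minImpurityDelta;
    }

    public static void main(String[] args) {
        // Mixed parent separated perfectly: gain equals the parent impurity.
        System.out.println(gain(new long[] {5, 5}, new long[] {5, 0}, new long[] {0, 5})); // 0.5

        // Already pure parent, all items sent left: zero gain, so no split.
        System.out.println(shouldSplit(new long[] {10, 0}, new long[] {10, 0}, new long[] {0, 0}, 0.0)); // false
    }
}
```

The second call mirrors the scenario from the description: a “pure” parent whose items all move to the left child yields a gain of 0, so the split is rejected rather than repeated until the maximum depth.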
