
Add supports for feature importances #166

Closed (wants to merge 44 commits into from)

Conversation

@yufongpeng (Collaborator) commented May 22, 2022

edit (@ablaom) Closes #170.

Needs:


I've added three methods for calculating feature importances.
The default, `feature_importances`, uses Mean Decrease in Impurity (MDI), which is computed simultaneously with model building.
`permutation_importances` shuffles each column multiple times and compares the resulting R² or accuracy with that of the original model.
`dropcol_importances` deletes each column and uses cross-validation instead. (A sketch of the permutation idea follows.)
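A minimal, self-contained sketch of the permutation idea, not this PR's implementation; the function name `permutation_importance` and its arguments are illustrative, and `score` stands for any caller-supplied metric such as R² or accuracy:

```julia
using Random

# Average drop in a caller-supplied score when each feature column is
# independently shuffled; a large drop means the column mattered.
function permutation_importance(score, X, y; n_iter=3, rng=Random.default_rng())
    base = score(X, y)                     # score on the intact data
    drops = zeros(size(X, 2))
    for j in axes(X, 2)
        for _ in 1:n_iter
            Xp = copy(X)
            shuffle!(rng, @view Xp[:, j])  # permute only column j
            drops[j] += base - score(Xp, y)
        end
        drops[j] /= n_iter                 # mean drop over shuffles
    end
    return drops
end
```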

@codecov-commenter commented May 22, 2022

Codecov Report

Merging #166 (90d0b30) into dev (82c577c) will decrease coverage by 1.33%.
The diff coverage is 83.67%.

❗ Current head 90d0b30 differs from pull request most recent head f0d9677. Consider uploading reports for the commit f0d9677 to get more accurate results

@@            Coverage Diff             @@
##              dev     #166      +/-   ##
==========================================
- Coverage   89.51%   88.17%   -1.34%     
==========================================
  Files          10       10              
  Lines         992     1201     +209     
==========================================
+ Hits          888     1059     +171     
- Misses        104      142      +38     
Impacted Files                Coverage Δ
src/DecisionTree.jl           54.54% <61.11%> (ø)
src/measures.jl               93.43% <68.96%> (-4.20%) ⬇️
src/scikitlearnAPI.jl         53.52% <71.95%> (+13.10%) ⬆️
src/classification/main.jl    97.81% <98.78%> (+0.25%) ⬆️
src/abstract_trees.jl        100.00% <100.00%> (ø)
src/classification/tree.jl   100.00% <100.00%> (ø)
src/regression/main.jl        95.16% <100.00%> (+3.05%) ⬆️
src/regression/tree.jl       100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 82c577c...f0d9677.

@ablaom (Member) commented May 22, 2022

@yufongpeng Thanks for this contribution. A cursory review has convinced me that this is a worthy and well-executed enhancement to DecisionTree.jl.

I think I can make a good detailed review of this PR, but it will take a little time. In the meantime, can you please add docstrings for the three new feature importance methods, and also a docstring for your enhancement of prune_tree (which you did not mention above)? Ideally, point to some reference where the algorithm is documented.

Additionally, we will need to add tests to this PR. The current tests do not seem adequate, as far as I can tell.

@yufongpeng (Collaborator, Author)

There are some changes on this branch:

  1. I moved `dropcol_importances` to scikitlearnAPI. Cross-validation is optional, but I cannot test CV unless there is a function that runs the built-in nfoldCV on the sklearn wrappers.
  2. `permutation_importances` can only do multiple shuffles for native trees, but can do CV for the sklearn wrappers. However, just as with `dropcol_importances`, this cannot be tested.
  3. The modification to `prune_tree` just recalculates `feature_importances` (modifying the property `featim`) if necessary.
  4. Docstrings and more tests are added.
     I use the cosine of two importance vectors to quantify their similarity (see the sketch below); I'm not sure if this is adequate.
     I tend to compare the same kind of feature importances between models with different hyperparameters, but some are not available, especially for random forests and adaboost stumps. Some of these tests are very robust, but many are quite unstable between Julia versions, so I dropped a lot of such tests.
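A minimal sketch of the cosine-similarity check described in point 4; the function name `similarity` and the threshold are illustrative, not the exact test code in this PR:

```julia
using LinearAlgebra, Test

# Cosine of the angle between two importance vectors: 1.0 means identical
# direction, 0.0 means orthogonal rankings.
similarity(u, v) = dot(u, v) / (norm(u) * norm(v))

imp_a = [0.50, 0.30, 0.20]   # hypothetical importances from model A
imp_b = [0.45, 0.35, 0.20]   # hypothetical importances from model B
@test similarity(imp_a, imp_b) > 0.95
```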

@yufongpeng (Collaborator, Author)

@ablaom I think this PR is almost ready, but there's one thing I need to fix. In the new version of prune_tree, I only consider classification models and update impurity_importance based on entropy (or another discrete metric like Gini), because prune_tree is based on accuracy. In the test file there's a test that runs prune_tree on a regression tree, and the resulting impurity_importance is definitely wrong. I'm able to fix this, but I'm not sure whether using accuracy on a regression tree is OK, or whether we should avoid pruning a regression tree altogether.


Prune tree based on prediction accuracy of each nodes.
* `purity_thresh`: If prediction accuracy of a stump is larger than this value, the node will be pruned and become a leaf.
* `loss`: The loss function for computing node impurity. It can be either `util.entropy`, `util.gini` or other measures of impurity depending on the loss function used for building the tree.
Member (review comment):

I'm finding this one sentence a little confusing. Do you mean to say that the loss can be any of these, but would generally be chosen to match the loss used in building the tree (but doesn't have to be)?

@yufongpeng (Collaborator, Author) replied:

Yes, you can change the loss function freely, because the tree does not store the loss function; it has to be provided externally. I've changed the algorithm a little bit: if the tree is a `Root{S, T}` where `T` is `Float64`, the loss function will be `mean_squared_error`; otherwise, it will be `util.entropy`. (A sketch of this dispatch rule follows.)
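A minimal, self-contained sketch of that type-based default; `Root`, `mean_squared_error`, and `entropy` here are stand-ins for the PR's actual types and utilities, not the package code:

```julia
# Placeholder tree type: S is the feature eltype, T the label eltype.
struct Root{S, T} end

mean_squared_error(y, ŷ) = sum(abs2, y .- ŷ) / length(y)
entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)

# Float64 labels signal a regression tree; anything else, classification.
default_loss(::Root{S, Float64}) where {S} = mean_squared_error
default_loss(::Root{S, T}) where {S, T}    = entropy

@assert default_loss(Root{Int, Float64}()) === mean_squared_error
@assert default_loss(Root{Int, String}())  === entropy
```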

@ablaom (Member) commented Jun 26, 2022

@yufongpeng I agree that an accuracy-based prune_tree does not make sense for regression, and we can remove that test. Please add a comment to the prune_tree docstring that warns against its use in regression.


```diff
 Prune tree based on prediction accuracy of each nodes.
 * `purity_thresh`: If prediction accuracy of a stump is larger than this value, the node will be pruned and become a leaf.
-* `loss`: The loss function for computing node impurity. It can be either `util.entropy`, `util.gini` or other measures of impurity depending on the loss function used for building the tree.
+* `loss`: The loss function for computing node impurity. For classification tree, default function is `util.entropy`; for regression tree, it's `mean_squared_error`.
```
Member (review comment):

Two points:

  • `util.entropy` is not in the DecisionTree namespace. So unless we decide to export this, we should write `DecisionTree.util.entropy`.

  • We should mention `DecisionTree.util.gini` as an available option, as the user is unlikely to discover it otherwise.

@kescobo commented Jun 28, 2022

Apologies if this is orthogonal to the PR itself, and also if this is a naive question, but I'm anxious to get some results that make use of it. Are changes required to MLJDecisionTreeInterface in order for me to use this via MLJ? I've added @yufongpeng 's fork, and trained a machine using the MLJ interface. I tried using impurity_importance() on the machine, as well as mach.model, but neither worked.

Is there some other way to do this, or do I need to use the DecisionTree API directly instead?

@OkonSamuel (Member) replied:

> Are changes required to MLJDecisionTreeInterface in order for me to use this via MLJ? [...] Is there some other way to do this, or do I need to use the DecisionTree API directly instead?

I'm currently working on the MLJ API for feature importance. I'll be done with it this week.

@kescobo commented Jun 28, 2022

Got it, thanks for putting this together! I was able to use it on the DecisionTree.jl model directly for now, but will look forward to when it's ready in MLJ 😄
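For readers following along, a hedged sketch of the direct DecisionTree.jl usage mentioned above; the exact accessor signature in this PR may differ:

```julia
using DecisionTree

features = rand(100, 3)            # toy data
labels   = rand(100)               # continuous labels -> regression forest
model    = build_forest(labels, features)
impurity_importance(model)         # MDI score per feature column
```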

@rikhuijzer (Member) left a comment

The diff doesn't show only what this PR changes; it also contains the changes from #174 and other PRs. This is probably caused by an incorrect merge of dev into this PR. I don't know how to fix it via Git, but you could copy the files that you've changed over to a new PR based on dev, and then it should all look good again. (One possible command sequence is sketched below.)

EDIT: Also, I would suggest next time creating the PR on a new branch in the fork. That way, upstream changes can be merged into the dev branch on the fork and then merged into the new branch. Currently, the PR is on the dev branch itself, so it isn't so easy to merge changes into it from the original dev.
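One possible command sequence for that cleanup, assuming the fork has a remote named `upstream` pointing at the original repository; branch and path names are illustrative:

```sh
git fetch upstream
git checkout -b feature-importances upstream/dev   # fresh branch off the real dev
git checkout dev -- src/ test/                     # copy over only the changed paths
git commit -m "Port feature-importance changes to a clean branch"
git push origin feature-importances                # open the new PR from this branch
```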

base = score(trees, labels, features)                       # baseline score on unshuffled data
scores = Matrix{Float64}(undef, size(features, 2), n_iter)  # one row per feature, one column per shuffle
rng = mk_rng(rng)::Random.AbstractRNG
for (i, col) in enumerate(eachcol(features))                # `eachcol` needs Julia >= 1.1
Member (review comment):

The `eachcol` method requires at least Julia 1.1. I suggest we bump the minimum Julia compat for DecisionTree.jl to "1.1".
@ablaom @yufongpeng
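For reference, that bump is a one-line change in the package's Project.toml:

```toml
[compat]
julia = "1.1"
```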

@rikhuijzer (Member) commented Jun 29, 2022

It is already at 1.6 (#179), but this PR is in a bad state.

@kescobo commented Jun 29, 2022

> I don't know how to fix it via Git, but you could copy the files that you've changed over to a new PR based on dev and then it should all be looking good again

I just attempted a rebase, but there was a lot going on that I didn't totally understand. Someone is going to have to be careful not to undo work that has happened on dev since this PR was started.

@ablaom (Member) commented Jun 29, 2022

@yufongpeng Can you please try @rikhuijzer's suggestions to clean up the git issues?

@yufongpeng mentioned this pull request on Jun 30, 2022
@ablaom (Member) commented Jul 1, 2022

Closed in favour of #182

This pull request was closed.
Successfully merging this pull request may close these issues:
[Feature Request] Add a Field for Feature Importance