
Add supports for feature importances #166

Closed (wants to merge 44 commits into from)

Conversation

@yufongpeng (Collaborator) commented May 22, 2022

edit (@ablaom) Closes #170.

Needs:


I've added three methods for calculating feature importances.
The default, `feature_importances`, uses Mean Decrease in Impurity (MDI), which is computed simultaneously with model building.
`permutation_importances` shuffles each column multiple times and compares the resulting R² or accuracy with that of the original model.
`dropcol_importances` deletes each column and uses cross-validation instead. (A sketch of the permutation idea follows.)
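A minimal, self-contained sketch of the permutation idea, not this PR's implementation; the function name `permutation_importance` and its arguments are illustrative, and `score` stands for any caller-supplied metric such as R² or accuracy:

```julia
using Random

# Average drop in a caller-supplied score when each feature column is
# independently shuffled; a large drop means the column mattered.
function permutation_importance(score, X, y; n_iter=3, rng=Random.default_rng())
    base = score(X, y)                     # score on the intact data
    drops = zeros(size(X, 2))
    for j in axes(X, 2)
        for _ in 1:n_iter
            Xp = copy(X)
            shuffle!(rng, @view Xp[:, j])  # permute only column j
            drops[j] += base - score(Xp, y)
        end
        drops[j] /= n_iter                 # mean drop over shuffles
    end
    return drops
end
```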

@codecov-commenter commented May 22, 2022

Codecov Report

Merging #166 (90d0b30) into dev (82c577c) will decrease coverage by 1.33%.
The diff coverage is 83.67%.

❗ Current head 90d0b30 differs from pull request most recent head f0d9677. Consider uploading reports for the commit f0d9677 to get more accurate results

@@            Coverage Diff             @@
##              dev     #166      +/-   ##
==========================================
- Coverage   89.51%   88.17%   -1.34%     
==========================================
  Files          10       10              
  Lines         992     1201     +209     
==========================================
+ Hits          888     1059     +171     
- Misses        104      142      +38     
Impacted Files                Coverage Δ
src/DecisionTree.jl           54.54% <61.11%> (ø)
src/measures.jl               93.43% <68.96%> (-4.20%) ⬇️
src/scikitlearnAPI.jl         53.52% <71.95%> (+13.10%) ⬆️
src/classification/main.jl    97.81% <98.78%> (+0.25%) ⬆️
src/abstract_trees.jl        100.00% <100.00%> (ø)
src/classification/tree.jl   100.00% <100.00%> (ø)
src/regression/main.jl        95.16% <100.00%> (+3.05%) ⬆️
src/regression/tree.jl       100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 82c577c...f0d9677.

@ablaom (Member) commented May 22, 2022

@yufongpeng Thanks for this contribution. A cursory review has convinced me that this is a worthy and well-executed enhancement to DecisionTree.jl.

I think I can make a good detailed review of this PR, but it will take a little time. In the meantime, can you please add docstrings for the three new feature importance methods, and also a docstring for your enhancement of prune_tree (which you did not mention above)? Ideally, point to some reference where the algorithm is documented.

Additionally, we will need to add tests to this PR. The current tests do not seem adequate, as far as I can tell.

@yufongpeng (Collaborator, Author)

There are some changes on this branch:

  1. I moved `dropcol_importances` to scikitlearnAPI. Cross-validation is optional, but I cannot test CV unless there is a function that runs the built-in nfoldCV on the sklearn wrappers.
  2. `permutation_importances` can only do multiple shuffles for native trees, but can do CV for the sklearn wrappers. However, just as with `dropcol_importances`, this cannot be tested.
  3. The modification to `prune_tree` just recalculates `feature_importances` (modifying the property `featim`) if necessary.
  4. Docstrings and more tests are added.
     I use the cosine of two importance vectors to quantify their similarity (see the sketch below); I'm not sure if this is adequate.
     I tend to compare the same kind of feature importances between models with different hyperparameters, but some are not available, especially for random forests and adaboost stumps. Some of these tests are very robust, but many are quite unstable between Julia versions, so I dropped a lot of such tests.
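A minimal sketch of the cosine-similarity check described in point 4; the function name `similarity` and the threshold are illustrative, not the exact test code in this PR:

```julia
using LinearAlgebra, Test

# Cosine of the angle between two importance vectors: 1.0 means identical
# direction, 0.0 means orthogonal rankings.
similarity(u, v) = dot(u, v) / (norm(u) * norm(v))

imp_a = [0.50, 0.30, 0.20]   # hypothetical importances from model A
imp_b = [0.45, 0.35, 0.20]   # hypothetical importances from model B
@test similarity(imp_a, imp_b) > 0.95
```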

@yufongpeng (Collaborator, Author)

@ablaom I think this PR is almost ready, but there's one thing I need to fix. In the new version of prune_tree, I only consider classification models and update impurity_importance based on entropy (or another discrete metric like Gini), because prune_tree is based on accuracy. In the test file there's a test that runs prune_tree on a regression tree, and the resulting impurity_importance is definitely wrong. I'm able to fix this, but I'm not sure whether using accuracy on a regression tree is OK, or whether we should avoid pruning a regression tree altogether.


Prune tree based on prediction accuracy of each nodes.
* `purity_thresh`: If prediction accuracy of a stump is larger than this value, the node will be pruned and become a leaf.
* `loss`: The loss function for computing node impurity. It can be either `util.entropy`, `util.gini` or other measures of impurity depending on the loss function used for building the tree.
Member (review comment):

I'm finding this one sentence a little confusing. Do you mean to say that the loss can be any of these, but would generally be chosen to match the loss used in building the tree (but doesn't have to be)?

@yufongpeng (Collaborator, Author) replied:

Yes, you can change the loss function freely, because the tree does not store the loss function; it has to be provided externally. I've changed the algorithm a little bit: if the tree is a `Root{S, T}` where `T` is `Float64`, the loss function will be `mean_squared_error`; otherwise, it will be `util.entropy`. (A sketch of this dispatch rule follows.)
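A minimal, self-contained sketch of that type-based default; `Root`, `mean_squared_error`, and `entropy` here are stand-ins for the PR's actual types and utilities, not the package code:

```julia
# Placeholder tree type: S is the feature eltype, T the label eltype.
struct Root{S, T} end

mean_squared_error(y, ŷ) = sum(abs2, y .- ŷ) / length(y)
entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)

# Float64 labels signal a regression tree; anything else, classification.
default_loss(::Root{S, Float64}) where {S} = mean_squared_error
default_loss(::Root{S, T}) where {S, T}    = entropy

@assert default_loss(Root{Int, Float64}()) === mean_squared_error
@assert default_loss(Root{Int, String}())  === entropy
```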

@ablaom (Member) commented Jun 26, 2022

@yufongpeng I agree that an accuracy-based prune_tree does not make sense for regression, and we can remove that test. Please add a comment to the prune_tree docstring that warns against its use in regression.


```diff
 Prune tree based on prediction accuracy of each nodes.
 * `purity_thresh`: If prediction accuracy of a stump is larger than this value, the node will be pruned and become a leaf.
-* `loss`: The loss function for computing node impurity. It can be either `util.entropy`, `util.gini` or other measures of impurity depending on the loss function used for building the tree.
+* `loss`: The loss function for computing node impurity. For classification tree, default function is `util.entropy`; for regression tree, it's `mean_squared_error`.
```
Member (review comment):

Two points:

  • `util.entropy` is not in the DecisionTree namespace. So unless we decide to export this, we should write `DecisionTree.util.entropy`.

  • We should mention `DecisionTree.util.gini` as an available option, as the user is unlikely to discover it otherwise.

@kescobo commented Jun 28, 2022

Apologies if this is orthogonal to the PR itself, and also if this is a naive question, but I'm anxious to get some results that make use of it. Are changes required to MLJDecisionTreeInterface in order for me to use this via MLJ? I've added @yufongpeng 's fork, and trained a machine using the MLJ interface. I tried using impurity_importance() on the machine, as well as mach.model, but neither worked.

Is there some other way to do this, or do I need to use the DecisionTree API directly instead?

@OkonSamuel (Member) replied:

> Are changes required to MLJDecisionTreeInterface in order for me to use this via MLJ? [...] Is there some other way to do this, or do I need to use the DecisionTree API directly instead?

I'm currently working on the MLJ API for feature importance. I'll be done with it this week.

@kescobo commented Jun 28, 2022

Got it, thanks for putting this together! I was able to use it on the DecisionTree.jl model directly for now, but will look forward to when it's ready in MLJ 😄
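For readers following along, a hedged sketch of the direct DecisionTree.jl usage mentioned above; the exact accessor signature in this PR may differ:

```julia
using DecisionTree

features = rand(100, 3)            # toy data
labels   = rand(100)               # continuous labels -> regression forest
model    = build_forest(labels, features)
impurity_importance(model)         # MDI score per feature column
```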

@rikhuijzer (Member) left a comment

The diff doesn't show only what this PR changes; it also contains the changes from #174 and other PRs. This is probably caused by an incorrect merge of dev into this PR. I don't know how to fix it via Git, but you could copy the files that you've changed over to a new PR based on dev, and then it should all look good again. (One possible command sequence is sketched below.)

EDIT: Also, I would suggest next time creating the PR on a new branch in the fork. That way, upstream changes can be merged into the dev branch on the fork and then merged into the new branch. Currently, the PR is on the dev branch itself, so it isn't so easy to merge changes into it from the original dev.
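One possible command sequence for that cleanup, assuming the fork has a remote named `upstream` pointing at the original repository; branch and path names are illustrative:

```sh
git fetch upstream
git checkout -b feature-importances upstream/dev   # fresh branch off the real dev
git checkout dev -- src/ test/                     # copy over only the changed paths
git commit -m "Port feature-importance changes to a clean branch"
git push origin feature-importances                # open the new PR from this branch
```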

base = score(trees, labels, features)                       # baseline score on unshuffled data
scores = Matrix{Float64}(undef, size(features, 2), n_iter)  # one row per feature, one column per shuffle
rng = mk_rng(rng)::Random.AbstractRNG
for (i, col) in enumerate(eachcol(features))                # `eachcol` needs Julia >= 1.1
Member (review comment):

The `eachcol` method requires at least Julia 1.1. I suggest we bump the minimum Julia compat for DecisionTree.jl to "1.1".
@ablaom @yufongpeng
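For reference, that bump is a one-line change in the package's Project.toml:

```toml
[compat]
julia = "1.1"
```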

@rikhuijzer (Member) commented Jun 29, 2022

It is already at 1.6 (#179), but this PR is in a bad state.

@kescobo commented Jun 29, 2022

> I don't know how to fix it via Git, but you could copy the files that you've changed over to a new PR based on dev and then it should all be looking good again

I just attempted a rebase, but there was a lot going on that I didn't totally understand. Someone is going to have to be careful not to undo work that has happened on dev since this PR was started.

@ablaom (Member) commented Jun 29, 2022

@yufongpeng Can you please try @rikhuijzer's suggestions to clean up the git issues?

@yufongpeng mentioned this pull request on Jun 30, 2022
@ablaom (Member) commented Jul 1, 2022

Closed in favour of #182

This pull request was closed.
Successfully merging this pull request may close these issues:
[Feature Request] Add a Field for Feature Importance