This repository was archived by the owner on Feb 28, 2025. It is now read-only.

input validation added to roc #230

Merged: 12 commits into ploomber:master on Jan 29, 2023

Conversation

yafimvo
Contributor

@yafimvo yafimvo commented Jan 22, 2023

Describe your changes

  1. Input validation added to roc
  2. Merged with the latest version (issue 146)
  3. Moved the ROC test resources (image baselines) to test_roc

plot.ROC.from_raw_data and plot.roc take two inputs, y_test and y_score, in the following formats:

y_test

"classes"

[0, 1, 2, 0, 1, ...]
['virginica', 'versicolor', 'virginica', 'setosa', ...]

"one-hot encoded classes"

[[0, 0, 1],
 [1, 0, 0]]

y_score

"scores"

[[0.1, 0.1, 0.8],
 [0.7, 0.15, 0.15]]

Issue ticket number and link

Closes #98

Checklist before requesting a review

  • I have performed a self-review of my code
  • I have added thorough tests (when necessary).
  • I have added the right documentation (when needed). Product update? If yes, write one line about this update.

📚 Documentation preview 📚: https://sklearn-evaluation--230.org.readthedocs.build/en/230/

Updated Validating Elbow Curve Input Model (ploomber#219)

* added parameterized tests (added to changelog; better changelog message; validating input model)
* conf.py edit
* updated validation for elbow curve
* conditional to check for python 3.7 (linting)
* changed conditional
* try-catch block, better skipif (lint)
* Empty commit
* skipif failing fix
* resolving name error
* fixed conditional (updates newsletter url; changelog and docs updated)
@coveralls

coveralls commented Jan 22, 2023

Pull Request Test Coverage Report for Build 4005213428

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 56 of 73 (76.71%) changed or added relevant lines in 3 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 94.014%

Changes missing coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/sklearn_evaluation/grid/random_forest_classifier_grid.py | 6 | 10 | 60.0% |
| src/sklearn_evaluation/util.py | 11 | 16 | 68.75% |
| src/sklearn_evaluation/plot/roc.py | 39 | 47 | 82.98% |

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| src/sklearn_evaluation/plot/roc.py | 1 | 94.97% |
| src/sklearn_evaluation/grid/random_forest_classifier_grid.py | 3 | 88.52% |

Totals:

| Change from base Build 4001524496 | -0.2% |
| --- | --- |
| Covered Lines | 2780 |
| Relevant Lines | 2957 |

💛 - Coveralls

@yafimvo yafimvo requested review from edublancas and idomic January 22, 2023 11:59
@idomic idomic requested a review from neelasha23 January 24, 2023 15:25
@idomic
Contributor

idomic commented Jan 24, 2023

@yafimvo please resolve conflicts

Contributor

@edublancas edublancas left a comment


found two issues:

[screenshot]

gives this error:

ValueError: classes [0 1] mismatch with the labels [0 1 2] found in the data

looks like it thinks it only has two classes, but it has three. but where is the 2 coming from?

to reproduce:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn_evaluation import plot
from sklearn import preprocessing

iris = load_iris()
X, y = iris.data, iris.target
y_ = iris.target_names[y]

random_state = np.random.RandomState(0)
n_samples, n_features = X.shape

X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y_, test_size=0.5, stratify=y, random_state=0)

classifier = LogisticRegression()
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test_bin = lb.transform(y_test)

roc = plot.ROC.from_raw_data(y_test_bin, y_score)

also, when calling this (same code as above):

# try to break it by passing y_test_bin in y_score
roc = plot.ROC.from_raw_data(y_test, y_test_bin)

the error shows:

ValueError: Please check y_score values. 
Expected scores array-like. got: [[0 0 1]
 [0 1 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
[really long error]

I see that the error is displaying the whole input; however, it's very long. It'd be better to display just the first few characters:

probably something like this:

[screenshot]

@yafimvo
Contributor Author

yafimvo commented Jan 25, 2023

@edublancas

In one of our examples, we try to plot a ROC curve using DecisionTreeClassifier and, as a result, y_score is in an invalid format:

tree_score, forest_score = [
    est.fit(X_train, y_train).predict_proba(X_test)
    for est in [DecisionTreeClassifier(), RandomForestClassifier()]
]

Do we need to support it?

If not, we can change it to LogisticRegression and it will work fine.

_is_binary = is_binary(array)

if _is_binary:
    _is_1d_array = len(array.shape) == 1
Contributor


can we use switch case here for better readability?

Contributor Author


I agree on readability, but I don't see how a switch case would help here since there is a single if/else. I extracted the inner if/else into a method (_get_number_of_elements). What do you think?

Contributor


yes that's fine I guess. Maybe some short comments can also help.

@edublancas
Contributor

> In one of our examples we try to plot a roc curve using the DecisionTreeClassifier and as a result, y_score is in an invalid format

good catch. technically speaking, this is in the right format since these are valid scores (they are floats, and 0.0 and 1.0 are valid values). the "issue" here is the nature of the decision tree model, which tends to produce these types of scores.

I think let's change it to logistic regression so the scores are more meaningful
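The contrast between the two models can be seen directly. A minimal sketch (using iris, as in the repro above; exact numbers depend on the split):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training data, so its leaves are pure and
# predict_proba collapses to hard 0.0/1.0 values.
tree_score = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)

# Logistic regression produces graded probabilities instead.
lr_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)

print(np.unique(tree_score))  # typically only 0.0 and 1.0
print(lr_score.round(3)[:3])  # intermediate values
```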

@yafimvo yafimvo requested a review from edublancas January 26, 2023 14:41
@idomic
Contributor

idomic commented Jan 26, 2023

> I think let's change it to logistic regression so the scores are more meaningful

In this specific case for ROC, or the whole guide? Should we open a different issue for that?

@edublancas @yafimvo Also what else is missing here?

@yafimvo
Contributor Author

yafimvo commented Jan 26, 2023

@idomic
I changed the example only for roc.

Ready for review


@image_comparison(baseline_images=["roc_add_roc"])
def test_roc_add_to_roc(roc_values):
    fpr1, tpr1 = roc_values
Contributor


Why do we need this? It seems like those values get overwritten?
I see similar stuff in other test cases as well.

Contributor

@idomic idomic Jan 26, 2023


If we can parameterize the inputs I think we should do it as the test seems pretty much the same for the roc add (might not be possible due to the image comparison)

Contributor Author


Good point. This was from the previous code, where the roc tests were scattered across different places. I moved all the roc resources from conftest.py to test_roc.py and parameterized the "short" values that are easy to use. The more complex values I kept as fixtures.

Contributor

@idomic idomic left a comment


looks good, just added a comment on the tests

@idomic idomic merged commit 401e2fb into ploomber:master Jan 29, 2023
@idomic
Contributor

idomic commented Jan 29, 2023

Nice job @yafimvo !

Development

Successfully merging this pull request may close these issues.

Multiclass ROC curve
5 participants