Resolve a few preprocessing/training bugs #394
Merged
This branch addresses #327, #372, #392.
It makes one API change:
It also adds code to CnnPreprocessor that automatically drops rows with duplicated index values from overlay_df. A user could conceivably intend to supply duplicated values in the overlay_df index (for instance, to change the representation of samples), and this change will silently delete those duplicates. The motivation was issue #392: when overlay_df contained a duplicated index, looking up the labels for that index returned a 2-d array instead of a 1-d array in the ImgOverlay action with update_labels=True. One option would be to keep just one of the rows (or pick one at random), but there is no obvious "correct" behavior when the duplicated rows have different values, so instead I decided to enforce that the index of overlay_df is unique. Since CnnPreprocessor already hard-codes many preprocessing decisions, I felt comfortable adding a line to CnnPreprocessor.__init__() that automatically drops rows with duplicated indices from the overlay_df it receives.
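The dedup step amounts to a one-liner on the pandas index. A minimal sketch (the sample DataFrame and file names here are made up for illustration; only the `index.duplicated` pattern reflects the described behavior):

```python
import pandas as pd

# toy overlay_df with a duplicated index value ("a.wav" appears twice)
overlay_df = pd.DataFrame(
    {"species_a": [1, 0, 1]},
    index=["a.wav", "a.wav", "b.wav"],
)

# keep only the first row for each index value, silently dropping the rest
overlay_df = overlay_df[~overlay_df.index.duplicated(keep="first")]
```

After this, label lookups by index value always return a single row, so update_labels gets a 1-d array.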
#372 was a bug caused by the assertion on overlay_weight: the input validation simply needed to be rewritten to accept either a single float or a range like [0.1, 0.8].
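A sketch of what the relaxed validation looks like (the function name and messages are hypothetical; the actual check in the PR may differ):

```python
def validate_overlay_weight(overlay_weight):
    """Accept a single float in (0, 1) or a two-element [min, max] range."""
    if isinstance(overlay_weight, (int, float)):
        # a single fixed weight
        assert 0 < overlay_weight < 1, "overlay_weight must be between 0 and 1"
    else:
        # a [min, max] range to sample the weight from
        assert len(overlay_weight) == 2, "overlay_weight range must have two values"
        low, high = overlay_weight
        assert 0 < low <= high < 1, "range values must be ordered and between 0 and 1"
```

Both `validate_overlay_weight(0.5)` and `validate_overlay_weight([0.1, 0.8])` pass, whereas the old assertion rejected one of the two forms.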
#327 was already half resolved. I added an error message for the kind of IndexError in SafeDataset that should never actually happen: indexing past the end of self.df.
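The guard is roughly of this shape (a hypothetical helper, not the actual SafeDataset code; names and wording are assumptions):

```python
import pandas as pd

def safe_get_row(df, idx):
    """Return row idx of df, raising a descriptive IndexError if out of bounds."""
    if idx >= len(df):
        # should never happen in practice; fail with a clear message if it does
        raise IndexError(
            f"Tried to access sample {idx}, but the dataset only contains "
            f"{len(df)} samples. This should never happen; please report it."
        )
    return df.iloc[idx]
```

The point is just that a bare IndexError from deep inside pandas is replaced by a message naming the index and the dataset length.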