Skip to content
This repository was archived by the owner on May 21, 2022. It is now read-only.

Updated to reflect new LearnBase interface for getobs and ObsDim #50

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

darsnack
Copy link
Member

This is a WIP in response to JuliaML/LearnBase.jl#44. The changes to the LearnBase.jl interface reduce the MLDataPattern.jl codebase significantly. Most notably, we are able to avoid a lot of the routing logic for getobs.

Since there are other JuliaML packages that will need similar updates, I want to summarize the highlights of what this transition entails:

  • Most of it is removing convert(LearnBase.ObsDimension, obsdim) statements; obsdim now has no type restrictions.
  • For a lot of array code, generated functions were used to create views based on the obsdim type. This is no longer necessary, and you can directly call selectdim(A, idx, obsdim) from Base. In some crude testing, I found no performance regressions from doing this. selectdim essentially does the views + colon tricks that the generated functions used to do.
  • Make sure to remove definitions for getobs(x), getobs(x, obsdim), and nobs(x) from the code. The only interface functions to implement are getobs(x, idx, obsdim) and nobs(x, obsdim) (note: this may change slightly based on the LearnBase.jl PR but deleting the above methods is still valid.
  • If most of the JuliaML packages are structured similar to MLDataPattern.jl, then the bulk of the changes will be in one or two files.

There are still some parts I haven't addressed like gettarget and BatchView. There are no technical challenges here; I just haven't gotten to them. We'll want to get those changes in too before merging.

@johnnychen94
Copy link
Member

johnnychen94 commented Oct 30, 2020

You could temporarily check in a Manifest.toml file to track on JuliaML/LearnBase.jl#44 so as to put CI back to the track. This PR should be approved first with green checks in CI before we merge and tag a new release for JuliaML/LearnBase.jl#44

@darsnack
Copy link
Member Author

darsnack commented Nov 3, 2020

So one issue that I've run into with MLDataPattern.jl. There are some cases (e.g. DataSubset) where the obsdim is stored as part of the struct. In these cases, I (speaking from a developer's perspective) want to just define

LearnBase.getobs(data::DataSubset, idx) =
    getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim)

because it does not make sense that obsdim should be anything other than subset.obsdim. But since technically a user can call getobs(subset, idx; obsdim = 2) (i.e. set obsdim kwarg to anything), I need to insert a check:

function LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset))
    @assert obsdim === subset.obsdim
    return getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)
end

Or I can just ignore obsdim kwarg and return getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim).

Both options seem messy to me. Since the default_obsdim interface exists, I don't think there is any reason to be storing obsdim as part of structs. Instead we can just do something like:

LearnBase.default_obsdim(subset::DataSubset) = default_obsdim(subset.data)

LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset)) =
    getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)

There are two reasons for this:

  1. In most cases, the user will never bother specifying a different obsdim than default_obsdim(subset.data).
  2. If subset.data is an array with the observation dimension in the 2nd dim for some reason, the user can override the default by setting obsdim = 2. This is the reason why I think that storing the obsdim does not make much sense, since if it not the default, then the user specifies it anyways.

UPDATE:

I see now that perhaps the reason for storing obsdim was to facilitate the bounds check in the inner constructor:

function DataSubset{T,I,O}(data::T, indices::I,obsdim::O) where {T,I,O}
    if T <: Tuple
       error("inner constructor should not be called using a Tuple")
    end
    1 <= minimum(indices) || throw(BoundsError(data, indices))
    maximum(indices) <= nobs(data, obsdim) || throw(BoundsError(data, indices))
    new{T,I,O}(data, indices, obsdim)
end

While this kind of check is a nice convenience, I think it just increases the code complexity of MLDataPattern.jl, and it is not necessary. If indices is out of bounds for data, then a bounds error will be thrown on the actual call to getindex(data, indices).

@johnnychen94
Copy link
Member

johnnychen94 commented Nov 6, 2020

Sorry for the delay. Not using MLDataPattern for quite a long time, I'm not as familiar with it as you are so I don't know which is better. Generally, we want to make things simpler during this refactoring.

If making it convenient for users makes it complex and thus harder to maintain for developers, I personally prefer the complex version just to make it more intuitive to use.

the reason for storing obsdim was to facilitate the bounds check in the inner constructor:

Given that usage like DataSubset(X', 21:100, ObsDim.First()) is now unavailable, I guess we could remove that if this becomes the only usage case.

@racinmat
Copy link
Contributor

Hi, any update on this? When will [email protected] be supported? Can I help with this somehow?

@darsnack
Copy link
Member Author

Yes, this PR effort is something I haven't had time to push through. I will put it together in a stage this week where someone else can take the torch (with an explanation of what needs to be done).

@darsnack
Copy link
Member Author

For the views like FoldsView, BatchView, and SlidingWindow, etc., I think we can just drop the abstract types. There is very little code that dispatches on the abstract types, and we can easily use @eval with a loop to define repeated code for the few concrete types that subtype e.g. SlidingWindow. Does that sound good @racinmat?

@racinmat
Copy link
Contributor

Sounds good. Although I consider eval with loop for defining them a bit of abuse and hard to read code and I would rathere have some lightweight supertype which would be part of MLDataPattern and we would define the behavior needed for FoldsView, BatchView, and SlidingWindow etc. there. That sounds more julianic to me.

@darsnack
Copy link
Member Author

One of the things I would like to determine is how much of the non-getobs/nobs code can be removed (or at least is there value in keeping it). If there is a need to keep a lot of the shared interfaces, then we can define a super type. Instead of inheriting from DataView, it would inherit from AbstractDataContainer directly.

@racinmat
Copy link
Contributor

So it should be
ObsIterator -> AbstractDataIterator
BatchIterator -> AbstractDataIterator
AbstractObsView -> AbstractDataContainer
AbstractBatchView -> AbstractDataContainer
DataView ->AbstractDataContainer
etc.?

@darsnack
Copy link
Member Author

Yeah basically if there is an abstract type like SlidingWindow that actually gets used (to define some shared code), then keep it. If there is an abstract type that is just another layer between AbstractDataContainer/AbstractDataIterator with no real use, then remove it and its subtypes can inherit directly from AbstractDataContainer/AbstractDataIterator.

@CarloLucibello
Copy link
Member

Can someone rebase?

@CarloLucibello
Copy link
Member

Hi @rancimat, what is the status of this?

@darsnack
Copy link
Member Author

@racinmat had to hand back off to me due to work commitments. I have been busy all semester, but I will work on this during break.

@racinmat
Copy link
Contributor

Yes, in the end I had less time for this, so I did what I could, but I won't be able to finish this, sorry.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants