use LearnBase dataset interface in the DataLoader #1683
Conversation
where is github's CI? bah
Let me try pushing through JuliaML/MLDataPattern.jl#50 over the weekend, and then we can also move some of these default definitions over to LearnBase.jl (or I could create MLBase.jl). Ideally, these definitions should live outside Flux so that they get reused across the ecosystem.
It does feel like piracy indeed to have those definitions here. I'll add them to LearnBase.
_getobs(data::AbstractArray, i) = data[ntuple(i -> Colon(), Val(ndims(data) - 1))..., i]
_getobs(data::Union{Tuple, NamedTuple}, i) = map(Base.Fix2(_getobs, i), data)
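For reference, a standalone sketch of what these two helpers do (reproduced here outside Flux so it runs on its own): they index an array along its last dimension, and map that indexing over tuples or named tuples of arrays.

```julia
# Standalone copies of the helpers above; nothing from Flux is assumed.
_getobs(data::AbstractArray, i) = data[ntuple(i -> Colon(), Val(ndims(data) - 1))..., i]
_getobs(data::Union{Tuple, NamedTuple}, i) = map(Base.Fix2(_getobs, i), data)

x = reshape(collect(1:12), 3, 4)   # 3×4 matrix; observations are columns
_getobs(x, 2)                      # the 2nd column: [4, 5, 6]
_getobs((x, collect(1:4)), 1:2)    # indexes both arrays along their last dimension
```

The `ntuple(_ -> Colon(), ...)` trick builds the right number of `:` slices so the same definition works for vectors, matrices, and higher-dimensional arrays.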
Base.eltype(::DataLoader{D}) where D = D
I had to remove this because it is actually wrong when batchsize=1. As a consequence, though, when we collect the dataloader we get a Vector{Any}:
julia> using Flux.Data
julia> DataLoader(rand(4), batchsize=2) |> collect
2-element Vector{Any}:
[0.5661901506001856, 0.28811712623170194]
[0.9265719684408686, 0.05498394818719787]
I don't know how to fix this in a not overly complicated way. I think we will have to live with it.
Unless we can consider
Base.eltype(d::DataLoader) = typeof(first(d))
an acceptable solution. But I don't think we can guarantee this for generic datasets.
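A small illustration of why `typeof(first(d))` can't be guaranteed for generic datasets: nothing forces later observations to share the type of the first one, so an eltype inferred from the first element can simply be wrong.

```julia
# Hypothetical heterogeneous "dataset": the first element's type is not
# representative of the rest.
hetero = Any[1, "two", 3.0]

eltype_guess = typeof(first(hetero))       # Int64
any(x -> !(x isa eltype_guess), hetero)    # true: the guess misses "two" and 3.0
```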
Probably this needs to be done consistently in LearnBase and MLDataPattern. Anything that implements iterate should specify IteratorEltype and, consequently, eltype.
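As a sketch of what "specify IteratorEltype and consequently eltype" means in practice, here is a hypothetical batch iterator (`MyBatches`, not part of Flux or LearnBase) that opts in to Julia's iterator traits, so `collect` produces a concretely typed vector instead of a `Vector{Any}`:

```julia
# Hypothetical batch iterator illustrating the Base iteration traits.
struct MyBatches{T}
    data::Vector{T}
    batchsize::Int
end

# Declare that the eltype is known, and what it is.
Base.IteratorEltype(::Type{<:MyBatches}) = Base.HasEltype()
Base.eltype(::Type{MyBatches{T}}) where {T} = Vector{T}

Base.length(b::MyBatches) = cld(length(b.data), b.batchsize)

function Base.iterate(b::MyBatches, start = 1)
    start > length(b.data) && return nothing
    stop = min(start + b.batchsize - 1, length(b.data))
    return b.data[start:stop], stop + 1
end

collect(MyBatches([1, 2, 3, 4, 5], 2))  # Vector{Vector{Int64}}: [[1, 2], [3, 4], [5]]
```

Because the trait is `HasEltype()`, `collect` can allocate a `Vector{Vector{Int64}}` up front; with the default `EltypeUnknown()` it would have to fall back to `Vector{Any}`, which is exactly the symptom discussed above.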
Meaning that if we implement the LearnBase interfaces, then we should use MLDataPattern.eachbatch and MLDataPattern.shuffle. And those containers/iterators should be implementing the eltype machinery.
OK, I'll have to take a look at that. In any case, since the current assumptions on master are incorrect, we have to remove eltype for the time being, and we can merge this as it is while we wait for the MLDataPattern update.
Seems like it is adding a dependency on LearnBase. Couldn't users who want to use LearnBase do so already with a loop?
LearnBase is a lightweight dependency. How would we be compatible with LearnBase interfaces without extending them for DataLoader?
That's not a problem, LearnBase is a lean dependency. The point here is to create a consistent dataset API for the whole ML ecosystem and have the DataLoader support it.
Force-pushed from ff55748 to c83ec79.
This shouldn't be merged until MLDataPattern is made compatible with the new LearnBase version, otherwise FastAI.jl (which depends on MLDataPattern) won't be compatible with the newest Flux releases. cc @darsnack
Since there is no hard constraint in FastAI on which dataloaders to use, I don't see why we would want to make this a requirement here. Flux shouldn't assume such things, so users can use what makes sense for their use case, or keep things simple and not force future work in Flux to be compatible with a particular design.
I don't see your point. This PR is not imposing additional limitations. It is just giving people the opportunity to define dataset types compatible with the DataLoader without having to hack into Flux's internals.
Why would we need it to go via DataLoader anyway? The for loop is a simpler implementation for anyone wanting specific iterators. |
Closing as this is implemented in JuliaML/MLUtils.jl#22 |
Users can now define their own dataset types, and as long as they define LearnBase.getobs and LearnBase.nobs, the DataLoader will be able to handle them. This PR also adopts dictionaries as one of the dataset types supported by default.
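To make the described two-function interface concrete, here is a self-contained sketch. The `nobs`/`getobs` below are local stand-ins for the LearnBase generics (so the example runs without the package), and `MyDataset` is a hypothetical user type:

```julia
# Local stand-ins for LearnBase.nobs / LearnBase.getobs, so this sketch
# is self-contained; in real code you would extend the LearnBase functions.
function nobs end
function getobs end

# Hypothetical user dataset: feature columns paired with integer labels.
struct MyDataset
    features::Matrix{Float64}
    labels::Vector{Int}
end

nobs(d::MyDataset) = size(d.features, 2)                   # observations are columns
getobs(d::MyDataset, i) = (d.features[:, i], d.labels[i])  # one (x, y) pair or a batch

d = MyDataset(rand(3, 10), collect(1:10))
nobs(d)          # 10
getobs(d, 1:4)   # a (features, labels) tuple for the first four observations
```

With these two methods defined on the real LearnBase functions, a DataLoader built on this interface can shuffle, batch, and iterate the dataset without knowing anything about its internals.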