Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use LearnBase dataset interface in the DataLoader #1683

Closed
wants to merge 7 commits into from

Conversation

CarloLucibello
Copy link
Member

@CarloLucibello CarloLucibello commented Jul 29, 2021

Users can now define their own dataset types and as long as they define LearnBase.getobs and LearnBase.nobs the DataLoader will be able to handle them.

This PR also adopts dictionaries as one of the supported by default dataset types.

@CarloLucibello CarloLucibello changed the title use LearnBase dataset interface use LearnBase dataset interface in the DataLoader Jul 29, 2021
@CarloLucibello CarloLucibello requested a review from darsnack July 29, 2021 19:43
@CarloLucibello
Copy link
Member Author

where is github's CI? bah

@darsnack
Copy link
Member

darsnack commented Jul 29, 2021

Let me try pushing through JuliaML/MLDataPattern.jl#50 over the weekend and then we can also move some of these default definitions over to LearnBase.jl (or I could create MLBase.jl). Ideally these definitions belong not in Flux so that they get reuse across the ecosystem.

@CarloLucibello
Copy link
Member Author

It does feel like piracy indeed to have those definitions here. I'll add them to LearnBase,
so that you should be able to remove most of the content here
https://github.com/JuliaML/MLDataPattern.jl/blob/9ba23b65f1eb00b75628448f05d7e154ede2961d/src/container.jl
while working at your PR

_getobs(data::AbstractArray, i) = data[ntuple(i -> Colon(), Val(ndims(data) - 1))..., i]
_getobs(data::Union{Tuple, NamedTuple}, i) = map(Base.Fix2(_getobs, i), data)

Base.eltype(::DataLoader{D}) where D = D
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove this because it is actually wrong when batchsize=1. As a consequence though, when we collect the dataloader the get a Vector{Any}:

julia> using Flux.Data

julia> DataLoader(rand(4), batchsize=2) |> collect
2-element Vector{Any}:
 [0.5661901506001856, 0.28811712623170194]
 [0.9265719684408686, 0.05498394818719787]

I don't know how to fix this in a not overly complicated way. I think we will have to live with it.

Copy link
Member Author

@CarloLucibello CarloLucibello Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we can consider

Base.eltype(d::DataLoader) = typeof(first(d))

an acceptable solution. But I don't think we can guarantee this for generic datasets

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably this needs to be consistently done in LearnBase and MLDataPattern. Anything that implements iterate should specify IteratorEltype and consequently eltype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning that if we implement the LearnBase interfaces, then we should use MLDataPattern.eachbatch and MLDataPattern.shuffle. And those containers/iterators should be implementing eltype stuff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll have to take a look at that. In any case, since the current assumptions on master are incorrect, we have to remove eltype for the time being, and we can merge this as it is while we wait for MLDataPattern update

@DhairyaLGandhi
Copy link
Member

Seems like it is adding a dependency on LearnBase. Couldn't users who want to use LearnBase do so already with a loop?

@darsnack
Copy link
Member

LearnBase is a lightweight dependency. How would we be compatible with LearnBase interfaces without extending them for DataLoader?

@CarloLucibello
Copy link
Member Author

Seems like it is adding a dependency on LearnBase.

that's not a problem, LearnBase is an lean dependency

Couldn't users who want to use LearnBase do so already with a loop?

the point here is to create a consistent dataset api for the whole ML ecosystem and have the DataLoader support it

@CarloLucibello
Copy link
Member Author

CarloLucibello commented Aug 8, 2021

This shouldn't be merged until MLDataPattern is made compatible with the new LearnBase version, otherwise FastAI.jl (that depends on MLDataPattern) won't be compatible with newest flux's releases. cc @darsnack

@DhairyaLGandhi
Copy link
Member

Since there is no hard limit on Fastai on which dataloaders to use, I don't see why we would want to make this a requirement here. Flux shouldn't assume such things so users can use what makes sense for their use case, or keep things simple and not force future work in Flux to be compatible with a particular design.

@CarloLucibello
Copy link
Member Author

I don't see your point. This PR is not imposing additional limitations. It is just giving the opportunity to people to define dataset types compatible with the DataLoader without having to hack into Flux's internals

@DhairyaLGandhi
Copy link
Member

Why would we need it to go via DataLoader anyway? The for loop is a simpler implementation for anyone wanting specific iterators.

@CarloLucibello
Copy link
Member Author

Closing as this is implemented in JuliaML/MLUtils.jl#22

@CarloLucibello CarloLucibello deleted the cl/learnbase branch April 7, 2022 07:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants