Not all LINKX datasets are available #4569

Open
OlegPlatonov opened this issue Apr 29, 2022 · 8 comments

Comments

@OlegPlatonov
Contributor

🚀 The feature, motivation and pitch

Hi! I've noticed that PyG now has datasets from the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper. However, not all of the datasets proposed in the paper are provided. It would be great if the remaining datasets from the paper (pokec, genius, wiki, etc.) were added. There are not many heterophilous graph datasets, and it would be very useful to have all of them in one place.

Alternatives

No response

Additional context

No response

@Padarn
Contributor

Padarn commented Apr 30, 2022

Hey @OlegPlatonov, I've opened a PR here to address this: #4570. Please take a look and let me know what you think. I'd be happy to have your input, especially on the features for the deezer-europe dataset.

@OlegPlatonov
Contributor Author

Hi @Padarn! I'm afraid there is a slight misunderstanding. The repository for the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper stores some datasets from prior works, and these are the datasets you have added. However, they are already present in PyG (for example, here and here). What I meant were the new datasets introduced in the paper, specifically pokec, arXiv-year, snap-patents, genius, twitch-gamers and wiki.

@Padarn
Contributor

Padarn commented Apr 30, 2022 via email

@Padarn
Contributor

Padarn commented May 1, 2022

Hey @OlegPlatonov, actually many of the datasets are already in PyG. For example, 'twitch-gamers' is available here.

However, the paper says that some of these datasets have been updated:

Most of these datasets have been used for evaluation of graph machine learning models in past
work; we make adjustments such as modifying node labels and adding node features that allow for
evaluation of GNNs in non-homophilous settings. We define node features for Pokec, genius, and
snap-patents, and we also define node labels for arXiv-year, snap-patents, and genius. Additionally,
we crawl and clean the large-scale wiki dataset — a new Wikipedia dataset where the task is to
predict page views, which is non-homophilous with respect to the graph of articles connected by
links between articles (see Appendix D.3). This wiki dataset has 1,925,342 nodes and 303,434,860
edges, so training and inference require scalable algorithms.

The only one totally new is the wiki dataset as far as I can tell.

I've updated the MR to add only 'genius', which did seem to be missing before.
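
To give a rough sense of why the quoted passage says the wiki dataset needs scalable algorithms, here is a back-of-the-envelope sketch using only the node and edge counts from the quote. The storage layout is an assumption, not something stated in the thread: edges are taken to be stored as a dense `[2, num_edges]` array of 64-bit integers, and node features and labels are ignored.

```python
# Sizes quoted from the paper for the wiki dataset.
NUM_NODES = 1_925_342
NUM_EDGES = 303_434_860

# Assumption (not from the thread): each edge endpoint is stored as a
# 64-bit integer in a dense [2, num_edges] array.
BYTES_PER_INDEX = 8


def edge_index_bytes(num_edges, bytes_per_index=BYTES_PER_INDEX):
    """Bytes needed to hold both endpoints of every edge."""
    return 2 * num_edges * bytes_per_index


avg_edges_per_node = NUM_EDGES / NUM_NODES
edge_index_gib = edge_index_bytes(NUM_EDGES) / 2**30

print(f"edges per node: {avg_edges_per_node:.1f}")
print(f"edge index size: {edge_index_gib:.2f} GiB")
```

Under these assumptions the edge list alone comes to roughly 4.5 GiB, before any node features are loaded, which is consistent with the paper's remark about scalability.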

@OlegPlatonov
Contributor Author

Hey @Padarn, indeed most datasets from the paper are not entirely new, but unless I'm missing something, they are not available in PyG (at least not in the form used in the paper). I've just checked and could not find the arXiv-year, snap-patents, genius and wiki datasets in PyG. The pokec dataset is available here, but it does not contain the node features that were defined in the paper. As for the twitch dataset, the version in PyG is a collection of 6 different graphs, which is different from the single twitch-gamers graph used in the paper (the number of nodes does not match).

@Padarn
Contributor

Padarn commented May 1, 2022 via email

@rusty1s
Member

rusty1s commented May 2, 2022

Thanks @Padarn for your work on adding some of these datasets. I think adding the remaining ones is definitely of interest to the community, especially in order to accelerate GNN research on heterophilous graphs. Let's try to tackle this in follow-up PRs.

@Padarn
Contributor

Padarn commented May 2, 2022

Yep, aligned. I think adding wiki from the paper is the highest priority.
