Not all LINKX datasets are available #4569

OlegPlatonov · 2022-04-29T18:16:45Z

🚀 The feature, motivation and pitch

Hi! I've noticed that PyG now has datasets from the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper. However, for some reason, not all datasets proposed in the paper are provided. It would be great if the remaining datasets from the paper (pokec, genius, wiki, etc.) were added. There are not many heterophilous graph datasets, and it would be very useful to have all of them in one place.

Alternatives

No response

Additional context

No response

Padarn · 2022-04-30T00:51:05Z

Hey @OlegPlatonov, I've opened a PR to address this: #4570. Please take a look and let me know what you think. Happy to have your input, especially on the features for the deezer-europe dataset.

OlegPlatonov · 2022-04-30T13:21:19Z

Hi @Padarn! I'm afraid there is a slight misunderstanding. The repository for the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper stores some datasets from prior works, and these are the datasets you have added. However, they are already present in PyG (for example, torch_geometric.datasets.WikipediaNetwork and torch_geometric.datasets.DeezerEurope). What I meant were the new datasets introduced in the paper, specifically pokec, arXiv-year, snap-patents, genius, twitch-gamers and wiki.

Padarn · 2022-04-30T13:32:34Z

Oh I see, thanks for the clarification! I didn't look carefully enough at what already exists. I'll update tomorrow.

Padarn · 2022-05-01T01:50:14Z

Hey @OlegPlatonov - actually, many of the datasets are already in PyG. For example, 'twitch-gamers' is available here.

However, the paper says that some of these datasets have been updated:

Most of these datasets have been used for evaluation of graph machine learning models in past
work; we make adjustments such as modifying node labels and adding node features that allow for
evaluation of GNNs in non-homophilous settings. We define node features for Pokec, genius, and
snap-patents, and we also define node labels for arXiv-year, snap-patents, and genius. Additionally,
we crawl and clean the large-scale wiki dataset — a new Wikipedia dataset where the task is to
predict page views, which is non-homophilous with respect to the graph of articles connected by
links between articles (see Appendix D.3). This wiki dataset has 1,925,342 nodes and 303,434,860
edges, so training and inference require scalable algorithms.

The only one totally new is the wiki dataset as far as I can tell.

I've updated the PR to just add 'genius', which did seem to be missing before.
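
For a rough sense of scale of the wiki graph quoted above, here is a back-of-the-envelope sketch. The node and edge counts come from the paper excerpt. The storage layout (each edge stored once in a PyG-style `edge_index` of shape [2, num_edges] with 64-bit integers) is an assumption made for illustration, not something stated in this thread, and the helper names are hypothetical.

```python
# Counts quoted from the paper excerpt above.
NUM_NODES = 1_925_342
NUM_EDGES = 303_434_860

# Assumption: edges are kept as a COO list of shape [2, num_edges]
# with int64 entries (8 bytes per index), each edge stored once.
BYTES_PER_INDEX = 8


def edge_index_bytes(num_edges: int, bytes_per_index: int = BYTES_PER_INDEX) -> int:
    """Memory needed for the edge list alone: two indices per edge."""
    return 2 * num_edges * bytes_per_index


def average_degree(num_nodes: int, num_edges: int) -> float:
    """Number of stored edges per node."""
    return num_edges / num_nodes


if __name__ == "__main__":
    size = edge_index_bytes(NUM_EDGES)
    print(f"edge index size: {size / 2**30:.2f} GiB")
    print(f"edges per node: {average_degree(NUM_NODES, NUM_EDGES):.1f}")
```

Under these assumptions the edge list alone takes several gigabytes before any node features are loaded, which is consistent with the paper's remark that training and inference require scalable algorithms.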

OlegPlatonov · 2022-05-01T12:46:31Z

Hey @Padarn - indeed, most datasets from the paper are not entirely new, but unless I'm missing something, they are not available in PyG (at least not in the form used in the paper). I've just checked and could not find the arXiv-year, snap-patents, genius and wiki datasets in PyG. The pokec dataset is available here, but it does not contain the node features that were defined in the paper. As for the twitch dataset, the version in PyG is a collection of 6 different graphs, which is different from the single twitch-gamers graph used in the paper (the number of nodes does not match).

Padarn · 2022-05-01T13:39:29Z

Yeah, understood: they're not all there, and some are updated in the paper. It just wasn't immediately clear what the best thing to do was with updated datasets that we already have. Maybe it's easier to tackle them separately across a few PRs? I have one open for genius now, so maybe we could prioritize the others?

rusty1s · 2022-05-02T07:25:39Z

Thanks @Padarn for your work on adding some of these datasets. I think adding the remaining ones is definitely of interest to the community, especially in order to accelerate GNN research on heterophilous graphs. Let's try to tackle this in follow-up PRs.

Padarn · 2022-05-02T10:04:40Z

Yep, aligned. I think adding wiki from the paper is the highest priority.

OlegPlatonov added the feature label Apr 29, 2022

Padarn mentioned this issue Apr 30, 2022

Add missing Genius dataset #4570

Merged

rusty1s added 0 - Priority P0 dataset labels May 2, 2022

rusty1s assigned Padarn, rusty1s and OlegPlatonov May 2, 2022

Padarn mentioned this issue May 7, 2022

Adding 'wiki' dataset to LINKXDataset #4600

Merged
