Question: What's the 'correct' way to access all results of a CrawlJob? #41

Closed
jonathantullett opened this issue Mar 8, 2016 · 7 comments

Comments

@jonathantullett

From reading the docs, it looks like loading the JSON via the downloadUrl() method on the Crawl job is the only way to do it. However, since that means processing raw JSON data with no getters, setters, or objects, it smells... wrong.

Is there a better way of doing this?

Related: since the crawl job updates as new pages are discovered, is there a way to download only the new data, i.e. data added since the last query (so nothing gets reprocessed), or is that left as an exercise for the reader?

Thanks!

@Swader
Owner

Swader commented Mar 8, 2016

Use the Search API to access Crawlbot produced datasets and get back actual usable objects as defined by the client.

$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();

foreach ($result as $article) {
    echo $article->getTitle();
}

The API itself always returns the whole dataset when asked for data, so no, there's no way to "stream" only new results. I am planning to add that to the client in the foreseeable future (see #5).
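
Until something like that exists in the client, one way to avoid reprocessing is to remember which items were already handled and skip them on the next full fetch. A minimal sketch, assuming each result can be reduced to a stable key such as its page URL; `filterUnseen`, the `pageUrl` array key, and the array-shaped items are hypothetical stand-ins and not part of the client:

```php
<?php
/**
 * Return only the items whose key has not been seen before,
 * and record the new keys in $seen (passed by reference).
 *
 * @param iterable $items Full result set from the latest fetch.
 * @param array    $seen  Map of key => true for already-processed items.
 * @param callable $keyOf Extracts a stable identifier from one item.
 */
function filterUnseen(iterable $items, array &$seen, callable $keyOf): array
{
    $fresh = [];
    foreach ($items as $item) {
        $key = $keyOf($item);
        if (isset($seen[$key])) {
            continue; // already processed in an earlier run
        }
        $seen[$key] = true;
        $fresh[] = $item;
    }
    return $fresh;
}

// Hypothetical data standing in for two successive full fetches.
$seen = [];
$keyOf = function (array $item): string {
    return $item['pageUrl'];
};

$firstRun = filterUnseen([
    ['pageUrl' => 'http://example.com/a'],
    ['pageUrl' => 'http://example.com/b'],
], $seen, $keyOf);

$secondRun = filterUnseen([
    ['pageUrl' => 'http://example.com/a'],
    ['pageUrl' => 'http://example.com/b'],
    ['pageUrl' => 'http://example.com/c'],
], $seen, $keyOf);

echo count($firstRun), ' new, then ', count($secondRun), ' new', PHP_EOL;
```

For this to work across runs, `$seen` would need to be persisted somewhere (a file or database), which the sketch leaves out.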

@jonathantullett
Author

I looked at the Search API. According to the Diffbot docs, searching with an empty string should return all results: 'Leave blank to return all objects in the collection(s).' (under query operators on https://www.diffbot.com/dev/docs/search/). However, I'm not seeing the data being returned. I've looked at the URL directly and don't see the data there either, so perhaps the documentation is out of sync on their end?

If this is correct, is loading/processing the raw json the only way of dealing with the whole dataset?

@Swader
Owner

Swader commented Mar 8, 2016

It would appear this has changed, indeed. Let me check and get back to you.

@Swader
Owner

Swader commented Mar 8, 2016

Confirmed, and docs have been updated. An empty search query will not return the full set.

However, seeing as the collection will contain entities of various types, you won't be able to use them properly in a loop anyway. Consider:

foreach ($result as $article) {
    echo $article->getTitle();
}

This would fail if some of the entities were custom, or products, or whatnot. Ergo, it would likely be best to query with at least a type (type:article). If all the entities in the result set are of the same type, even better: you get all your entities, AND you're sure they're exactly what you expect. Would this be acceptable?
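
To make the mixed-type problem concrete, here is a minimal self-contained sketch. `ArticleStub`, `ProductStub`, and `collectTitles` are hypothetical stand-ins, not the client's real entity classes; the point is only that a loop over a mixed collection has to check what it is holding before calling a type-specific getter:

```php
<?php
// Hypothetical stand-ins for two entity types; the real client defines its own classes.
class ArticleStub
{
    private $title;

    public function __construct(string $title)
    {
        $this->title = $title;
    }

    public function getTitle(): string
    {
        return $this->title;
    }
}

class ProductStub
{
    private $sku;

    public function __construct(string $sku)
    {
        $this->sku = $sku;
    }

    public function getSku(): string
    {
        return $this->sku;
    }
}

/** Collect titles from article entities only, ignoring every other type. */
function collectTitles(iterable $entities): array
{
    $titles = [];
    foreach ($entities as $entity) {
        if ($entity instanceof ArticleStub) {
            $titles[] = $entity->getTitle();
        }
        // Anything else is skipped; calling getTitle() on a ProductStub
        // would raise an "undefined method" error.
    }
    return $titles;
}

$mixed = [new ArticleStub('First'), new ProductStub('sku-1'), new ArticleStub('Second')];
$titles = collectTitles($mixed);
echo implode(', ', $titles), PHP_EOL;
```

Restricting the query to a single type removes the need for this kind of check, which is why it is the simpler option.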

@jonathantullett
Author

Yes, querying by type makes sense, and as you say, ensures the results are an object type you're expecting. Is that a change you need to make in the codebase or something that's already supported?

@Swader
Owner

Swader commented Mar 9, 2016

Already in there. Just pass "type:x" as the query, where x is your desired type.
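
A small sketch of that, assuming a helper for building the query string; `typeQuery` is a hypothetical name and not part of the client. The commented lines reuse the `search()` and `call()` calls shown earlier in this thread and assume an already configured `$diffbot` instance:

```php
<?php
/** Build a Search API query string restricted to one entity type. */
function typeQuery(string $type): string
{
    $type = trim($type);
    if ($type === '') {
        // An empty query no longer returns the full set, so refuse it early.
        throw new InvalidArgumentException('Type must not be empty.');
    }
    return 'type:' . $type;
}

$query = typeQuery('article');
echo $query, PHP_EOL;

// With a configured client instance ($diffbot, as in the earlier comment):
// $result = $diffbot->search($query)->call();
// foreach ($result as $article) {
//     echo $article->getTitle();
// }
```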

@jonathantullett
Author

Working perfectly - thanks!
