Question: What's the 'correct' way to access all results of a CrawlJob? #41

jonathantullett · 2016-03-08T12:23:13Z

From reading the docs, it looks like loading the JSON via the downloadUrl() method on the Crawl job is the only way to do it. However, since that means processing the raw JSON data, it won't give any getters/setters/objects, so it smells... wrong.

Is there a better way of doing this?

Related, as the crawl job updates as new pages are discovered, is there a way of just downloading the new dataset - data since the last query (so there's no reprocessing of data) - or is that left as an exercise for the reader?

Thanks!

Swader · 2016-03-08T12:28:23Z

Use the Search API to access Crawlbot produced datasets and get back actual usable objects as defined by the client.

$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();

foreach ($result as $article) {
    echo $article->getTitle();
}

The API itself always returns the whole dataset when asked for data, so no, there's no way to "stream" only new results. I am planning to add that to the client in the foreseeable future, though (see #5).
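
In the meantime, one possible workaround is to remember which results were already handled and skip them on the next full fetch. This is only a minimal sketch: the filterNewItems() helper, the 'pageUrl' key, and the plain arrays standing in for results are assumptions for illustration, not part of the client.

```php
<?php
// Hypothetical helper: returns only the items whose identifier has not been seen yet.
// $seen maps identifier => true and is updated in place.
function filterNewItems(array $items, array &$seen, string $idKey = 'pageUrl'): array
{
    $fresh = [];
    foreach ($items as $item) {
        $id = $item[$idKey] ?? null;
        if ($id === null || isset($seen[$id])) {
            continue; // no identifier, or already processed
        }
        $seen[$id] = true;
        $fresh[] = $item;
    }
    return $fresh;
}

// Plain arrays stand in for the results of two successive full-dataset fetches.
$seen = [];
$firstRun = filterNewItems([['pageUrl' => 'http://example.com/a']], $seen);
$secondRun = filterNewItems([
    ['pageUrl' => 'http://example.com/a'],
    ['pageUrl' => 'http://example.com/b'],
], $seen);

echo count($secondRun), "\n"; // only the entry for /b is new
```

Persisting $seen between runs (a file, a database table) is left to the caller; the whole dataset is still downloaded each time, only the reprocessing is avoided.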

jonathantullett · 2016-03-08T12:32:20Z

I looked at the Search API, and according to the Diffbot docs, searching with an empty string should return all results: 'Leave blank to return all objects in the collection(s).' (under query operators on https://www.diffbot.com/dev/docs/search/). However, I'm not seeing the data being returned. I've looked at the URL directly and don't see the data there either, so perhaps the documentation is out of sync on their end?

If this is correct, is loading/processing the raw json the only way of dealing with the whole dataset?

Swader · 2016-03-08T18:36:22Z

It would appear this has changed, indeed. Let me check and get back to you.

Swader · 2016-03-08T18:47:34Z

Confirmed, and docs have been updated. An empty search query will not return the full set.

However, seeing as the collection will contain entities of various types, you won't be able to use them properly in a loop anyway. Consider:

foreach ($result as $article) {
    echo $article->getTitle();
}

This would fail if some of the entities were custom, or products, or whatnot. Ergo, it would likely be best to query with at least a type (type:article). If all the entities in the result set are of the same type, even better: you get all your entities, AND you're sure they're exactly what you expect. Would this be acceptable?
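
To illustrate the failure mode, and a defensive alternative for cases where a mixed collection can't be avoided, here is a minimal sketch. The Article and Product classes below are stand-ins written for this example; the client's real entity classes are not shown in this thread and may differ.

```php
<?php
// Stand-in classes for illustration only; not the client's actual entity classes.
class Article
{
    private $title;
    public function __construct(string $title) { $this->title = $title; }
    public function getTitle(): string { return $this->title; }
}

class Product
{
    private $price;
    public function __construct(float $price) { $this->price = $price; }
    public function getPrice(): float { return $this->price; }
}

// A mixed result set: calling getTitle() on every element would fail on the Product.
$result = [new Article('First post'), new Product(9.99), new Article('Second post')];

$titles = [];
foreach ($result as $entity) {
    // Only touch entities that are known to have the method we need.
    if ($entity instanceof Article) {
        $titles[] = $entity->getTitle();
    }
}

echo implode(', ', $titles), "\n";
```

Restricting the query to a single type makes this kind of guard unnecessary, which is why it is the simpler option.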

jonathantullett · 2016-03-08T21:19:47Z

Yes, querying by type makes sense, and as you say, ensures the results are an object type you're expecting. Is that a change you need to make in the codebase or something that's already supported?

Swader · 2016-03-09T05:04:29Z

Already in there. Just pass "type:x" as the query, where x is your desired type.
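
A small sketch of composing such a query string follows. The buildTypeQuery() helper is hypothetical and exists only for this example; the search() and call() usage in the trailing comment mirrors the snippet earlier in this thread and assumes an already configured $diffbot client.

```php
<?php
// Hypothetical helper: builds a Search API query restricted to one entity type,
// optionally combined with further terms using AND.
function buildTypeQuery(string $type, array $extraTerms = []): string
{
    return implode(' AND ', array_merge(['type:' . $type], $extraTerms));
}

$query = buildTypeQuery('article');
$narrowed = buildTypeQuery('article', ['author:"Miles Johnson"']);

echo $query, "\n";
echo $narrowed, "\n";

// With a configured client, as shown earlier in the thread:
// $search = $diffbot->search($query);
// $result = $search->call();
```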

jonathantullett · 2016-03-10T12:09:29Z

Working perfectly - thanks!

jonathantullett closed this as completed Mar 10, 2016
