Question: What's the 'correct' way to access all results of a CrawlJob? #41
Use the Search API to access Crawlbot-produced datasets and get back actual usable objects as defined by the client:

```php
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();

foreach ($result as $article) {
    echo $article->getTitle();
}
```

The API itself always returns the whole dataset when asked for data, so there's no way to "stream" only new results, no, but I am planning to add that to the client in the foreseeable future (see #5).
I looked at the Search API, and according to the Diffbot docs, searching with an empty string should return all results: "Leave blank to return all objects in the collection(s)." (under query operators on https://www.diffbot.com/dev/docs/search/). However, I'm not seeing any data being returned (I've looked at the URL directly and don't see the data there either), so perhaps the documentation is out of sync on their end? If this is correct, is loading/processing the raw JSON the only way of dealing with the whole dataset?
It would appear this has changed, indeed. Let me check and get back to you.
Confirmed, and the docs have been updated. An empty search query will not return the full set. However, seeing as the collection will contain entities of various types, you won't be able to use them properly in a loop anyway. Consider:

```php
foreach ($result as $article) {
    echo $article->getTitle();
}
```

This would fail if some of the entities were custom, or products, or whatnot. Ergo, it would likely be best if you queried with at least a type.
Yes, querying by type makes sense, and as you say, it ensures the results are an object type you're expecting. Is that a change you need to make in the codebase, or something that's already supported?
Already in there. Just pass "type:x" as the query, where x is your desired type.
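A minimal sketch of that usage, reusing the `$diffbot` client and `search()`/`call()` API from the earlier example; "product" is just an illustrative type, and the getter assumes product entities expose `getTitle()` like articles do:

```php
<?php
// Sketch: restrict a Crawlbot search to one entity type so every
// result hydrates into the same known object class.
$search = $diffbot->search('type:product');
$result = $search->call();

foreach ($result as $product) {
    // Safe to call type-specific getters - all entities are products.
    echo $product->getTitle(), "\n";
}
```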
Working perfectly - thanks!
From reading the docs, it looks like loading the JSON via the downloadUrl() method on the crawl job is the only way to do it; however, as that won't give any getters/setters/objects (because it's processing the raw JSON data), it smells... wrong.
Is there a better way of doing this?
Related, as the crawl job updates as new pages are discovered, is there a way of just downloading the new dataset - data since the last query (so there's no reprocessing of data) - or is that left as an exercise for the reader?
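For reference, the raw-download approach described above looks roughly like this. Only the downloadUrl() method name comes from the docs; the argument to it, the URL fetching, and the decoded field names are assumptions for illustration:

```php
<?php
// Sketch only: fetch a crawl job's full result set as raw JSON.
// The 'json' argument and the 'title' field are assumed, not
// confirmed against the actual Crawlbot API.
$url  = $job->downloadUrl('json');
$data = json_decode(file_get_contents($url), true);

foreach ($data as $entity) {
    // Plain associative arrays, not typed objects - hence the
    // "smell" noted above: no getters/setters, no type safety.
    echo $entity['title'] ?? '(untitled)', "\n";
}
```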
Thanks!