**CHANGELOG.md**

# Changelog

All notable changes will be documented in this file.

## 0.3 - May 17th, 2015

### Internal changes

- [Internal] DiffbotAware trait now responsible for registering Diffbot parent in children
- [BC Break, Internal] PHP 5.6 is now required (`...` operator)
- [Internal] Updated all API calls to HTTPS

### Features

- [Feature] Implemented Crawlbot API, added usage example to README
- [Feature] Added `Job` abstract entity with `JobCrawl` and `JobBulk` derivations. A `Job` is either a [Bulk API job](https://www.diffbot.com/dev/docs/bulk) or a [Crawl job](https://www.diffbot.com/dev/docs/crawl). A collection of jobs is the result of a Crawl or Bulk API call. When a job name is provided, at most one item is present in the collection.
**README.md**

Right now it only supports Analyze, Product, Image, Discussion and Article APIs, …

## Requirements

Minimum PHP 5.6 is required. When installed via Composer, the library will pull in Guzzle 5 as well, so having cURL installed is recommended, though not required.
## Install
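Assuming a standard Composer install, bootstrapping the library looks roughly like this (a minimal sketch; the `Swader\Diffbot\Diffbot` namespace is an assumption - the examples below use just `Diffbot`):

```
use Swader\Diffbot\Diffbot; // full namespace is an assumption

// Composer's standard autoloader
require __DIR__ . '/vendor/autoload.php';

// Instantiate with your developer token, as in the examples below
$diffbot = new Diffbot('my_token');
```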
Currently available [*automatic*](http://www.diffbot.com/products/automatic/) APIs include:

- [discussion](http://www.diffbot.com/products/automatic/discussion/) (fetches discussion / review / comment threads - can be embedded in the Product or Article return data, too, if those contain any comments or discussions)
- [analyze](http://www.diffbot.com/products/automatic/analyze/) (combines all the above in that it automatically determines the right API for the URL and applies it)

Video is coming soon. See below for instructions on the Crawlbot, Search and Bulk APIs.
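For illustration, the automatic APIs listed above all follow the same calling pattern (a sketch only - the `createAnalyzeAPI` factory method name is an assumption mirroring this client's `create*API` naming, and the entity contents depend on which API ends up being applied):

```
// Sketch only - createAnalyzeAPI is an assumed factory method name,
// mirroring the create*API naming convention of this client
$diffbot = new Diffbot('my_token');

$api = $diffbot->createAnalyzeAPI('http://sitepoint.com');
$results = $api->call();

foreach ($results as $entity) {
    // each entity matches whichever API Analyze decided to apply
}
```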
There is also a [Custom API](http://www.diffbot.com/products/custom/) like [this one](http://www.sitepoint.com/analyze-sitepoint-author-portfolios-diffbot/) - unless otherwise configured, custom APIs return instances of the Wildcard entity.
…

Used just like all others. There are only two differences: …

The following is a usage example of my own custom API for author profiles at SitePoint:

…

Of course, you can easily extend the basic Custom API class and make your own, as well as add your own Entities that perfectly correspond to the returned data. This will all be covered in a tutorial in the near future.
## Crawlbot and Bulk API

Basic Crawlbot support has been added to the library. To find out more about Crawlbot and what, how and why it does what it does, see the [Crawlbot docs](https://www.diffbot.com/dev/docs/crawl/). I also recommend reading the [Crawlbot API docs](https://www.diffbot.com/dev/docs/crawl/api.jsp) and the [Crawlbot support topics](http://support.diffbot.com/topics/crawlbot/) so you can dive right in without being too confused by the code below.

In a nutshell, Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed as a seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it can find using the API you define (or the Analyze API by default).
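To make this concrete, the smallest possible crawl job looks like the sketch below (`my_crawl` is a placeholder job name; no API instance is passed in, so the Analyze API in auto mode is used, as just described):

```
// Minimal crawl job sketch - no API instance is given, so Crawlbot
// falls back to the Analyze API in auto mode, as explained above
$diffbot = new Diffbot('my_token');

$crawl = $diffbot->crawl('my_crawl'); // 'my_crawl' is a placeholder name
$crawl->setSeeds(['http://example.com']);

$job = $crawl->call();
```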
### List of all crawl / bulk jobs

A joint list of all your crawl / bulk jobs can be fetched via:

```
$diffbot = new Diffbot('my_token');
$jobs = $diffbot->crawl()->call();
```

This returns a collection of all crawl and bulk jobs. Each type is represented by its own class: `JobCrawl` and `JobBulk`. It's important to note that jobs only contain information about the job - not the data. To get the data of a job, use the `downloadUrl` method to get the URL of the dataset:

```
$url = $job->downloadUrl("json");
```
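Since the returned collection mixes both job types, telling them apart is a matter of checking each entity's class (a sketch - it assumes the collection returned by `call()` can be iterated directly, as the examples below suggest):

```
// Sketch: separate crawl jobs from bulk jobs in the combined collection
foreach ($jobs as $job) {
    if ($job instanceof JobCrawl) {
        // a Crawl job - grab the URL of its JSON dataset
        $url = $job->downloadUrl("json");
    } elseif ($job instanceof JobBulk) {
        // a Bulk job
    }
}
```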
### Crawl jobs: Creating a Crawl Job

See the inline comments for a step by step explanation.

```
// Create a new Diffbot instance as usual
$diffbot = new Diffbot('my_token');

// The crawlbot needs to be told which API to use to process crawled pages. This is optional - if omitted, it will be told to use the Analyze API with mode set to auto.
// The "crawl" url is a flag to tell APIs to prepare for consumption with Crawlbot, letting them know they won't be used directly.
// Note: the next three lines are an assumed reconstruction - any automatic API should work here
$url = 'crawl';
$api = $diffbot->createArticleAPI($url);

// Create the crawl job with a unique name, passing in the API instance
$crawl = $diffbot->crawl('sitepoint_01', $api);

// Set seeds - seeds are URLs to crawl. By default, passing a subdomain into the crawl will also crawl other subdomains on the main domain, including www.
$crawl->setSeeds(['http://sitepoint.com']);

// Call as usual - an EntityIterator collection of results is returned. When creating a job, exactly one job entity is returned.
$job = $crawl->call();

// See the JobCrawl class to find out which getters are available
dump($job->getDownloadUrl("json")); // outputs the download URL of the JSON dataset of the job's results
```
### Crawl jobs: Inspecting an existing Crawl Job

To get data about a job (this will be the data it was configured with - its flags - and not the results!), use the exact same approach as if creating a new one, only without the API and seeds:

```
$diffbot = new Diffbot('my_token');

$crawl = $diffbot->crawl('sitepoint_01');

$job = $crawl->call();

dump($job->getDownloadUrl("json")); // outputs the download URL of the JSON dataset of the job's results
```
### Crawl jobs: Modifying an existing Crawl Job

While there is no way to alter a crawl job's configuration after creation, you can still perform some operations on it.

Provided you fetched a `$crawl` instance as in the section on inspecting above, you can do the following:

```
// Force the start of a new crawl round manually
$crawl->roundStart();

// Pause or unpause (0) a job
$crawl->pause();
$crawl->pause(0);

// Restart removes all crawled data but keeps the job (and its settings)
$crawl->restart();

// Delete a job and all related data
$crawl->delete();
```

Note that it is not necessary to issue a `call()` after these methods.
If you would like to extract the generated API call URL for these instant-call actions instead of executing them, pass in the parameter `false`, like so:

```
$crawl->delete(false);
```

You can then save the URL for your convenience and call `call` when ready to execute (if at all):

```
$url = $crawl->buildUrl();
$url->call();
```
## Testing

Just run PHPUnit in the root folder of the cloned project.