Commit 45cb6db

Implemented Crawlbot, this closes #4 and closes #1
1 parent 4b28488 commit 45cb6db

40 files changed: +2767 -88 lines changed

.scrutinizer.yml

+1-1
Original file line number | Diff line number | Diff line change
@@ -18,4 +18,4 @@ checks:
1818
tools:
1919
external_code_coverage:
2020
timeout: 600
21-
runs: 3
21+
runs: 1

.travis.yml

-2
@@ -1,8 +1,6 @@
11
language: php
22

33
php:
4-
- 5.4
5-
- 5.5
64
- 5.6
75
- 7.0
86
- hhvm

CHANGELOG.md

+23
@@ -1,6 +1,29 @@
11
#Changelog
22
All notable changes will be documented in this file
33

4+
## 0.3 - May 17th, 2015
5+
6+
### Internal changes
7+
8+
- [Internal] DiffbotAware trait now responsible for registering Diffbot parent in children
9+
- [BC Break, Internal] PHP 5.6 is now required (`...` operator)
10+
- [Internal] Updated all API calls to HTTPS
11+
12+
### Features
13+
14+
- [Feature] Implemented Crawlbot API, added usage example to README
15+
- [Feature] Added `Job` abstract entity with `JobCrawl` and `JobBulk` derivations. A `Job` is either a [Bulk API job](https://www.diffbot.com/dev/docs/bulk) or a [Crawl job](https://www.diffbot.com/dev/docs/crawl). A collection of jobs is the result of a Crawl or Bulk API call. When a job name is provided, the collection contains at most one item.
16+
17+
### Bugs
18+
19+
- [Bug] Fixed [#1](https://github.com/Swader/diffbot-php-client/issues/1)
20+
21+
### Meta
22+
23+
- [Repository] Added TODOs as issues in repo, linked to relevant ones in [TODO file](TODO.md).
24+
- [CI] Stopped testing on PHP 5.4 and 5.5, updated the Travis and Scrutinizer files to take this into account
25+
- [Tests] Fully tested Crawlbot implementation
26+
427
## 0.2 - May 2nd, 2015
528

629
- added Discussion API
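
The 0.3 changelog entry above ties the PHP 5.6 requirement to the `...` operator. As a rough illustration of what that operator does, here is a minimal sketch. The helper `joinSeeds` is hypothetical and does not come from the library; where the library itself uses `...` is not shown in this excerpt.

```php
<?php
// Hypothetical helper for illustration only, not part of the library.
// Variadic parameter: collects any number of arguments into an array (PHP 5.6+).
function joinSeeds(...$seeds)
{
    return implode(' ', $seeds);
}

// Argument unpacking: spreads an array into separate arguments (PHP 5.6+).
$seeds = ['http://sitepoint.com', 'http://blog.diffbot.com'];
echo joinSeeds(...$seeds), PHP_EOL;
```

Both forms are parse errors on PHP 5.5 and earlier, which is why using the operator anywhere in the codebase raises the minimum version.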

README.md

+102-3
@@ -10,7 +10,7 @@ Right now it only supports Analyze, Product, Image, Discussion and Article APIs,
1010

1111
## Requirements
1212

13-
Minimum PHP 5.4 because Guzzle needs it.
13+
PHP 5.6 or later is required. When installed via Composer, the library also pulls in Guzzle 5, so having cURL installed is recommended but not required.
1414

1515
## Install
1616

@@ -59,7 +59,7 @@ Currently available [*automatic*](http://www.diffbot.com/products/automatic/) AP
5959
- [discussion](http://www.diffbot.com/products/automatic/discussion/) (fetches discussion / review / comment threads - can be embedded in the Product or Article return data, too, if those contain any comments or discussions)
6060
- [analyze](http://www.diffbot.com/products/automatic/analyze/) (combines all the above in that it automatically determines the right API for the URL and applies it)
6161

62-
Video is coming soon.
62+
Video is coming soon. See below for instructions on Crawlbot, Search and Bulk API.
6363

6464
There is also a [Custom API](http://www.diffbot.com/products/custom/), like [this one](http://www.sitepoint.com/analyze-sitepoint-author-portfolios-diffbot/). Unless otherwise configured, custom APIs return instances of the Wildcard entity.
6565

@@ -200,7 +200,7 @@ Used just like all others. There are only two differences:
200200
The following is a usage example of my own custom API for author profiles at SitePoint:
201201

202202
```php
203-
$diffbot = new Diffbot('brunoskvorc');
203+
$diffbot = new Diffbot('my_token');
204204
$customApi = $diffbot->createCustomAPI('http://sitepoint.com/author/bskvorc', 'authorFolioNew');
205205

206206
$return = $customApi->call();
@@ -213,6 +213,105 @@ foreach ($return as $wildcard) {
213213

214214
Of course, you can easily extend the basic Custom API class and make your own, as well as add your own Entities that perfectly correspond to the returned data. This will all be covered in a tutorial in the near future.
215215

216+
## Crawlbot and Bulk API
217+
218+
Basic Crawlbot support has been added to the library.
219+
To find out more about what Crawlbot does, and how and why it does it, see the [Crawlbot documentation](https://www.diffbot.com/dev/docs/crawl/).
220+
I also recommend reading the [Crawlbot API docs](https://www.diffbot.com/dev/docs/crawl/api.jsp) and the [Crawlbot support topics](http://support.diffbot.com/topics/crawlbot/) just so you can dive right in without being too confused by the code below.
221+
222+
In a nutshell, Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed to it as a seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes every page it finds using the API you define (or the Analyze API by default).
223+
224+
### List of all crawl / bulk jobs
225+
226+
A joint list of all your crawl / bulk jobs can be fetched via:
227+
228+
```
229+
$diffbot = new Diffbot('my_token');
230+
$jobs = $diffbot->crawl()->call();
231+
```
232+
233+
This returns a collection of all crawl and bulk jobs. Each type is represented by its own class: `JobCrawl` and `JobBulk`. Note that jobs only contain information about the job, not the data. To get the data of a job, use the `getDownloadUrl` method to get the URL of the dataset:
234+
235+
```
236+
$url = $job->getDownloadUrl("json");
237+
```
238+
239+
### Crawl jobs: Creating a Crawl Job
240+
241+
See the inline comments for a step-by-step explanation:
242+
243+
```
244+
// Create new diffbot as usual
245+
$diffbot = new Diffbot('my_token');
246+
247+
// The crawlbot needs to be told which API to use to process crawled pages. This is optional - if omitted, it will be told to use the Analyze API with mode set to auto.
248+
// The "crawl" url is a flag to tell APIs to prepare for consumption with Crawlbot, letting them know they won't be used directly.
249+
$url = 'crawl';
250+
$articleApi = $diffbot->createArticleAPI($url)->setDiscussion(false);
251+
252+
// Make a new crawl job. Optionally, pass in API instance
253+
$crawl = $diffbot->crawl('sitepoint_01', $articleApi);
254+
255+
// Set seeds - seeds are URLs to crawl. By default, passing a subdomain into the crawl will also crawl other subdomains on main domain, including www.
256+
$crawl->setSeeds(['http://sitepoint.com']);
257+
258+
// Call as usual - an EntityIterator collection of results is returned. When a job is created, the collection always contains exactly one job entity.
259+
$job = $crawl->call();
260+
261+
// See JobCrawl class to find out which getters are available
262+
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
263+
```
264+
265+
### Crawl jobs: Inspecting an existing Crawl Job
266+
267+
To get data about a job (this will be the data it was configured with - its flags - and not the results!), use the exact same approach as if creating a new one, only without the API and seeds:
268+
269+
```
270+
$diffbot = new Diffbot('my_token');
271+
272+
$crawl = $diffbot->crawl('sitepoint_01');
273+
274+
$job = $crawl->call();
275+
276+
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
277+
```
278+
279+
### Crawl jobs: Modifying an existing Crawl Job
280+
281+
While there is no way to alter a crawl job's configuration post creation, you can still do some operations on it.
282+
283+
Provided you fetched a `$crawl` instance as in the above section on inspecting, you can do the following:
284+
285+
```
286+
// Force start of a new crawl round manually
287+
$crawl->roundStart();
288+
289+
// Pause a job, or unpause it by passing 0
290+
$crawl->pause();
291+
$crawl->pause(0);
292+
293+
// Restart removes all crawled data but keeps the job (and settings)
294+
$crawl->restart();
295+
296+
// Delete a job and all related data
297+
$crawl->delete();
298+
```
299+
300+
Note that it is not necessary to issue a `call()` after these methods.
301+
302+
If you would like to extract the generated API call URL for these instant-call actions, pass in the parameter `false`, like so:
303+
304+
```
305+
$crawl->delete(false);
306+
```
307+
308+
You can then save the URL for your convenience and call `call` when ready to execute (if at all).
309+
310+
```
311+
$url = $crawl->buildUrl();
312+
$crawl->call();
313+
```
314+
216315
## Testing
217316

218317
Just run PHPUnit in the root folder of the cloned project.
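
The README's Crawlbot section describes actions such as `delete()` that execute immediately unless `false` is passed, in which case the URL can be built first and `call()` issued later. A minimal sketch of that pattern follows. `CrawlActionSketch`, its `$sent` log, and the endpoint string are assumptions made for this illustration and are not the library's actual `Crawl` class.

```php
<?php
// Hypothetical stand-in for illustration only; NOT the library's Crawl class.
class CrawlActionSketch
{
    private $token;
    private $name;
    private $flags = [];
    /** @var string[] URLs that would have been requested */
    public $sent = [];

    public function __construct($token, $name)
    {
        $this->token = $token;
        $this->name = $name;
    }

    // Queue the delete flag; run immediately unless $commit is false.
    public function delete($commit = true)
    {
        $this->flags['delete'] = 1;
        return $commit ? $this->call() : $this;
    }

    // Assemble the request URL from the token, job name and queued flags.
    public function buildUrl()
    {
        $query = ['token' => $this->token, 'name' => $this->name] + $this->flags;
        // The endpoint is an assumption made for this sketch.
        return 'https://api.diffbot.com/v3/crawl?' . http_build_query($query);
    }

    // A real client would send an HTTP request here; the sketch only records the URL.
    public function call()
    {
        $url = $this->buildUrl();
        $this->sent[] = $url;
        return $url;
    }
}

$crawl = new CrawlActionSketch('my_token', 'sitepoint_01');
$crawl->delete(false);     // nothing is sent yet
$url = $crawl->buildUrl(); // inspect or store the URL
$crawl->call();            // execute when ready
```

Separating "queue the flag" from "send the request" is what makes the URL inspectable before anything destructive happens.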

TODO.md

+9-4
@@ -4,16 +4,21 @@ Active todos, ordered by priority
44

55
## High
66

7-
- implement Crawlbot
8-
- implement Search API
7+
- [implement Bulk Processing Support](https://github.com/Swader/diffbot-php-client/issues/3)
8+
- [implement Search API](https://github.com/Swader/diffbot-php-client/issues/2)
99

1010
## Medium
1111

12-
- add streaming to Crawlbot - make it stream the result (it constantly grows)
13-
- implement Video API (currently beta)
12+
- [add streaming to Crawlbot](https://github.com/Swader/diffbot-php-client/issues/5)
13+
- [implement Video API](https://github.com/Swader/diffbot-php-client/issues/6) (currently beta)
14+
- [implement Webhook](https://github.com/Swader/diffbot-php-client/issues/7) for Bulk / Crawlbot completion
15+
- look into adding async support via Guzzle
16+
- consider alternative solution to 'crawl' setting in Api abstract ([#8](https://github.com/Swader/diffbot-php-client/issues/8)).
17+
- API docs needed ([#9](https://github.com/Swader/diffbot-php-client/issues/9))
1418

1519
## Low
1620

21+
- see what can be done with the [URL report](https://www.diffbot.com/dev/docs/crawl/) - some implementation options?
1722
- add more usage examples
1823
- work on PhpDoc consistency ($param type vs type $param)
1924
- get more mock responses and test against them

src/Abstracts/Api.php

+27-36
@@ -3,6 +3,7 @@
33
namespace Swader\Diffbot\Abstracts;
44

55
use Swader\Diffbot\Diffbot;
6+
use Swader\Diffbot\Traits\DiffbotAware;
67

78
/**
89
* Class Api
@@ -28,26 +29,29 @@ abstract class Api implements \Swader\Diffbot\Interfaces\Api
2829
/** @var Diffbot The parent class which spawned this one */
2930
protected $diffbot;
3031

32+
use DiffbotAware;
3133

3234
public function __construct($url)
3335
{
34-
$url = trim((string)$url);
35-
if (strlen($url) < 4) {
36-
throw new \InvalidArgumentException(
37-
'URL must be a string of at least four characters in length'
38-
);
39-
}
40-
41-
$url = (isset(parse_url($url)['scheme'])) ? $url : "http://$url";
42-
43-
$filtered_url = filter_var($url, FILTER_VALIDATE_URL);
44-
if (!$filtered_url) {
45-
throw new \InvalidArgumentException(
46-
'You provided an invalid URL: ' . $url
47-
);
36+
if (strcmp($url, 'crawl') !== 0) {
37+
$url = trim((string)$url);
38+
if (strlen($url) < 4) {
39+
throw new \InvalidArgumentException(
40+
'URL must be a string of at least four characters in length'
41+
);
42+
}
43+
44+
$url = (isset(parse_url($url)['scheme'])) ? $url : "http://$url";
45+
46+
$filtered_url = filter_var($url, FILTER_VALIDATE_URL);
47+
if (!$filtered_url) {
48+
throw new \InvalidArgumentException(
49+
'You provided an invalid URL: ' . $url
50+
);
51+
}
52+
$url = $filtered_url;
4853
}
49-
50-
$this->url = $filtered_url;
54+
$this->url = $url;
5155
}
5256

5357
/**
@@ -91,14 +95,15 @@ public function call()
9195

9296
public function buildUrl()
9397
{
94-
$url = rtrim($this->apiUrl, '/');
98+
$url = rtrim($this->apiUrl, '/').'?';
9599

96-
// Add Token
97-
$url .= '?token=' . $this->diffbot->getToken();
98-
99-
// Add URL
100-
$url .= '&url=' . urlencode($this->url);
100+
if (strcmp($this->url, 'crawl') !== 0) {
101+
// Add Token
102+
$url .= 'token=' . $this->diffbot->getToken();
101103

104+
// Add URL
105+
$url .= '&url=' . urlencode($this->url);
106+
}
102107

103108
// Add Custom Fields
104109
$fields = $this->fieldSettings;
@@ -118,18 +123,4 @@ public function buildUrl()
118123

119124
return $url;
120125
}
121-
122-
/**
123-
* Sets the Diffbot instance on the child class
124-
* Used to later fetch the token, HTTP client, EntityFactory, etc
125-
* @param Diffbot $d
126-
* @return $this
127-
*/
128-
public function registerDiffbot(Diffbot $d)
129-
{
130-
$this->diffbot = $d;
131-
132-
return $this;
133-
}
134-
135126
}
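
Taken together, the `Api.php` changes above do two things: `registerDiffbot()` moves out of the class into the `DiffbotAware` trait, and the constructor lets the literal string `crawl` bypass URL validation. The following is a minimal, self-contained sketch of that shape. `DiffbotAwareSketch`, `ApiSketch`, `getUrl()` and `getDiffbot()` are hypothetical names, the parent is left untyped because the real `Diffbot` class is not available here, and the sketch trims the input before comparing it to `crawl`, which is a small deviation from the diff.

```php
<?php
// Hypothetical names throughout; a sketch under stated assumptions, not the library's code.
trait DiffbotAwareSketch
{
    /** @var object The parent instance which spawned this one */
    protected $diffbot;

    // Same body as the registerDiffbot() method removed from Api in this diff,
    // minus the Diffbot type hint (the real class is not defined in this sketch).
    public function registerDiffbot($d)
    {
        $this->diffbot = $d;

        return $this;
    }

    public function getDiffbot()
    {
        return $this->diffbot;
    }
}

class ApiSketch
{
    use DiffbotAwareSketch;

    protected $url;

    public function __construct($url)
    {
        $url = trim((string)$url);

        // 'crawl' is a sentinel value, so it skips URL validation entirely.
        if ($url !== 'crawl') {
            if (strlen($url) < 4) {
                throw new \InvalidArgumentException(
                    'URL must be a string of at least four characters in length'
                );
            }

            // Default to http:// when no scheme was given.
            $parts = parse_url($url);
            if (!isset($parts['scheme'])) {
                $url = "http://$url";
            }

            $filtered = filter_var($url, FILTER_VALIDATE_URL);
            if (!$filtered) {
                throw new \InvalidArgumentException(
                    'You provided an invalid URL: ' . $url
                );
            }
            $url = $filtered;
        }

        $this->url = $url;
    }

    public function getUrl()
    {
        return $this->url;
    }
}
```

Because the trait carries the registration logic, any other class that needs access to the parent instance can reuse it with a single `use` statement instead of duplicating the method.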
