
Commit fe9a988

Revert "recommend using bz compression over gz"
This reverts commit ece1374. Experience with bz proved less stable than desired.
1 parent: c252fab
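In effect, the revert switches the documented pipelines from bz2 back to gz decompression. A minimal before/after sketch assembled from the commands changed in the diffs below (the P31:Q5 human filter and file names come from the README; `wget --continue` is taken from docs/parallelize.md):

```sh
# pipeline before this revert (bz2-based, introduced by ece1374):
#   wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
#   cat latest-all.json.bz2 | bzcat | wikibase-dump-filter --claim P31:Q5 > humans.ndjson

# pipeline after this revert (gz-based):
wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
cat latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
```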

6 files changed (+12 -13)

.gitignore (-1)
@@ -1,5 +1,4 @@
 node_modules
 build
 *ndjson
-*.extract.*
 *.sublime*

README.md (+3 -3)
@@ -59,10 +59,10 @@ See [CHANGELOG.md](CHANGELOG.md) for version info
 ## Download dump
 
 ### Wikidata dumps
-Wikidata provides a bunch of [database dumps](https://www.wikidata.org/wiki/Wikidata:Database_download), among which the desired [JSON dump](https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_.28recommended.29). As a Wikidata dump is a very laaarge file (September 2020: 55GB compressed), it is recommended to download that file first before doing operations on it, so that if anything crashes, you don't have to start the download from zero (the download time being usually the bottleneck).
+Wikidata provides a bunch of [database dumps](https://www.wikidata.org/wiki/Wikidata:Database_download), among which the desired [JSON dump](https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_.28recommended.29). As a Wikidata dump is a very laaarge file (April 2020: 75GB compressed), it is recommended to download that file first before doing operations on it, so that if anything crashes, you don't have to start the download from zero (the download time being usually the bottleneck).
 ```sh
-wget -C https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
-cat latest-all.json.bz2 | bzcat | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
+wget -C https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
+cat latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
 ```
 
 ### Your own Wikibase instance dump

docs/cli.md (+1 -1)
@@ -41,7 +41,7 @@ this command filters `entities_dump.json` into a subset where all lines are the
 
 * **directly from a Wikidata dump**
 ```sh
-curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 | bzcat | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
+curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
 ```
 this can be quite convinient when you don't have enough space to keep the whole decompressed dump on your disk: here you only write the desired subset.

docs/examples.md (+2 -2)
@@ -5,6 +5,6 @@
 The [equivalent SPARQL query](https://query.wikidata.org/#SELECT%20%3Fs%20%3FsLabel%20WHERE%20%7B%0A%20%20%20%20%3Farticlea%20schema%3Aabout%20%3Fs%20.%0A%20%20%20%20%3Farticlea%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fzh.wikipedia.org%2F%3E%20.%0A%20%20%20%20%3Farticleb%20schema%3Aabout%20%3Fs%20.%0A%20%20%20%20%3Farticleb%20schema%3AisPartOf%20%3Chttps%3A%2F%2Ffr.wikipedia.org%2F%3E%20.%0A%20%20%20%20%23%20The%20article%20shouldn%27t%20be%20a%20disambiguation%20page%0A%20%20%09FILTER%20NOT%20EXISTS%20%7B%20%3Fs%20wdt%3AP31%20wd%3AQ4167410.%20%7D%0A%7D%0ALIMIT%205) times out
 
 ```sh
-DUMP='https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2'
-curl $DUMP | bzcat | wikibase-dump-filter --sitelink 'zhwiki&frwiki' --keep id,labels,sitelinks --languages zh,fr --simplify > subset.ndjson
+DUMP='https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz'
+curl $DUMP | gzip -d | wikibase-dump-filter --sitelink 'zhwiki&frwiki' --keep id,labels,sitelinks --languages zh,fr --simplify > subset.ndjson
 ```

docs/parallelize.md (+5 -5)
@@ -1,19 +1,19 @@
 ## Parallelize
 If your hardware happens to have several cores, we can do better:
-* replace `bzcat` by [`pbzip2`](https://linux.die.net/man/1/pbzip2)
+* replace `gzip` by [`pigz`](https://zlib.net/pigz/)
 * load balance lines over several `wikibase-dump-filter` processes using [`load-balance-lines`](https://github.com/maxlath/load-balance-lines) or something that does the same job
 
 ```sh
 # install those new dependencies
-sudo apt-get install pbzip2
+sudo apt-get install pigz
 npm install --global load-balance-lines
 
 # increase the max RAM available to node processes, to prevent allocation errors
 NODE_OPTIONS=--max_old_space_size=4096
 
-wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
-nice -n+19 pbzip2 -cd < latest-all.json.bz2 | nice -n+19 load-balance-lines wikibase-dump-filter --claim P31:Q5 > humans.ndjson
+wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
+nice -n+19 pigz -d < latest-all.json.gz | nice -n+19 load-balance-lines wikibase-dump-filter --claim P31:Q5 > humans.ndjson
 ```
 
 Using [`nice`](http://man7.org/linux/man-pages/man1/nice.1.html) to tell the system that those processes, while eager to eat all the CPUs, should have the lowest priority.
-If you are not familiar with the `<` operator, it's the equivalent of `cat latest-all.json.bz2 | nice -n+19 pbzip2 -cd` but in a shell built-in way. (See [I/O Redirection doc](http://www.tldp.org/LDP/abs/html/io-redirection.html))
+If you are not familiar with the `<` operator, it's the equivalent of `cat latest-all.json.gz | nice -n+19 pigz -d` but in a shell built-in way. (See [I/O Redirection doc](http://www.tldp.org/LDP/abs/html/io-redirection.html))
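Not part of this commit, but a possible refinement of the parallel pipeline above: `pigz` accepts `-p <n>` to cap how many threads it may use, which can help when the machine is shared with other workloads. A sketch, assuming the thread count below is adjusted to your hardware:

```sh
# optional: limit how many threads pigz may use (here: all but one core)
THREADS=$(( $(nproc) - 1 ))
[ "$THREADS" -lt 1 ] && THREADS=1
nice -n+19 pigz -d -p "$THREADS" < latest-all.json.gz \
  | nice -n+19 load-balance-lines wikibase-dump-filter --claim P31:Q5 > humans.ndjson
```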

scripts/init_fixtures (+1 -1)
@@ -3,6 +3,6 @@
 
 mkdir -p test/fixtures
 # get the first 50 lines of the latest dump
-curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 | bzcat | head -n 50 > ./test/fixtures/entities
+curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz |gzip -d |head -n 50 > ./test/fixtures/entities
 
 cat ./test/fixtures/entities |head -n 2 |tail -n 1 |sed 's/,$//' > ./test/fixtures/entity
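A side note on the fixture script above (not part of this commit): the dump is one big JSON array with one entity per line, each line ending in a comma, so the `sed 's/,$//'` strips that comma to leave a standalone JSON document. A possible sanity check, assuming `jq` is installed:

```sh
# verify the extracted fixture is valid JSON and carries an entity id (requires jq)
jq -e '.id' ./test/fixtures/entity
```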
