
Releases: dnuffer/open_images_downloader

3.0

20 Nov 16:17

Add support for downloading Open Images Dataset V3 and V1 (in addition to V2).

v1.0

08 Sep 14:32

Open Images dataset downloader

This program is built for downloading, verifying, and resizing the images and metadata of the Open Images dataset (https://github.com/openimages/dataset). It is designed to run as fast as possible, taking advantage of the available hardware and bandwidth through asynchronous I/O and parallelism. Each image's size and md5 sum are validated against the values found in the dataset metadata. The download results are stored in CSV files with the same format as the original images.csv, so that subsequent use for training, etc. can proceed knowing that all the images are available. Many (over 2%) of the original images are no longer available or have changed, so these and any other failed downloads are recorded in a separate results file. If you use a program such as curl or wget to download the images, you will end up with a lot of "unavailable" PNG images, XML files, and some images that don't match the originals. This is why it's important to validate the size and md5 sum when downloading.
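The validation step described above can be sketched as follows. This is a minimal illustration in Python, not the tool's actual Scala implementation; `is_valid_download` and its parameters are hypothetical names:

```python
import hashlib
import os

def is_valid_download(path, expected_size, expected_md5):
    """Check a downloaded image against the size and md5 sum recorded
    in the dataset metadata (images.csv). Returns True only when both
    match, which filters out "unavailable" placeholder images and
    images that have changed since the metadata was generated."""
    if not os.path.isfile(path) or os.path.getsize(path) != expected_size:
        return False
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks to avoid loading large images into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5
```

Checking the size first is a cheap fast-path: only files that pass it pay the cost of hashing.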

The application is written in Scala and requires a Java JRE to run. It is distributed with a shell script and batch file generated by sbt-native-packager (http://www.scala-sbt.org/sbt-native-packager/index.html), so all you need to do is execute open_images_downloader (or open_images_downloader.bat on Windows).

The resizing functionality depends on the ImageMagick program convert, so if you want to do resizing, convert must be on the PATH. ImageMagick provides excellent resizing quality and very fast performance. It's easy to install on a Linux distribution using a package manager (e.g. apt or yum), and not too hard on most other OSes. See https://www.imagemagick.org/script/binary-releases.php.
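To illustrate how such a resize can be delegated to convert, here is a hypothetical sketch in Python. The `-resize "WxH>"` geometry (shrink only images larger than the box, preserving aspect ratio) and `-quality` are standard ImageMagick options; the exact flags the downloader passes may differ:

```python
import subprocess

def build_convert_command(src, dst, box=640, quality=None):
    """Build an ImageMagick `convert` invocation that shrinks an image
    so its longest side is at most `box` pixels, preserving aspect
    ratio. The trailing '>' tells convert to leave smaller images
    unchanged. `quality`, when given, maps to convert's -quality flag."""
    cmd = ["convert", src, "-resize", f"{box}x{box}>"]
    if quality is not None:
        cmd += ["-quality", str(quality)]
    return cmd + [dst]

def resize_image(src, dst, box=640, quality=None):
    # Raises CalledProcessError if convert fails (e.g. corrupt input).
    subprocess.run(build_convert_command(src, dst, box, quality), check=True)
```

Separating command construction from execution also makes the flag logic easy to test without ImageMagick installed.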

This is a command-line application. If you're running it on a server, I'd recommend using screen or tmux so that it continues running if the SSH connection is interrupted.

The code is written in a portable manner, but I haven't tested it on any OS besides Ubuntu Linux. If you use a different OS and run into issues, let me know by opening an issue on GitHub and I'll do my best to help you out.

This program is flexible and supports a number of use cases depending on how much storage you want to use and how you want to use the data. Storing the original images is optional, as are resizing and storing the resized images, and downloading and extracting the metadata. If the original images are already present locally from a previous download, they are used as the source for resizing, and a resize is skipped whenever a target file with size > 0 already exists. As a result, the program can be interrupted and restarted and will resume where it left off, and if you keep the original images locally you can re-resize everything with different parameters without re-downloading any images.
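The resume check described above amounts to a simple rule: a resized file with size > 0 is treated as already done. A minimal sketch (the function name is hypothetical, not from the tool):

```python
import os

def needs_resize(resized_path):
    """Return True if the resized image still has to be produced.
    A file that exists with size > 0 is treated as complete, so an
    interrupted run can be restarted and skip finished work."""
    return not (os.path.isfile(resized_path)
                and os.path.getsize(resized_path) > 0)
```

The size > 0 test matters because an interrupt can leave behind an empty, partially created file that should be redone.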

Example usages

If you want to minimize the amount of space used, store only small 224x224 images compressed at JPEG quality 50, and use less bandwidth by downloading the 300K URLs, use the following command line options:

$ open_images_downloader --nodownload-metadata --download-300k \
    --resize-mode FillCrop --resize-compression-quality 50  

If you want to save the images with a max side of 640 at the original aspect ratio and original JPEG quality, and use less bandwidth by downloading the 300K URLs, use the following command line options. Note that the 300K images don't look as nice as the original images resized by ImageMagick. The 300K URLs return images that are 640 pixels on the largest side, so the resize step only changes images that are larger than 640. Not all images have 300K URLs; in that case the original URL is used and those images are resized.

$ open_images_downloader --nodownload-metadata --download-300k \
    --resize-box-size 640

If you want to download and save all the original images and metadata, and also resize them to 1024 max side, and save them in a subdirectory named images-resized-1024:

$ open_images_downloader --save-original-images --resize-box-size 1024 \
    --resized-images-subdirectory images-resized-1024

Command Line Options

There are also options for controlling how many concurrent HTTP connections are made. (Don't worry about Flickr: it can easily handle a few hundred connections from a single system downloading as fast as possible, and you won't be blocked for "abuse".) You may want to lower these limits to reduce the impact on your local network (you don't want your kids complaining that Netflix is "buffering" and looking all blocky, do you?!?)
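The two connection limits (--max-host-connections and --max-total-connections) compose naturally as a pair of semaphores: a request must acquire both the global cap and its host's cap before it runs. A sketch of that idea in Python asyncio; the tool itself is Scala, and `session_get` stands in for a real HTTP client call:

```python
import asyncio
from urllib.parse import urlparse

async def fetch(url, session_get, host_sem, total_sem):
    """Download one URL while respecting both the per-host and the
    global connection limit. `session_get` is a stand-in for a real
    async HTTP client call (e.g. aiohttp's session.get)."""
    async with total_sem, host_sem:
        return await session_get(url)

async def download_all(urls, session_get, max_host=5, max_total=128):
    # One semaphore per host, plus one global cap on parallel requests.
    # Defaults mirror the tool's documented defaults of 5 and 128.
    host_sems = {}
    total_sem = asyncio.Semaphore(max_total)
    tasks = []
    for url in urls:
        host = urlparse(url).netloc
        sem = host_sems.setdefault(host, asyncio.Semaphore(max_host))
        tasks.append(fetch(url, session_get, sem, total_sem))
    return await asyncio.gather(*tasks)
```

Because `gather` preserves input order, results line up with the URL list even though downloads complete out of order.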

Here is the complete command line help:

open_images_downloader 1.0 by Dan Nuffer
Usage: open_images_downloader[.bat] [OPTION]...

Options:

      --check-md5-if-exists                   If an image already exists locally
                                              in <image dir> and is the same
                                              size as the original, check the
                                              md5 sum of the file to determine
                                              whether to download it. Default is
                                              on
      --nocheck-md5-if-exists
      --download-300k                         Download the image from the url in
                                              the Thumbnail300KURL field. This
                                              disables verifying the size and
                                              md5 hash and results in lower
                                              quality images, but may be much
                                              faster and use less bandwidth and
                                              storage space. These are resized
                                              to a max dim of 640, so if you use
                                              --resize-mode=ShrinkToFit and
                                              --resize-box-size=640 you can get
                                              a full consistently sized set of
                                              images. For the few images that
                                              don't have a 300K url the original
                                              is downloaded and needs to be
                                              resized. Default is off
      --nodownload-300k
      --download-images                       Download and extract
                                              images_2017_07.tar.gz and all
                                              images. Default is on
      --nodownload-images
      --download-metadata                     Download and extract the metadata
                                              files (annotations and classes).
                                              Default is on
      --nodownload-metadata
      --http-pipelining-limit  <arg>          The maximum number of parallel
                                              pipelined http requests per
                                              connection. Default is 4
      --log-file  <arg>                       Write a log to <file>. Default is
                                              to not write a log
      --log-to-stdout                         Write the log to stdout. Default
                                              is on
      --nolog-to-stdout
      --max-host-connections  <arg>           The maximum number of parallel
                                              connections to a single host.
                                              Default is 5
      --max-retries  <arg>                    Number of times to retry failed
                                              downloads. Default is 15.
      --max-total-connections  <arg>          The maximum number of parallel
                                              connections to all hosts. Must be
                                              a power of 2 and > 0. Default is
                                              128
      --original-images-subdirectory  <arg>   name of the subdirectory where the
                                              original images are stored.
                                              Default is images-original
      --resize-box-size  <arg>                The number of pixels used by
                                              resizing for the side of the
                                              bounding box. Default is 224
      --resize-compression-quality  <arg>     The compression quality. If
                                              specified, it will be passed with
                                              the -quality option to imagemagick
                                              convert. See
                                              https://www.imagemagick.org/script/command-line-options.php#quality
                                              for the meaning of different
                                              values and defaults for various
                                              output formats. If unspecified,
                                              -quality will not be passed and
                                              imagemagick will use its default
      --resize-images                         Resize images. Default is on
      --noresize-images
      --resize-mode  <arg>                    One of ShrinkToFit, FillCrop, or
                                              FillDistort. ShrinkToFit will
                                              resize images larger than the
                                              specified size of bounding box,
                                              preserving aspect ratio. Smaller...