Exercise 1: Web Crawling for Archival Data

  • Scenario: you want to download scanned documents from an archive website. There are too many PDF files to fetch one by one, so you will use a Python script to download them automatically (a minimal sketch of this kind of script is shown below).
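The notebook in this exercise already contains the download code; the sketch below only illustrates the general approach, assuming the documents are numbered PDFs served from a predictable URL (the base URL and file-naming pattern here are placeholders, not the archive's real addresses).

```python
import os
import requests

# Placeholder URL pattern -- the real one is defined in the exercise notebook.
BASE_URL = "https://example-archive.org/documents/{doc_id}.pdf"
DOWNLOAD_DIR = "python/ex1/download"

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

# Download documents 1 through 10, the range used in the exercise.
for doc_id in range(1, 11):
    url = BASE_URL.format(doc_id=doc_id)
    response = requests.get(url)
    response.raise_for_status()  # fail early if a document is missing
    out_path = os.path.join(DOWNLOAD_DIR, f"{doc_id}.pdf")
    with open(out_path, "wb") as f:
        f.write(response.content)
    print(f"Saved {out_path}")
```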

  • Executing the code

    1. In a terminal, go to python/ex1/notebook/.
    2. Type `jupyter notebook` and press Enter.
    3. The code is already there. Execute it cell by cell with Shift + Enter.
    4. The output files will be saved in python/ex1/download/
    5. Open a few of the downloaded PDF files to confirm they are valid.
  • Try changing the range of documents

    • For now, the document range is set from 1 to 10.
    • Change the range so that you download a different set of documents; keep it narrow for this exercise to save time (the snippet below shows an example).
    • Re-run the script.
    • Check the download folder to confirm that all the files were downloaded successfully.
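For example, using the hypothetical loop sketched above, downloading documents 11 through 20 only requires changing the range; note that the end value of `range()` is exclusive.

```python
# In the hypothetical download loop above, change the range:
for doc_id in range(11, 21):   # documents 11 through 20 (the end value is exclusive)
    ...                        # same download steps as before
```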
  • Exporting the notebook to an HTML file with Markdown-formatted text

    1. Place your cursor in a cell.
    2. Insert a new cell from the Insert menu.
    3. Change the new cell's type to Markdown.
    4. Type some Markdown-formatted text.
    5. In the menu bar, click File -> Download as -> HTML.
    6. Open the downloaded file in your browser. It is a plain HTML file generated automatically from the Jupyter notebook. In this way, you can turn a Python notebook into a styled document for the web (the same export can also be scripted; see the sketch below).
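If you prefer to run the export from code instead of the menu, the nbconvert library (bundled with most Jupyter installations) provides the same conversion programmatically. This is only a sketch; the notebook filename below is a placeholder, so adjust it to match the notebook in python/ex1/notebook/.

```python
import nbformat
from nbconvert import HTMLExporter

# Read the notebook file (the filename is a placeholder).
notebook = nbformat.read("notebook.ipynb", as_version=4)

# Convert the notebook to a standalone HTML document.
exporter = HTMLExporter()
body, _resources = exporter.from_notebook_node(notebook)

with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)
```

The equivalent command-line call is `jupyter nbconvert --to html notebook.ipynb`.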