TODO.md

List of features to add (TODO)


Development Tools

  • Set up the environment using Pipenv (Python 3.9)

  • Add .env & .env.template for environment variables

  • Build a CLI using click or Typer

  • Add mypy type checks

  • Fully utilize Scrapy spiders (the scraper currently uses requests)

  • Set up cron jobs for automatic daily scraping
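
As a rough illustration of the .env item above, the sketch below parses KEY=VALUE lines with the standard library only. The function names `load_env` and `apply_env` are hypothetical, and a dedicated package such as python-dotenv would normally handle quoting and edge cases more thoroughly.

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file into a dict."""
    values: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and simple quoting around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

A .env.template would then list the same keys with empty or placeholder values, so that the real .env can stay out of version control.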

Architecture and code optimization

  • Add type hints

  • Format code to a maximum of 80 characters per line

  • Convert all str.format() calls to f-strings (f"{}")

  • Set up a proper architecture for: Scraper - DB - DAE - API - Dashboard

  • Add docstrings
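
To make the style goals above concrete, here is a small sketch of a helper that combines type hints, a docstring, an f-string, and lines under 80 characters. The function `describe_posting` and its parameters are made up for illustration and are not taken from the project code.

```python
from typing import Optional


def describe_posting(
    title: str, company: str, views: Optional[int] = None
) -> str:
    """Return a one-line, human-readable summary of a job posting.

    Args:
        title: Job title as scraped.
        company: Company name as scraped.
        views: Number of views, if known.
    """
    # f-strings replace the older "{} at {}".format(title, company) style.
    summary = f"{title} at {company}"
    if views is not None:
        summary = f"{summary} ({views} views)"
    return summary
```

With annotations like these in place, mypy can flag a call such as `describe_posting("Dev", "Acme", views="12")` before the code runs.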

Functionalities & changes

  • When saving to .json, pretty-print the content

  • Correct the encoding for Armenian characters (UTF-8)

  • Add URL to extracted data fields

  • Add Company URL to extracted data fields

  • Add Foundation_date and Telephone fields to companies

  • Scrape all companies so that new companies can be detected properly

  • Add functionality for scraping company information

  • Change data storing directory (currently in notebooks/)

  • Add progress bars

  • Utilize collections.defaultdict

  • Use urllib.parse.urljoin to join the base URL with relative page paths

  • Update the Job_views field daily

  • Implement saving to .csv

  • Save logs for each daily crawl (append messages in the main function)

  • Check previously scraped data to avoid duplicate crawling

  • Add daily summary logs with the following fields:

    {
        "date": {
            "date": "datetime",
            "weekday": "str",
            "postings": "int",
            "new_postings": "int",
            "new_companies": "int"
        }
    }

  • Change the crawler to store the information from any new h3 field in the Additional_Info field instead of printing it to the console

  • Change tqdm message so that it prints the URL being scraped

  • Send an email notification if scraping fails
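
The daily summary log, duplicate check, and pretty-printed UTF-8 JSON items above could fit together roughly as follows. This is only a sketch: the function names and the "URL" and "Company_URL" keys are assumptions about how postings might be represented, and the outer key is assumed to be the ISO date.

```python
import json
from datetime import date


def build_daily_summary(
    day: date, postings: list, known_urls: set, known_companies: set
) -> dict:
    """Summarize one day's crawl using the fields listed above."""
    # A posting is new if its URL was not seen in previously scraped data.
    new_postings = [p for p in postings if p["URL"] not in known_urls]
    todays_companies = {p["Company_URL"] for p in postings}
    new_companies = todays_companies - set(known_companies)
    return {
        day.isoformat(): {
            "date": day.isoformat(),
            "weekday": day.strftime("%A"),
            "postings": len(postings),
            "new_postings": len(new_postings),
            "new_companies": len(new_companies),
        }
    }


def to_pretty_json(data: dict) -> str:
    """Serialize with indentation and without escaping non-ASCII text."""
    # indent=4 pretty-prints; ensure_ascii=False keeps characters such as
    # Armenian letters readable instead of writing \uXXXX escapes.
    return json.dumps(data, indent=4, ensure_ascii=False)
```

When writing the result to disk, opening the file with `encoding="utf-8"` keeps the non-ASCII characters intact on every platform.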

Bugs

  • Fix company title fetching (currently None)

  • Store ints and floats as numbers in .json files (currently stored as str)

  • Recover URLs for previously scraped data

  • Fix company storing data (list.extend() instead of list.append())

  • Finalize Company info crawling in main() function

  • Fix bug related to crawl_all_companies() output (list instead of dict)

  • Fix Company Info field scraping (appends all companies together)

  • Strip whitespace from scraped str data (e.g., Company_Title)

  • Fix scraping of Additional_information field
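
Two of the bugs above (values stored as str, and unstripped strings) could be handled in one cleaning pass. The sketch below is an assumption about how that might look; `to_number`, `clean_record`, and the `numeric_fields` argument are hypothetical names, not functions from the project.

```python
from typing import Union


def to_number(text: str) -> Union[int, float, str]:
    """Convert a numeric string to int or float; return it as-is otherwise."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def clean_record(record: dict, numeric_fields: set) -> dict:
    """Strip every str value and convert the fields known to be numeric."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if key in numeric_fields:
                value = to_number(value)
        cleaned[key] = value
    return cleaned


# list.append() adds its argument as ONE element, so appending a list of
# companies nests it; list.extend() adds each company separately.
page = [{"Company_Title": "A"}, {"Company_Title": "B"}]
nested: list = []
nested.append(page)  # [[{...}, {...}]]
flat: list = []
flat.extend(page)  # [{...}, {...}]
```

Conversion is limited to an explicit set of fields because a value such as a telephone number with a leading zero would lose information if it were turned into an int.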

Database

  • Set up an RDBMS or NoSQL store (PostgreSQL/SQLite or MongoDB/Redis)

  • Set up an ORM (Object-Relational Mapper: SQLAlchemy.orm/peewee)
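
If SQLite is chosen, a first schema might look like the sketch below, which uses only the standard-library sqlite3 module. The table and column names are assumptions derived from the field names mentioned in this list (URL, Company URL, Foundation_date, Telephone, Job_views), not an existing design.

```python
import sqlite3

# Hypothetical schema: one row per company and one row per job posting.
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    url TEXT PRIMARY KEY,
    title TEXT,
    foundation_date TEXT,
    telephone TEXT
);
CREATE TABLE IF NOT EXISTS postings (
    url TEXT PRIMARY KEY,
    company_url TEXT REFERENCES companies(url),
    job_views INTEGER
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_posting(
    conn: sqlite3.Connection, url: str, company_url: str, job_views: int
) -> None:
    """Insert a posting, or replace the stored row if the URL exists."""
    # INSERT OR REPLACE keeps one row per URL, so a daily re-crawl
    # refreshes job_views instead of creating duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO postings (url, company_url, job_views) "
        "VALUES (?, ?, ?)",
        (url, company_url, job_views),
    )
    conn.commit()
```

Using the posting URL as the primary key also gives a cheap duplicate check, and the same tables could later be mapped to ORM models if SQLAlchemy or peewee is adopted.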