List of features to add (TODO)

Development Tools

Set up the environment using Pipenv (Python 3.9)
Add .env & .env.template for environment variables
Build a CLI using click or Typer
Add mypy checks
Fully utilize Scrapy spiders (currently works with requests)
Set up cron jobs for automatic daily scraping
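
A minimal sketch of how a .env.template could be read, using only the standard library. The key names (DATA_DIR, SMTP_HOST) and the parse_env helper are hypothetical; in practice a package such as python-dotenv would probably do this job.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical template content: keys are listed, secrets are left empty.
template = """
# .env.template
DATA_DIR=data/
SMTP_HOST=
"""
settings = parse_env(template)
```

The real .env would carry the filled-in values and stay out of version control, while .env.template documents which keys are expected.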

Architecture and code optimization

Add type hints
Format code to a maximum of 80 characters per line
Convert all str.format() calls to f-strings (f"{}")
Set up a proper architecture for: Scraper - DB - DAE - API - Dashboard
Add docstrings
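
As an illustration of the items above, here is a small sketch that combines type hints, a docstring, and an f-string in one function. The function name and fields are hypothetical and not taken from the project code.

```python
def format_posting(title: str, company: str, views: int) -> str:
    """Return a one-line summary of a job posting.

    Args:
        title: Job title as scraped.
        company: Company name as scraped.
        views: Number of views of the posting.
    """
    # Before: "{} at {} ({} views)".format(title, company, views)
    return f"{title} at {company} ({views} views)"
```

With hints like these in place, mypy can flag a call such as format_posting("Dev", "Acme", "12") as a type error.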

Functionalities & changes

When saving to .json, prettify the content
Correct the encoding for Armenian characters (UTF-8)
Add URL to extracted data fields
Add Company URL to extracted data fields
Add Foundation_date and Telephone fields to companies
Scrape all companies so that new companies can be detected properly
Add functionality for scraping company information
Change data storing directory (currently in notebooks/)
Add progress bars
Utilize collections.defaultdict
Utilize urllib.parse.urljoin for base URL and relative pages' joining
Update the Job_views field daily
Implement saving to .csv
Save logs for each daily crawl (append messages in the main function)
Check previously scraped data to avoid duplicate crawling
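
Several of the items above (prettified .json, UTF-8 output, urllib.parse.urljoin) can be sketched together with the standard library. BASE_URL, the helper names, and the sample record are assumptions made for illustration only.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urljoin

# Hypothetical base URL; the real one is whatever site the scraper targets.
BASE_URL = "https://example.com/en/jobs/"


def absolute_url(relative: str) -> str:
    """Join a relative link found on a page with the base URL."""
    return urljoin(BASE_URL, relative)


def to_pretty_json(records: List[Dict[str, Any]]) -> str:
    """Serialize records with indentation, keeping non-ASCII text readable."""
    # ensure_ascii=False writes characters as-is instead of \uXXXX escapes.
    return json.dumps(records, ensure_ascii=False, indent=4)


record = {"Company_Title": "Հայաստան", "URL": absolute_url("/en/jobs/123")}
pretty = to_pretty_json([record])
```

When writing the result to disk, opening the file with encoding="utf-8" keeps the output independent of the platform default encoding.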

Add daily summary logs with the following fields:

{
    "date": {
        "date": "datetime",
        "weekday": "str",
        "postings": "int",
        "new_postings": "int",
        "new_companies": "int"
    }
}
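
One possible way to build such an entry is sketched below. It assumes postings and companies are identified by URL and that the previously seen URLs are available as sets; the function and parameter names are hypothetical.

```python
from datetime import date
from typing import Any, Dict, Set

# Fixed names avoid any dependence on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def summarize_day(day: date,
                  postings: Set[str], known_postings: Set[str],
                  companies: Set[str], known_companies: Set[str],
                  ) -> Dict[str, Dict[str, Any]]:
    """Build one daily summary entry keyed by the ISO date."""
    key = day.isoformat()
    return {
        key: {
            "date": key,
            "weekday": WEEKDAYS[day.weekday()],
            "postings": len(postings),
            # Set difference keeps only items not seen in earlier crawls.
            "new_postings": len(postings - known_postings),
            "new_companies": len(companies - known_companies),
        }
    }


summary = summarize_day(date(2024, 1, 1),
                        {"/jobs/1", "/jobs/2"}, {"/jobs/1"},
                        {"/company/a"}, {"/company/a"})
```

The same set difference could also drive the duplicate check, so that only unseen URLs are crawled again.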

Change the crawler to store the text of new h3 fields in the Additional_Info field instead of printing it to the console
Change the tqdm message so that it shows the URL being scraped
Send an email notification if scraping fails
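
A rough sketch of the failure notification using the standard library's email and smtplib modules. The addresses, the SMTP host and port, and both helper names are placeholders, not project settings.

```python
import smtplib
from email.message import EmailMessage


def build_failure_email(error: Exception, url: str) -> EmailMessage:
    """Compose a plain-text message describing a failed scrape."""
    msg = EmailMessage()
    msg["Subject"] = f"Scraping failed: {type(error).__name__}"
    msg["From"] = "scraper@example.com"   # placeholder sender
    msg["To"] = "admin@example.com"       # placeholder recipient
    msg.set_content(f"Scraping {url} failed with: {error}")
    return msg


def send_email(msg: EmailMessage, host: str = "localhost",
               port: int = 25) -> None:
    """Send the message through an SMTP server (host and port assumed)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


message = build_failure_email(ValueError("timeout"),
                              "https://example.com/jobs")
```

In main(), the call to send_email would sit in an except block around the crawl so that any unhandled error triggers a message.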

Bugs

Fix company title fetching (currently None)
Store ints and floats as numbers in .json files (currently stored as str)
Recover URLs for previously scraped data
Fix company data storing (list.extend() instead of list.append())
Finalize Company info crawling in main() function
Fix bug related to crawl_all_companies() output (list instead of dict)
Fix Company Info field scraping (appends all companies together)
Strip scraped str data (e.g. Company_Title)
Fix scraping of Additional_information field
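
Two of these bugs can be illustrated in a few lines: stripping scraped strings while casting numeric fields, and the difference between list.append() and list.extend(). The NUMERIC_FIELDS mapping and clean_record helper are assumptions; which list method is correct depends on whether the crawler returns one record or a list of records.

```python
from typing import Any, Callable, Dict, List

# Assumed mapping of fields that should be stored as numbers.
NUMERIC_FIELDS: Dict[str, Callable[[str], Any]] = {"Job_views": int}


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Strip string values and cast known numeric fields."""
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            cast = NUMERIC_FIELDS.get(key)
            if cast is not None:
                try:
                    value = cast(value)
                except ValueError:
                    pass  # keep the stripped string if it is not numeric
        cleaned[key] = value
    return cleaned


batch: List[Dict[str, Any]] = [{"Company_Title": "A"},
                               {"Company_Title": "B"}]
nested: List[Any] = []
nested.append(batch)   # adds the whole list as a single item
flat: List[Any] = []
flat.extend(batch)     # adds each record as its own item
```

Casting only named fields avoids surprises such as a company title that happens to parse as a number.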

Database

Set up an RDBMS or NoSQL store (PostgreSQL/SQLite or MongoDB/Redis)
Set up an ORM (Object Relational Mapper: SQLAlchemy.orm/PeeWee)
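
If SQLite is chosen, a first schema might look like the sketch below, written with the standard sqlite3 module rather than an ORM. The table and column names are assumptions loosely based on the fields mentioned in this list.

```python
import sqlite3

# Assumed schema; column names mirror fields named elsewhere in this list.
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    url TEXT UNIQUE,
    foundation_date TEXT,
    telephone TEXT
);
CREATE TABLE IF NOT EXISTS postings (
    id INTEGER PRIMARY KEY,
    company_id INTEGER REFERENCES companies(id),
    url TEXT UNIQUE,
    job_views INTEGER,
    additional_info TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_company(conn: sqlite3.Connection, title: str, url: str) -> bool:
    """Insert a company; return False if its URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO companies (title, url) VALUES (?, ?)",
        (title, url),
    )
    conn.commit()
    return cur.rowcount == 1
```

The UNIQUE constraint on url gives duplicate detection for free: add_company returns True only for companies not seen before.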
