- Setup environment using `Pipenv` (Python 3.9)
- Add `.env` & `.env.template` for environment variables
- Make CLI using `click` or `Typer`
- Add `mypy` tests
- Fully utilize `scrapy` Spiders (currently works with `requests`; see the Spider sketch after this list)
- Setup `cron` jobs for automatic daily scraping
- Add type hints
- Stylize code to a maximum of 80 characters per row
- Convert all `str.format()` syntax to f-strings (`f"{}"`)
- Setup proper architecture for: Scraper - DB - DAE - API - Dashboard
- Add docstrings
- When saving to `.json`, stylize/prettify the content (see the JSON-saving sketch after this list)
- Correct the encoding for Armenian characters (UTF-8)
- Add URL to extracted data fields
- Add Company URL to extracted data fields
- Add `Foundation_date` and `Telephone` fields to companies
- Scrape all companies so that new companies can be detected properly
- Add functionality for scraping company information
- Change data storing directory (currently in `notebooks/`)
- Add progress bars
- Utilize `collections.defaultdict` (see the `defaultdict` sketch after this list)
- Utilize `urllib.parse.urljoin` for joining the base URL with relative page paths (see the `urljoin` sketch after this list)
- Need to update the `Job_views` field daily
- Implement saving to `.csv` functionality
- Save logs for the daily crawling (appending messages in the `main` function)
- Need to check previously scraped data to avoid duplicate crawling (see the duplicate-check sketch after this list)
- Add summarizing of daily logs with the following fields (see the log-summary sketch after this list): `{"date": {"date": "datetime", "weekday": "str", "postings": "int", "new_postings": "int", "new_companies": "int"}}`
- Change behaviour of the crawler to store the `new h3 field` message's info in the `Additional_Info` field, instead of printing it in the console
- Change the `tqdm` message so that it prints the URL being scraped (see the `tqdm` sketch after this list)
- Email notification if scraping fails for some reason
- Fix company title fetching (currently `None`)
- Store `int`-s & `float`-s properly in `.json` files (currently stored as `str`)
- Recover URLs for previously scraped data
- Fix company data storing (`list.extend()` instead of `list.append()`)
- Finalize Company info crawling in the `main()` function
- Fix bug related to `crawl_all_companies()` output (`list` instead of `dict`)
- Fix Company `Info` field scraping (appends all companies together)
- Strip scraped `str` data (e.g. `Company_Title`)
- Fix scraping of the `Additional_information` field
- Setup RDBMS or NoSQL (`PostgreSQL`/`SQLite` or `MongoDB`/`Redis`)
- Setup ORM (Object Relational Mapper: `SQLAlchemy.orm`/`PeeWee`; see the `SQLAlchemy` ORM sketch after this list)
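
For the `scrapy` Spiders item, a minimal sketch of what a listing Spider could look like. The spider name, start URL, and CSS selectors are placeholders, not the project's actual ones.

```python
import scrapy


class JobSpider(scrapy.Spider):
    """Minimal sketch of a Scrapy spider for job postings.

    The start URL and CSS selectors are placeholders and must be adapted
    to the real site markup.
    """

    name = "jobs"
    start_urls = ["https://example.com/jobs"]  # placeholder URL

    def parse(self, response):
        # Yield one item per posting on the listing page.
        for posting in response.css("div.job-posting"):  # placeholder selector
            yield {
                "Title": posting.css("h2::text").get(default="").strip(),
                "URL": response.urljoin(posting.css("a::attr(href)").get()),
            }

        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

With a recent Scrapy version this can be tried standalone via `scrapy runspider spider.py -O jobs.json` before being wired into a full project.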
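For the prettified `.json` output, the UTF-8 fix, and storing `int`-s/`float`-s as real numbers, a small sketch under assumed file and field names: `ensure_ascii=False` keeps non-ASCII text readable, `indent` prettifies, and casting numeric strings before dumping stores them as numbers.

```python
import json
from pathlib import Path


def normalize_number(value: str):
    """Convert a numeric string to int or float; leave other strings untouched."""
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


# Illustrative record; real field names come from the scraper.
posting = {"Title": "Ծրագրավորող", "Job_views": "1248", "Salary": "450000.0"}
posting = {key: normalize_number(val) if isinstance(val, str) else val
           for key, val in posting.items()}

out_path = Path("data/postings.json")  # illustrative path
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8") as fh:
    # ensure_ascii=False keeps Armenian text readable; indent=4 prettifies.
    json.dump(posting, fh, ensure_ascii=False, indent=4)
```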
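For the `collections.defaultdict` item, one place it tends to help is grouping postings by company without key-existence checks. The data shape below is assumed.

```python
from collections import defaultdict

# Assumed shape: each posting is a dict with a "Company_Title" key.
postings = [
    {"Company_Title": "Acme LLC", "Title": "Data Engineer"},
    {"Company_Title": "Acme LLC", "Title": "QA Engineer"},
    {"Company_Title": "Globex", "Title": "Backend Developer"},
]

postings_by_company = defaultdict(list)
for posting in postings:
    # No need to check whether the key exists before appending.
    postings_by_company[posting["Company_Title"]].append(posting)

print({company: len(jobs) for company, jobs in postings_by_company.items()})
# {'Acme LLC': 2, 'Globex': 1}
```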
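For the `urllib.parse.urljoin` item, a quick sketch of joining the base URL with relative hrefs instead of concatenating strings. The URLs are placeholders.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/jobs/"  # placeholder base URL

# Relative links as they might appear in scraped href attributes.
relative_links = ["1234-python-developer", "/company/acme", "?page=2"]

absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
# ['https://example.com/jobs/1234-python-developer',
#  'https://example.com/company/acme',
#  'https://example.com/jobs/?page=2']
```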
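For the duplicate-crawling check, one simple approach is to load the URLs already present in the previous `.json` dump and skip them. The file location and the `URL` field are assumptions based on the items above.

```python
import json
from pathlib import Path

previous_file = Path("data/postings.json")  # assumed location of earlier results


def load_seen_urls(path: Path) -> set:
    """Collect URLs from a previous scrape so they are not crawled again."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        previous = json.load(fh)  # assumed: a list of posting dicts
    return {posting["URL"] for posting in previous if "URL" in posting}


seen_urls = load_seen_urls(previous_file)
candidate_urls = ["https://example.com/jobs/1", "https://example.com/jobs/2"]
new_urls = [url for url in candidate_urls if url not in seen_urls]
print(f"{len(new_urls)} new URL(s) to crawl")
```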
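For the daily log summary, a sketch of building the structure shown in that item, keyed by date. The counts are placeholders that would come from the day's crawl, and the file location is assumed.

```python
import json
from datetime import date
from pathlib import Path

summary_file = Path("logs/daily_summary.json")  # assumed location
summary_file.parent.mkdir(parents=True, exist_ok=True)

# Load the existing summary (if any) so each run appends one entry.
summary = {}
if summary_file.exists():
    summary = json.loads(summary_file.read_text(encoding="utf-8"))

today = date.today()
summary[today.isoformat()] = {
    "date": today.isoformat(),
    "weekday": today.strftime("%A"),
    "postings": 120,        # placeholder: total postings seen today
    "new_postings": 7,      # placeholder: postings not seen before
    "new_companies": 2,     # placeholder: companies not seen before
}

summary_file.write_text(
    json.dumps(summary, ensure_ascii=False, indent=4), encoding="utf-8"
)
```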
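For the `tqdm` message item, `set_description` can show the URL currently being scraped instead of a generic bar label. The URL list is a placeholder.

```python
from tqdm import tqdm

urls = [f"https://example.com/jobs/{i}" for i in range(1, 4)]  # placeholder URLs

progress = tqdm(urls, unit="page")
for url in progress:
    # Show which URL is being scraped on the progress bar.
    progress.set_description(f"Scraping {url}")
    # ... fetch and parse the page here ...
```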
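For the ORM item, a minimal `SQLAlchemy` (1.4+) sketch with a single `Company` table backed by `SQLite`. The table and column names are guesses based on the fields mentioned above, not the project's actual schema.

```python
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Company(Base):
    """Assumed schema sketch; columns mirror the scraped fields listed above."""

    __tablename__ = "companies"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    url = Column(String, unique=True)
    telephone = Column(String)
    foundation_date = Column(Date)


# SQLite keeps the sketch self-contained; swap the URL for PostgreSQL later.
engine = create_engine("sqlite:///scraper.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Company(title="Acme LLC", url="https://example.com/company/acme"))
    session.commit()
```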