(feat) Add support for robots.txt by default #668
Related Issues
N/A
Proposed Changes:
This PR adds support for robots.txt by default. It uses the Protego library to parse a source's robots.txt in the source.download method. All subsequent downloads must be allowed by robots.txt to go through; if no robots.txt is found, all requests pass.
If your bot is disallowed for the source URL, we raise a RobotsException. If your bot is disallowed for an individual article URL, we just print a warning and keep going, with no exception.
This should help people avoid being banned by news sites for scraping.
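For context, here is a minimal sketch of the Protego-based check this enables. The `RobotsException` name comes from this PR; the helper names and the fetching logic are illustrative, not the actual implementation:

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

from protego import Protego


class RobotsException(Exception):
    """Raised when robots.txt disallows crawling the source URL."""


def fetch_robots(source_url):
    """Fetch and parse a source's robots.txt; None means none was found."""
    robots_url = urljoin(source_url, "/robots.txt")
    try:
        with urllib.request.urlopen(robots_url) as resp:
            return Protego.parse(resp.read().decode("utf-8"))
    except urllib.error.URLError:
        return None  # no robots.txt found: all requests pass


def can_download(parser, url, user_agent):
    """True when robots.txt (if any) allows the user agent to fetch url."""
    return parser is None or parser.can_fetch(url, user_agent)
```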
If you still want to ignore robots.txt, there is a new source.config option called dont_obey_robotstxt that you can set to True. This option is False by default.
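Opting out might look like this (a sketch; the exact way the option reaches the source config may differ):

```python
import newspaper

# Not recommended: skip the robots.txt check entirely.
source = newspaper.build(
    "https://example-news-site.com",
    dont_obey_robotstxt=True,
)
```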
How did you test it?
I added a unit test for robots.txt using a user agent that sites commonly disallow, and I regression-tested with the existing test suite.
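For illustration, such a test could be shaped roughly like this (the user agent, site, and import path for RobotsException are assumptions, not the PR's actual test):

```python
import pytest
import newspaper
from newspaper import RobotsException  # import path is an assumption


def test_source_disallowed_by_robots_txt():
    # "GPTBot" is an example of a user agent many news sites disallow.
    with pytest.raises(RobotsException):
        newspaper.build(
            "https://example-news-site.com",
            browser_user_agent="GPTBot",
        )
```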
Notes for the reviewer
Checklist