
(feat) Add support for robots.txt by default #668


Open
wants to merge 1 commit into master

Conversation


@BRNMan BRNMan commented Apr 3, 2025

Related Issues

N/A

Proposed Changes:

This PR adds support for robots.txt by default. It uses the Protego library to parse a site's robots.txt in the source.download method. All subsequent downloads must be allowed by robots.txt to go through. If no robots.txt is found, all requests pass.
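A rough sketch of that check (the function names and the requests-based fetch are illustrative assumptions; only the use of Protego and the "missing robots.txt means everything passes" behaviour come from this PR):

```python
from urllib.parse import urlparse

import requests
from protego import Protego


def fetch_robots_parser(source_url: str, user_agent: str):
    """Fetch and parse robots.txt for the source's host; return None if unavailable."""
    parsed = urlparse(source_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        response = requests.get(robots_url, headers={"User-Agent": user_agent}, timeout=10)
    except requests.RequestException:
        return None  # robots.txt unreachable -> treat every request as allowed
    if response.status_code != 200:
        return None  # no robots.txt -> treat every request as allowed
    return Protego.parse(response.text)


def is_allowed(parser, url: str, user_agent: str) -> bool:
    """If robots.txt is missing, all requests pass; otherwise ask Protego."""
    return True if parser is None else parser.can_fetch(url, user_agent)
```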

If your bot is disallowed for the source URL, we raise a RobotsException. If it is disallowed for an individual article URL, we only print a warning and continue; no exception is raised.
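Roughly, the behaviour looks like this (RobotsException is the exception mentioned above; the helper names and signatures are assumptions for illustration, not the exact PR code):

```python
import logging

from protego import Protego

logger = logging.getLogger(__name__)


class RobotsException(Exception):
    """Raised when robots.txt disallows the bot for the source URL."""


def ensure_source_allowed(robots: Protego, source_url: str, user_agent: str) -> None:
    # Disallowed source URL -> hard failure.
    if not robots.can_fetch(source_url, user_agent):
        raise RobotsException(f"robots.txt disallows '{user_agent}' for {source_url}")


def article_allowed(robots: Protego, article_url: str, user_agent: str) -> bool:
    # Disallowed article URL -> warn, skip it, and keep going.
    if not robots.can_fetch(article_url, user_agent):
        logger.warning("Skipping %s: disallowed by robots.txt", article_url)
        return False
    return True
```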

This should help people avoid being banned by news sites for scraping.

If you still want to ignore robots.txt, there is a new source.config option called dont_obey_robotstxt that you can set to True. This option is False by default.
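For example, it could be set like any other config attribute (the option name comes from this PR; the Config/build usage below follows newspaper's existing configuration pattern and is an assumption, not the exact code from this change):

```python
import newspaper

config = newspaper.Config()
config.dont_obey_robotstxt = True  # opt out of the default robots.txt check
source = newspaper.build("https://example.com", config=config)
```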

How did you test it?

I added a unit test for robots.txt using a user agent that sites commonly disallow, and regression-tested with the existing test suite.
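Along these lines (a minimal sketch at the Protego level; the actual test exercises the library's download path, and GPTBot is just an example of a commonly disallowed user agent):

```python
from protego import Protego


def test_disallowed_user_agent_is_blocked():
    robots_txt = "User-agent: GPTBot\nDisallow: /\n"
    robots = Protego.parse(robots_txt)
    # The disallowed agent is blocked; an unlisted agent falls through to allow.
    assert not robots.can_fetch("https://example.com/article", "GPTBot")
    assert robots.can_fetch("https://example.com/article", "SomeOtherBot")
```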

Notes for the reviewer

Checklist

  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue
