
[fix] add relative link parsing to category generation #667


Open
wants to merge 1 commit into
base: master
from

Conversation

BRNMan

@BRNMan BRNMan commented Apr 1, 2025

Related Issues

Proposed Changes:

Newspaper expects href links to be absolute, in the form <a href="<scheme>://<domain>.<tld>/<path>"></a>.
However, links are commonly relative, in the form <a href="<path>"></a> or <a href="/<path>"></a>.

If urlparse doesn't find a scheme and domain in an href, a browser treats that href as a path relative to the current website, so newspaper should resolve it the same way.
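
To illustrate the idea, here is a minimal sketch of that check using only the standard library. It is not the code in this PR: the helper name `resolve_href` and its parameters are hypothetical, and it assumes the source's base URL is already known.

```python
from urllib.parse import urljoin, urlparse


def resolve_href(href: str, source_url: str) -> str:
    """Return an absolute URL for href (hypothetical helper, not newspaper's API).

    If href already carries both a scheme and a domain it is returned
    unchanged; otherwise it is resolved relative to source_url.
    """
    parsed = urlparse(href)
    if parsed.scheme and parsed.netloc:
        # Already absolute, e.g. https://cooking.nytimes.com
        return href
    # Relative path such as "section/world" or "/section/world"
    return urljoin(source_url, href)
```

With `source_url="https://www.nytimes.com/"`, both `section/world` and `/section/world` resolve to `https://www.nytimes.com/section/world`, while an absolute link to another domain passes through untouched.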

How did you test it?

I ran the following script against The New York Times (nytimes.com) and The Verge (theverge.com):

import newspaper
import logging

# Set up the logger
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler()
    ]
)

config = newspaper.Config()
config.disable_category_cache = True
config.memorize_articles = False
nyt_paper = newspaper.Source('https://www.nytimes.com', config=config)
nyt_paper.build()
print(nyt_paper.category_urls())

Results before PR:
['https://www.nytimes.com/international', 'https://nytimes.pressreader.com', 'https://www.nytimes.com/section/opinion', 'https://www.nytimes.com/section/todayspaper', 'https://www.nytimes.com/tips', 'https://www.nytimes.com/es', 'https://www.nytimes.com/ca', 'https://www.nytimes.com/crosswords', 'https://www.nytimes.com/wirecutter', 'https://www.nytimes.com/athletic', 'https://cooking.nytimes.com', 'https://www.nytimes.com/gift', 'https://www.nytimes.com/']

Results after PR:
['https://www.nytimes.com/section/education', 'https://www.nytimes.com/section/learning', 'https://www.nytimes.com/section/t-magazine', 'https://www.nytimes.com/section/reader-center', 'https://www.nytimes.com/', 'https://cooking.nytimes.com', 'https://www.nytimes.com/section/health', 'https://www.nytimes.com/section/technology', 'https://www.nytimes.com/section/us', 'https://www.nytimes.com/athletic', 'https://www.nytimes.com/es', 'https://www.nytimes.com/section/theater', 'https://www.nytimes.com/section/science', 'https://www.nytimes.com/wirecutter', 'https://www.nytimes.com/section/headway', 'https://www.nytimes.com/section/magazine', 'https://www.nytimes.com/section/travel', 'https://www.nytimes.com/section/opinion', 'https://www.nytimes.com/section/sports', 'https://www.nytimes.com/video', 'https://www.nytimes.com/ca', 'https://www.nytimes.com/crosswords', 'https://www.nytimes.com/section/corrections', 'https://www.nytimes.com/tips', 'https://www.nytimes.com/section/business', 'https://www.nytimes.com/section/obituaries', 'https://www.nytimes.com/gift-articles', 'https://www.nytimes.com/section/politics', 'https://nytimes.pressreader.com', 'https://www.nytimes.com/section/movies', 'https://www.nytimes.com/section/realestate', 'https://www.nytimes.com/gift', 'https://www.nytimes.com/section/nyregion', 'https://www.nytimes.com/section/style', 'https://www.nytimes.com/section/food', 'https://www.nytimes.com/section/todayspaper', 'https://www.nytimes.com/section/world', 'https://www.nytimes.com/section/well', 'https://www.nytimes.com/international', 'https://www.nytimes.com/section/fashion', 'https://www.nytimes.com/trending']

I also ran the same script against theverge.com.

Notes for the reviewer

I also renamed filter_tld so that its name describes what it does more accurately.

Checklist

  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@BRNMan BRNMan mentioned this pull request Apr 1, 2025
Development

Successfully merging this pull request may close these issues.

[BUG] Relative linked categories aren't recognized
1 participant