[fix] add relative link parsing to category generation #667
+25
−10
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Related Issues
Proposed Changes:
Newspaper expects href links to be in the form
<a href="<scheme>://<domain>.<tld>/<path>"></a>
However, links are commonly in the form
<a href="<path>"></a>
or<a href="/<path>"></a>
If urlparse doesn't find a scheme and domain, that means the href will be processed as a relative path on the website, so we should process it that way in newspaper.
How did you test it?
I ran the new york times and theverge
Results before PR:
['https://www.nytimes.com/international', 'https://nytimes.pressreader.com', 'https://www.nytimes.com/section/opinion', 'https://www.nytimes.com/section/todayspaper', 'https://www.nytimes.com/tips', 'https://www.nytimes.com/es', 'https://www.nytimes.com/ca', 'https://www.nytimes.com/crosswords', 'https://www.nytimes.com/wirecutter', 'https://www.nytimes.com/athletic', 'https://cooking.nytimes.com', 'https://www.nytimes.com/gift', 'https://www.nytimes.com/']
Results after PR:
['https://www.nytimes.com/section/education', 'https://www.nytimes.com/section/learning', 'https://www.nytimes.com/section/t-magazine', 'https://www.nytimes.com/section/reader-center', 'https://www.nytimes.com/', 'https://cooking.nytimes.com', 'https://www.nytimes.com/section/health', 'https://www.nytimes.com/section/technology', 'https://www.nytimes.com/section/us', 'https://www.nytimes.com/athletic', 'https://www.nytimes.com/es', 'https://www.nytimes.com/section/theater', 'https://www.nytimes.com/section/science', 'https://www.nytimes.com/wirecutter', 'https://www.nytimes.com/section/headway', 'https://www.nytimes.com/section/magazine', 'https://www.nytimes.com/section/travel', 'https://www.nytimes.com/section/opinion', 'https://www.nytimes.com/section/sports', 'https://www.nytimes.com/video', 'https://www.nytimes.com/ca', 'https://www.nytimes.com/crosswords', 'https://www.nytimes.com/section/corrections', 'https://www.nytimes.com/tips', 'https://www.nytimes.com/section/business', 'https://www.nytimes.com/section/obituaries', 'https://www.nytimes.com/gift-articles', 'https://www.nytimes.com/section/politics', 'https://nytimes.pressreader.com', 'https://www.nytimes.com/section/movies', 'https://www.nytimes.com/section/realestate', 'https://www.nytimes.com/gift', 'https://www.nytimes.com/section/nyregion', 'https://www.nytimes.com/section/style', 'https://www.nytimes.com/section/food', 'https://www.nytimes.com/section/todayspaper', 'https://www.nytimes.com/section/world', 'https://www.nytimes.com/section/well', 'https://www.nytimes.com/international', 'https://www.nytimes.com/section/fashion', 'https://www.nytimes.com/trending']
I also tested theverge.com
Notes for the reviewer
I also refactored the name of filter_tld to be more accurate.
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.