Add document language to metadata #224

Open
adbar opened this issue Jul 19, 2022 · 6 comments
@adbar
Owner

adbar commented Jul 19, 2022

The target_language parameter can filter documents according to their language, but it is also possible to pass this information along, based on the HTML meta and text language detectors.
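As a rough illustration of the HTML side of this idea, the declared language could be read from the markup with the standard library alone. This is only a sketch, not the project's code: `declared_language` and `_DeclaredLangParser` are hypothetical names, and the choice to prefer `<html lang>` over a `content-language` meta tag is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional


class _DeclaredLangParser(HTMLParser):
    """Hypothetical helper: records the first language declared in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.html_lang: Optional[str] = None
        self.meta_lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {key: (value or "") for key, value in attrs}
        if tag == "html" and attr.get("lang") and self.html_lang is None:
            self.html_lang = attr["lang"]
        elif tag == "meta" and attr.get("http-equiv", "").lower() == "content-language":
            if attr.get("content") and self.meta_lang is None:
                self.meta_lang = attr["content"]


def declared_language(html: str) -> Optional[str]:
    """Return the primary language subtag declared in the markup, or None."""
    parser = _DeclaredLangParser()
    parser.feed(html)
    # Assumption: <html lang> takes precedence over the meta tag.
    raw = parser.html_lang or parser.meta_lang
    if not raw:
        return None
    # Keep only the first listed language and its primary subtag: "en-US, fr" -> "en"
    primary = raw.split(",")[0].strip().replace("_", "-").split("-")[0].lower()
    return primary or None
```

A value obtained this way is only what the page claims about itself; it says nothing about whether the text is actually in that language.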

@adbar adbar added the enhancement New feature or request label Jul 19, 2022
@adbar adbar added this to the v2.0 milestone Sep 8, 2022
@adbar adbar self-assigned this Sep 13, 2022
@getorca

getorca commented Oct 21, 2022

Feel free to cherry-pick my commit, getorca@3148d9f. I haven't written any tests for it, but none fail, which is strange because it returns a value for the language attr.

I didn't open a pull request due to the lack of tests, and because you may already be working on something.

It may be better to add the filter here, rather than running the code more or less twice if a target_language param is specified.

@adbar
Owner Author

adbar commented Oct 24, 2022

@getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content. I believe it would be better to just apply language detection on the content, and to apply it on HTML tags only when "fast" and/or "strict" extraction is set.

As you say the code segments concerning language filtering could be regrouped, but strict filtering based on HTML meta info prevents the extraction from being run in certain cases, which saves time.
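The precedence described above could be sketched roughly as follows. Everything here is hypothetical: `resolve_language`, the `trust_markup` flag (standing in for the "fast"/"strict" settings) and the pluggable `detect` callable are illustrative names, not existing functions or parameters of the library.

```python
from typing import Callable, Optional


def resolve_language(
    declared: Optional[str],
    text: str,
    detect: Callable[[str], Optional[str]],
    trust_markup: bool = False,
) -> Optional[str]:
    """Pick the language to report for a document.

    declared: language taken from the HTML markup, if any.
    text: extracted main content.
    detect: any text-based language detector (assumed, supplied by the caller).
    trust_markup: stands in for a "fast"/"strict" mode that accepts the markup as-is.
    """
    # In fast/strict mode the declared value is used without running detection.
    if trust_markup and declared:
        return declared
    # Otherwise the content decides; the markup is only a fallback.
    detected = detect(text) if text.strip() else None
    return detected or declared


def passes_filter(language: Optional[str], target_language: Optional[str]) -> bool:
    """Apply the target-language filter once, on the resolved value."""
    return target_language is None or language == target_language
```

With this shape, the resolved value can both be stored in the metadata and checked against the filter, so detection runs at most once per document.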

Would you be interested in drafting a pull request?

@getorca

getorca commented Oct 24, 2022

> @getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content. I believe it would be better to just apply language detection on the content, and to apply it on HTML tags only when "fast" and/or "strict" extraction is set.

This is a very good point. It would be interesting to run a benchmark on a sample of Common Crawl data to see how often the HTML language tag is correct. I'm sure that exists somewhere, but where?
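For what it's worth, the tallying part of such a benchmark could look something like the sketch below. It assumes the declared and detected languages have already been collected per page by some other means; `agreement_stats` is a hypothetical name, and no real measurements are implied.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def agreement_stats(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]]
) -> Dict[str, Optional[float]]:
    """Tally how often the declared HTML language matches the detected one.

    Each pair is (declared, detected); None means the value is missing.
    """
    counts: Counter = Counter()
    for declared, detected in pairs:
        if declared is None:
            counts["undeclared"] += 1
        elif detected is None:
            counts["undetected"] += 1
        elif declared == detected:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    # Only pages with both values can agree or disagree.
    compared = counts["match"] + counts["mismatch"]
    return {
        "total": sum(counts.values()),
        "undeclared": counts["undeclared"],
        "undetected": counts["undetected"],
        "match": counts["match"],
        "mismatch": counts["mismatch"],
        "agreement": counts["match"] / compared if compared else None,
    }
```

Pages without a declaration are counted separately rather than as mismatches, since a missing tag and a wrong tag are different failure modes.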

> As you say the code segments concerning language filtering could be regrouped, but strict filtering based on HTML meta info prevents the extraction from being run in certain cases, which saves time.
>
> Would you be interested in drafting a pull request?

I can take a look. I'm working on benchmarking various extractors; resiliparse (https://resiliparse.chatnoir.eu/en/stable/) is extremely fast for content extraction and has sufficient metrics. I haven't benchmarked the language extraction yet, but if it's not superb, I'll take a deeper look at the code here. Alternatively, there is the possibility of replacing it with resiliparse lang detection in this module.

@adbar
Owner Author

adbar commented Oct 24, 2022

I'm not aware of such a benchmark (HTML lang vs. actual language) but I'd also be curious.

Please keep me updated with the extraction benchmark, I'm interested!

I can also take care of the implementation, but if you have something specific in mind (like the resiliparse code you're talking about), feel free to make a PR.

@semoal

semoal commented Feb 1, 2024

Hi guys! getorca@3148d9f
It would be amazing if we could add this feature :)
If you don't mind, I can create a pull request.

@adbar
Owner Author

adbar commented Feb 1, 2024

Yes, that sounds nice, feel free to write a PR.

@adbar adbar removed this from the v2.0 milestone Nov 12, 2024