Add document language to metadata #224
Feel free to cherry-pick my commit, getorca@3148d9f. I haven't written any tests for it, but none of the existing ones fail, which is strange because it now returns a value for the language attr. I didn't open a pull request due to the lack of tests, and you may already be working on something. It may be better to add the filter here, rather than running the code more or less twice if a target_language param is specified.
@getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content. I believe it would be better to just apply language detection on the content, and to rely on HTML tags only when "fast" and/or "strict" extraction is set. As you say, the code segments concerning language filtering could be regrouped, but strict filtering based on HTML meta info prevents the extraction from being run in certain cases, which saves time. Would you be interested in drafting a pull request?
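For reference, reading the language declared in the markup could look roughly like the sketch below. It uses only the standard library, and the names `LangAttrParser` and `declared_language` are hypothetical helpers for illustration, not functions from this project or from the linked commit. It assumes the hint lives either in the `lang` attribute of `<html>` or in a `<meta http-equiv="content-language">` tag.

```python
from html.parser import HTMLParser
from typing import Optional


class LangAttrParser(HTMLParser):
    """Collects language hints declared in the markup (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.html_lang: Optional[str] = None
        self.meta_lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr = dict(attrs)
        if tag == "html" and attr.get("lang"):
            self.html_lang = attr["lang"]
        elif tag == "meta":
            # e.g. <meta http-equiv="Content-Language" content="en">
            if (attr.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attr.get("content")


def declared_language(html: str) -> Optional[str]:
    """Return the primary language subtag declared in the HTML, if any."""
    parser = LangAttrParser()
    parser.feed(html)
    parser.close()
    raw = parser.html_lang or parser.meta_lang
    if not raw:
        return None
    # Keep the first listed language and reduce "en-US" / "EN_us" to "en".
    first = raw.split(",")[0].strip().replace("_", "-")
    return first.split("-")[0].lower() or None
```

As noted above, this value only reflects what the page claims about itself, so it is best treated as a cheap hint rather than as ground truth.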
This is a very good point. It would be interesting to run a benchmark on a sample of Common Crawl data to see how often the HTML language tag is correct. I'm sure that exists somewhere, but where?
I can take a look. I'm working on benchmarking various extractors; resiliparse (https://resiliparse.chatnoir.eu/en/stable/) is extremely fast for content extraction and has sufficient metrics. I haven't benchmarked the language extraction yet, but if it's not superb, I'll take a deeper look at the code here. There is also the possibility of replacing it with resiliparse's lang detection in this module.
I'm not aware of such a benchmark (HTML lang vs. actual language), but I'd also be curious. Please keep me updated on the extraction benchmark, I'm interested! I can also take care of the implementation, but if you have something specific in mind (like the resiliparse code you're talking about), feel free to make a PR.
Hi guys! getorca@3148d9f
Yes, that sounds nice, feel free to write a PR.
The target_language parameter can filter documents according to their language, but it is also possible to pass this information along, based on the HTML meta and text language detectors.
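To make the difference between filtering and passing the information along concrete, here is a minimal sketch. The function names `resolve_language` and `process`, the plain-dict metadata, and the precedence rule (content detection wins over the HTML declaration) are all assumptions for illustration and do not describe the project's actual implementation. The two language values are passed in as arguments, so any meta-tag reader or text detector could supply them.

```python
from typing import Optional


def resolve_language(declared: Optional[str], detected: Optional[str]) -> Optional[str]:
    """Prefer the language detected from the text; fall back to the HTML hint.

    Assumption: content-based detection is more trustworthy than meta tags.
    """
    return detected or declared


def process(
    metadata: dict,
    declared: Optional[str],
    detected: Optional[str],
    target_language: Optional[str] = None,
) -> Optional[dict]:
    """Attach the language to the metadata, or drop the document if it mismatches.

    Returns None when a target language is set and a different language is known.
    Assumption: documents whose language is unknown are kept, not filtered.
    """
    language = resolve_language(declared, detected)
    if target_language is not None and language is not None and language != target_language:
        return None
    # Pass the information along instead of only using it as a filter.
    return {**metadata, "language": language}
```

With this shape the language is computed once and then reused for both purposes, which avoids running the detection code twice when a target language is specified.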