Add document language to metadata #224

adbar · 2022-07-19T10:40:55Z

The target_language parameter can filter documents according to their language, but it is also possible to pass this information along, based on the HTML meta and text language detectors.
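For illustration, a minimal sketch of how the language declared in the markup could be read before any text-based detection runs. It uses only the standard library; `DeclaredLanguageParser` and `declared_language` are hypothetical names, and this is not the library's actual implementation.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collect language hints declared in the markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.meta_lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <html lang="..."> is the most common declaration
        if tag == "html" and attrs.get("lang"):
            self.html_lang = attrs["lang"]
        # <meta http-equiv="content-language" content="..."> as a second hint
        elif tag == "meta":
            name = (attrs.get("http-equiv") or attrs.get("name") or "").lower()
            if name in ("content-language", "language") and attrs.get("content"):
                self.meta_lang = attrs["content"]


def declared_language(html):
    """Return a lowercase primary language code from the markup, or None."""
    parser = DeclaredLanguageParser()
    parser.feed(html)
    value = parser.html_lang or parser.meta_lang
    if not value:
        return None
    # Keep the first entry and drop the region: "en-US" -> "en", "de, fr" -> "de"
    return value.split(",")[0].strip().split("-")[0].lower() or None
```

The returned value is only what the page claims about itself, so it would still need to be checked against the extracted text.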

getorca · 2022-10-21T18:26:02Z

Feel free to cherry-pick my commit, getorca@3148d9f. I haven't written any tests for it, but none of the existing ones fail, which is strange because it now returns a value for the language attribute.

I didn't open a pull request because of the lack of tests, and because you may already be working on something.

It may be better to add the filter here, rather than running the code more or less twice if a target_language param is specified.

adbar · 2022-10-24T11:04:54Z

@getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content. I believe it would be better to apply language detection to the content by default and to rely on HTML tags only when "fast" and/or "strict" extraction is set.

As you say the code segments concerning language filtering could be regrouped, but strict filtering based on HTML meta info prevents the extraction from being run in certain cases, which saves time.
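The trade-off described here can be sketched roughly as follows. Everything except `target_language` is a hypothetical name: `extract_text` and `detect` are placeholder callables supplied by the caller, `declared` stands for the language taken from the HTML meta information, and `strict` only mirrors the "strict" mode mentioned above. This is an assumption about one possible control flow, not the actual code.

```python
def extract_with_language(html, extract_text, detect, declared=None,
                          target_language=None, strict=False):
    """Return {"text", "language"} or None if the document is filtered out."""
    # Early exit: in strict mode a mismatching declared language means
    # the (expensive) extraction is never run.
    if strict and target_language and declared and declared != target_language:
        return None

    text = extract_text(html)

    # Prefer detection on the extracted content; fall back to the markup.
    language = detect(text) if text else None
    if language is None:
        language = declared

    # Single filtering step after extraction; unknown languages are kept.
    if target_language and language is not None and language != target_language:
        return None

    return {"text": text, "language": language}
```

With this layout the detector runs once per document, and its result is both used for filtering and passed along in the output.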

Would you be interested in drafting a pull request?

getorca · 2022-10-24T15:40:26Z

> @getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content. I believe it would be better to apply language detection to the content by default and to rely on HTML tags only when "fast" and/or "strict" extraction is set.

This is a very good point. It would be interesting to run a benchmark on a sample of Common Crawl data to see how often the HTML language tag is correct. I'm sure that exists somewhere, but where?
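As a rough idea of what such a benchmark could measure, here is a small sketch that computes how often the declared language agrees with the detected one. The function name and the input format (pairs of language codes, with `None` for a missing value) are assumptions made for illustration; no real data is involved.

```python
def agreement_rate(pairs):
    """Share of documents whose declared and detected languages match.

    pairs: iterable of (declared, detected) language codes; None = missing.
    Documents with a missing value on either side are skipped.
    """
    compared = matches = 0
    for declared, detected in pairs:
        if declared is None or detected is None:
            continue
        compared += 1
        # Compare primary subtags only, so "en-US" counts as "en"
        if declared.split("-")[0].lower() == detected.split("-")[0].lower():
            matches += 1
    return matches / compared if compared else None
```

The result depends entirely on how reliable the detector is, so a hand-labelled sample would be needed to interpret it.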

> As you say the code segments concerning language filtering could be regrouped, but strict filtering based on HTML meta info prevents the extraction from being run in certain cases, which saves time.
>
> Would you be interested in drafting a pull request?

I can take a look. I'm working on benchmarking various extractors; resiliparse (https://resiliparse.chatnoir.eu/en/stable/) is extremely fast for content extraction and has sufficient metrics. I haven't benchmarked the language extraction yet, but if it's not superb, I'll take a deeper look at the code here. There is also the possibility of replacing it with resiliparse's language detection in this module.

adbar · 2022-10-24T17:30:15Z

I'm not aware of such a benchmark (HTML lang vs. actual language) but I'd also be curious.

Please keep me updated with the extraction benchmark, I'm interested!

I can also take care of the implementation, but if you have something specific in mind (like the resiliparse code you're talking about), feel free to make a PR.

semoal · 2024-02-01T08:49:05Z

Hi guys! getorca@3148d9f
It would be amazing if we could add this feature :)
If you don't mind, I can create a pull request.

adbar · 2024-02-01T11:33:02Z

Yes that sounds nice, feel free to write a PR.

adbar added the enhancement New feature or request label Jul 19, 2022

adbar referenced this issue in mediacloud/metadata-lib Jul 20, 2022

infer language from HTML metadata, fallback to guess

3231535

adbar added a commit that referenced this issue Sep 7, 2022

metadata: add language when detector is present (partly #224)

6f6acf4

adbar added this to the v2.0 milestone Sep 8, 2022

adbar self-assigned this Sep 13, 2022

adbar mentioned this issue Oct 21, 2022

Feature: Language detection #260

Closed

adbar removed this from the v2.0 milestone Nov 12, 2024
