Metadata is silently deleted? #327
Comments
When doing PDF/A conversion, Ghostscript rewrites the metadata (the whole PDF, actually). Starting in v7.4.0 I added new features to manage metadata rather than relying on Ghostscript, although I haven't entirely cut Ghostscript out when PDF/A is involved. It's news to me, but unsurprising, that Ghostscript doesn't properly replicate all of the input XMP metadata in its output. Ghostscript does have a legitimate need to change some of the metadata, however.
If we turn off Ghostscript and PDF/A conversion with
I should mention that PDF metadata is really, really messy. There are two components, one in XML and an older PDF-specific one that has to be retained for backward compatibility. They have to be synchronized as much as possible, but it's not actually possible to represent all of the same data in both. The XMP format itself is an amalgamation of multiple XML specifications and has multiple ways of representing the same information.
I will leave this open as a reminder to fix metadata handling for PDF/A. That will be complicated since it's a 3-way merge in XML of the input file, ocrmypdf's changes to the metadata, and Ghostscript's changes. |
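To make the 3-way merge mentioned above concrete, here is a minimal sketch in Python. It is a deliberate simplification: it treats metadata as flat key/value dicts, whereas real XMP is nested XML with several equivalent spellings. The function name `merge_metadata`, the "prefer ours on conflict" policy, and the sample field values are all assumptions made for illustration, not how ocrmypdf does it.

```python
def merge_metadata(original, ours, theirs):
    """Three-way merge of flat metadata dicts (illustrative only).

    original: fields read from the input file
    ours:     fields after the tool's own edits
    theirs:   fields after the PDF/A converter rewrote the file
    Returns (merged, conflicts), where conflicts lists keys that
    both sides changed in different ways.
    """
    merged, conflicts = {}, []
    for key in sorted(set(original) | set(ours) | set(theirs)):
        base = original.get(key)
        a = ours.get(key)
        b = theirs.get(key)
        if a == b:
            value = a          # both sides agree (or both deleted the field)
        elif a == base:
            value = b          # only the converter changed or deleted it
        elif b == base:
            value = a          # only our side changed or deleted it
        else:
            value = a          # both changed it differently: keep ours, flag it
            conflicts.append(key)
        if value is not None:
            merged[key] = value
    return merged, conflicts


# Made-up sample values, not taken from any real file.
original = {"dc:title": "Paper", "prism:issn": "1234-5678", "pdf:Producer": "Old"}
ours = {"dc:title": "Paper", "prism:issn": "1234-5678", "pdf:Producer": "ocrmypdf"}
theirs = {"dc:title": "Paper", "pdf:Producer": "Ghostscript"}
merged, conflicts = merge_metadata(original, ours, theirs)
print(merged, conflicts)
```

Note that a plain 3-way merge treats a deletion by one side as a legitimate change, so the field the converter dropped stays dropped. Deciding when a deletion should be honored is part of what makes the real problem hard.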
I'm not surprised it's something like that. As you say, PDF is a horrifying family of formats, backwards-compatibility layers, and hacks. But data loss can't be excused by noting that the right thing is hard: losing data silently or by default is the one thing above all which a tool like ocrmypdf must not do.
Until the metadata is preserved, can something be done, like erroring out if entire fields are dropped, or at least printing a warning? A heuristic might be to check whether the field count differs between start and finish, or whether the new metadata is X bytes smaller than the old metadata. Semantically irrelevant formatting differences and legitimate field changes shouldn't cause entire fields to be deleted or major changes in metadata size. The user is then notified, can compare before and after to decide whether it's a problem, and can re-add metadata manually if they really want the new version with the OCR/compression.
At the least, a warning in the man page & manual seems merited? |
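The heuristic proposed above could look roughly like the following Python sketch. It assumes the metadata of the input and output files has already been read into flat dicts by some other means; that step is left out. The function name `check_metadata_loss` and the `max_shrink_bytes` threshold are hypothetical, chosen only for this example.

```python
import warnings


def _size_in_bytes(fields):
    """Rough size of a metadata dict: UTF-8 length of all keys and values."""
    return sum(
        len(str(k).encode("utf-8")) + len(str(v).encode("utf-8"))
        for k, v in fields.items()
    )


def check_metadata_loss(before, after, max_shrink_bytes=256):
    """Warn if fields disappeared or the metadata shrank noticeably.

    Returns the sorted list of field names present in `before`
    but missing from `after`.
    """
    dropped = sorted(set(before) - set(after))
    shrink = _size_in_bytes(before) - _size_in_bytes(after)
    if dropped:
        warnings.warn(
            f"{len(dropped)} metadata field(s) dropped: {', '.join(dropped)}"
        )
    elif shrink > max_shrink_bytes:
        warnings.warn(f"metadata shrank by {shrink} bytes")
    return dropped


# Made-up sample fields for illustration.
before = {"dc:title": "T", "prism:volume": "1", "prism:number": "2"}
after = {"dc:title": "T"}
print(check_metadata_loss(before, after))
```

Comparing field names rather than only counts avoids the case where one field is dropped and an unrelated one is added, which would leave the count unchanged.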
Semantic XML diff is not trivial. Semantic XMP diff is more difficult, because some attributes in XMP are shorthand for certain child tags and there are some semantically equivalent or almost-equivalent constructs. I also can't recommend modifying XMP after PDF/A conversion, since most programs (exiftool among them) are incapable of editing PDF/As without breaking conformance.
To be clear, Ghostscript transfers most XMP metadata; it just seems to drop a nonstandard add-on you happen to care about.
I added something to the documentation. You're welcome to submit a PR if you want to help with this issue in some way. Otherwise, this is open source software and, like everyone else, my time is limited. I also provide commercial support for OCRmyPDF, if you need this addressed urgently. |
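As an illustration of the "attributes as shorthand for child tags" point above, the two XMP fragments below differ as text but carry the same property. The Python sketch normalizes both into a dict before comparing. It only handles simple string-valued properties; arrays, language alternatives, and nested structures are ignored, so it is nowhere near a full semantic diff. The helper name `simple_properties` and the `ExampleTool 1.0` value are made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Property written as an attribute on rdf:Description (shorthand form).
ATTRIBUTE_FORM = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xmp="http://ns.adobe.com/xap/1.0/">
  <rdf:Description rdf:about="" xmp:CreatorTool="ExampleTool 1.0"/>
</rdf:RDF>
"""

# The same property written as a child element (long form).
ELEMENT_FORM = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xmp="http://ns.adobe.com/xap/1.0/">
  <rdf:Description rdf:about="">
    <xmp:CreatorTool>ExampleTool 1.0</xmp:CreatorTool>
  </rdf:Description>
</rdf:RDF>
"""


def simple_properties(xml_text):
    """Collect simple string-valued properties from every rdf:Description."""
    props = {}
    root = ET.fromstring(xml_text.strip())
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Shorthand: non-RDF attributes on rdf:Description are properties.
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):
                props[name] = value
        # Long form: leaf child elements with text content are properties.
        for child in desc:
            if len(child) == 0 and child.text and child.text.strip():
                props[child.tag] = child.text.strip()
    return props


print(ATTRIBUTE_FORM == ELEMENT_FORM)
print(simple_properties(ATTRIBUTE_FORM) == simple_properties(ELEMENT_FORM))
```

A byte-level or naive XML diff would report these two packets as different, which is why a metadata-loss check has to compare normalized properties rather than raw XML.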
I looked into this further and found veraPDF reports the following when your input metadata are attached to a PDF/A:
So, Ghostscript had a reason for dropping this metadata: it is not allowed in conformant PDF/A-2b files. The full list of "predefined schemas" veraPDF refers to is reproduced here:
As far as I'm aware, Dublin Core is inclusive of most or all of the "PRISM" metadata that is attached to this file. To retain the metadata, you'd have to find a tool that can rewrite PRISM as Dublin Core.
The next release will issue a warning when some metadata gets lost. |
Thanks. |
Fixed in v8 |
I was experimenting with whether --skip-text --optimize 3 --jbig2-lossy might be a good idea for re-processing my PDFs to save what looks like a ton of space, but I noticed that when a PDF is processed, ocrmypdf drops metadata fields without any warning or instruction?
The documentation makes no mention of metadata being deleted that I can find, and the man page implies all fields will simply be copied over (because why would anything else be done):
Sample PDF: https://www.gwern.net/docs/aspirin/2011-rothwell.pdf
29 vs 52 fields, including meaningful ones like the URL, ISSN, Volume, Number, etc.
The regular output doesn't mention anything about metadata, and the debug --verbose 4 output mentions metadata only in passing, with nothing about deletion or not preserving metadata: