Workflow OCR destroys PDF files #79

Closed
kwisatz opened this issue Dec 6, 2021 · 16 comments · Fixed by #80

kwisatz commented Dec 6, 2021

After postCreate ran Workflow OCR on PDF files on our instance, they all have a size of 0kb. I've posted details here: nextcloud/server#30059

After deleting the flow, PDFs no longer are getting corrupted.

R0Wi (Contributor) commented Dec 6, 2021

Hi @kwisatz and thanks for reporting this. Unfortunately you didn't attach any Nextcloud log files to the issue you mentioned, so it's really hard to say what's going wrong under the hood. What I could imagine is that the ocrmypdf command triggered by the app exits with code 0, making the app think everything went okay, but hits some internal error so that no output file is generated.

Please decrease your loglevel to 2, reproduce the issue and paste some snippets of your data/nextcloud.log here. If you don't want to wait for the next cron tick you can again enable the postCreate workflow, upload a matching file and then run sudo -u www-data php cron.php manually.

Btw: the app always produces a new file version so the original files aren't deleted. They can be rolled back by using the file history, see README.md for further details.

@R0Wi R0Wi self-assigned this Dec 6, 2021
@R0Wi R0Wi added the bug Something isn't working label Dec 6, 2021
R0Wi added a commit that referenced this issue Dec 6, 2021
@R0Wi R0Wi linked a pull request Dec 6, 2021 that will close this issue
R0Wi added a commit that referenced this issue Dec 6, 2021

SKB-CGN commented Dec 8, 2021

Hi,
I have the same problem with NC 22.2.3. The following error is in the log file:

{"reqId":"zZMVtK3R4BSqDQiogDWW","level":2,"time":"2021-12-07T14:00:05+01:00","remoteAddr":"","user":"-greyed-out-","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): {stdErr}, {errorOutput}","userAgent":"--","version":"22.2.3.0","id":"61b05c5186932"}

The original PDF file can be restored via the history, but the current file is corrupt. It is empty.

Currently, every scan I upload needs to be restored via the history.

Furthermore, I receive lots of errors saying that Imagick is not able to create a preview of the file, which makes sense because the file has 0 bytes.

R0Wi (Contributor) commented Dec 8, 2021

Thanks for your feedback. Unfortunately, it seems we have a bug in the logging function, so the contents of stdErr and errorOutput produced by ocrmypdf are currently not logged correctly. What can be said is that ocrmypdf exited after printing something to stderr, which might mean there was an error while processing the file. However, ocrmypdf prints not only errors but also warnings to stderr, so a non-empty stderr might also mean that the process exited correctly with some warnings.

My proposal:

  • In the future we won't accept empty results produced by ocrmypdf. If the result is empty, we'll log a warning and won't create a new file version (so the original file is not touched).
  • Log the values correctly so that the log contains the error message or warning printed by ocrmypdf.

Does that make sense to you?
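
The proposal above could look roughly like the following Python sketch. It is only an illustration, not the app's actual code: `run_and_validate` and the logger name are hypothetical, and it assumes the OCR command reads the PDF from stdin and writes the result to stdout.

```python
import logging
import subprocess
import sys
from typing import Optional, Sequence

logger = logging.getLogger("ocr_sketch")  # hypothetical logger name


def run_and_validate(cmd: Sequence[str], pdf_bytes: bytes) -> Optional[bytes]:
    """Run cmd with pdf_bytes on stdin; return stdout only if it is usable.

    Returns None when the command fails or produces no output, so the
    caller can leave the original file untouched.
    """
    proc = subprocess.run(cmd, input=pdf_bytes, capture_output=True)
    stderr_text = proc.stderr.decode("utf-8", errors="replace").strip()

    if proc.returncode != 0:
        logger.error("command failed with exit code %d: %s",
                     proc.returncode, stderr_text)
        return None

    if stderr_text:
        # stderr may carry warnings as well as errors, so log it and continue
        logger.warning("command succeeded with warning(s): %s", stderr_text)

    if not proc.stdout:
        # exit code 0 but nothing written: do not create a new file version
        logger.warning("command produced no output; keeping the original file")
        return None

    return proc.stdout


if __name__ == "__main__":
    # Stand-ins for the real OCR command, so the sketch runs anywhere
    silent = [sys.executable, "-c",
              "import sys; sys.stderr.write('simulated warning')"]
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    print(run_and_validate(silent, b"%PDF-1.4"))
    print(run_and_validate(echo, b"%PDF-1.4"))
```

The point of the design is that an exit code of 0 alone is never treated as success; an empty result is rejected before anything is written back.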

kwisatz (Author) commented Dec 8, 2021

Sounds plausible to me @R0Wi. Sorry I haven't been able to provide any detailed logs so far, we're pretty busy at the moment. I'll try to supply some this afternoon in case I can actually add some value by logs different from those that @StefCGN supplied.

R0Wi added a commit that referenced this issue Dec 8, 2021

SKB-CGN commented Dec 8, 2021

If I can help you somehow, please let me know.

I have now updated to Nextcloud 23.0.0 and wanted to re-enable the workflow, but this results in an error as well.

First, if you choose the "if" condition for the workflow, it doesn't display the correct values:
[Screenshot: 2021-12-08 09_32_19-Window]

Then, when trying to save it, it throws the following error:
"Configuration is invalid" - "Regular expression is invalid"
[Screenshot: 2021-12-08 09_33_29-Window]

R0Wi added a commit that referenced this issue Dec 8, 2021

R0Wi (Contributor) commented Dec 8, 2021

@StefCGN I think this is a known bug in the workflow base and not directly related to this app, see #41 for details.

@StefCGN @kwisatz thanks for your support. I'll implement the proposed changes and let you know when a new version is available. I would be happy if you could give me some feedback then 👍

@R0Wi R0Wi closed this as completed in #80 Dec 8, 2021
R0Wi added 5 commits that referenced this issue Dec 8, 2021

R0Wi (Contributor) commented Dec 8, 2021

Fix is now available in versions v1.22.4 and v1.23.1, see https://apps.nextcloud.com/apps/workflow_ocr. Glad to hear your results 😎

@R0Wi R0Wi added this to the v1.22.4 milestone Dec 8, 2021

SKB-CGN commented Dec 9, 2021

Hi,
I installed the new app version.

The first test produces the following error:

`[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): A decompression bomb error was encountered while executing the pipeline. Use the argument --max-image-mpixels to raise the maximum image pixel limit.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 189, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 115, in make_intermediate_images
    rasterize_out = rasterize(
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 453, in rasterize
    page_context.plugin_manager.hook.rasterize_pdf_page(
  File "/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
  File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 68, in rasterize_pdf_page
    ghostscript.rasterize_pdf(
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_exec/ghostscript.py", line 124, in rasterize_pdf
    with Image.open(BytesIO(p.stdout)) as im:
  File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2953, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2940, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2849, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
    exec_concurrent(context, executor)
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
    executor(
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/usr/local/lib/python3.8/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 134, in _execute
    for result in results:
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.,`

And this one:
[workflow_ocr] Fehler: OCR for file /[email protected]/files/Dokumente/Test/2020-03-23_LOW_TE (007).pdf not possible. Message: OCRmyPDF did not produce any output

SKB-CGN commented Dec 9, 2021

Second Test:
[workflow_ocr] Fehler: OCR for file /-greyed-out-/files/Dokumente/Test/Fax.pdf not possible. Message: OCRmyPDF did not produce any output

[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] Error opening data file ./eng.traineddata SubprocessOutputError

R0Wi (Contributor) commented Dec 9, 2021

Okay, so it seems like OCRmyPDF has problems processing your specific file. The

OCRmyPDF did not produce any output

error message is always the last one written if the app doesn't receive any output from OCRmyPDF. The more interesting line to me seems to be

PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack

So it might be that the default limit is too low for your file. Could you try to manually execute the ocrmypdf command on your system with the same file? Does it succeed? Did you try playing around with the mentioned --max-image-mpixels parameter?
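
To get a feel for the value needed here, a small Python sketch may help. It assumes that --max-image-mpixels is given in units of one million pixels, as the flag name suggests; `min_mpixels_limit` is a hypothetical helper and the file names are placeholders.

```python
def min_mpixels_limit(image_pixels: int) -> int:
    """Smallest whole number of megapixels that is >= image_pixels.

    Assumes 1 megapixel = 1,000,000 pixels. Uses integer arithmetic
    to round up without floating-point surprises.
    """
    return -(-image_pixels // 1_000_000)


if __name__ == "__main__":
    # Pixel count taken from the DecompressionBombError message above
    pixels = 278_355_200
    limit = min_mpixels_limit(pixels)
    # input.pdf / output.pdf are placeholder file names
    print(f"ocrmypdf --max-image-mpixels {limit} input.pdf output.pdf")
```

For the image in the log this gives 279. Whether it is wise to raise the limit depends on how much you trust the source of the file, since the check exists to guard against oversized images.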

I can also offer to inspect your PDF file if it's possible to send it over.

EDIT: could be related to ocrmypdf/OCRmyPDF#413

kwisatz (Author) commented Dec 9, 2021

Here's the output from my tests with the latest release. I think the message is pretty obvious.
In any case, I think the decision not to replace the PDF in case of an error was the smart one here, regardless of what is causing the errors in the first place.

{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":2,"time":"2021-12-09T08:40:12+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":3,"time":"2021-12-09T08:40:12+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/form-experiments.pdf not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                                       
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/CITP88D.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                      
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/AFF_SECT.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                     
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/CITP88FR.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}   

I know this is a little off-topic, but where or when should ocrmypdf get installed, or users be instructed to install it manually? I just installed it manually on my server, but the app details I checked don't mention that this is necessary.

In any case, it seems that after installing ocrmypdf on the server, Workflow OCR works perfectly.

SKB-CGN commented Dec 9, 2021

I think the document is "too big" because it's a house plan with drawings.
But I also checked a smaller file, which was scanned from a letter.

The output is:
[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] Error opening data file ./eng.traineddata SubprocessOutputError,
[workflow_ocr] Fehler: OCR for file /-greyed-out-/files/Dokumente/Test/2020_11_30_Kellerdecke.pdf not possible. Message: OCRmyPDF did not produce any output

I don't know why this is occurring. I am working on a FreeBSD platform inside a TrueNAS system.
find / -name eng.traineddata
/usr/local/share/tessdata/eng.traineddata

The file is there, but I think it is looking in the wrong path.

Do you perhaps know how to correct that?
OK, so I managed to get the path working:
env TESSDATA_PREFIX=/usr/local/share/tessdata

Now, a new PDF is generated, but the text is not selectable.

The following error shows in Nextcloud Log:
[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] read_params_file: Can't open pdf 2 [tesseract] read_params_file: Can't open txt 1 [tesseract] read_params_file: Can't open pdf 1 [tesseract] read_params_file: Can't open txt,

R0Wi (Contributor) commented Dec 9, 2021

@kwisatz the README mentions that ocrmypdf is necessary in the backend. Which "app details" are missing for you? Maybe I missed something and can improve the docs 😄

@StefCGN this seems like a problem in your server setup, but I can't say what's wrong. The only thing I would test is executing the ocrmypdf command manually on the command line. You could also enable the -v switch to get additional output: ocrmypdf -v --redo-ocr input.pdf output.pdf. Maybe ocrmypdf/OCRmyPDF#209 helps? Sorry, but this is not directly related to this app but rather to the ocrmypdf command-line tool 😞
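
A manual test along these lines could be scripted as in the Python sketch below. It is only a rough aid under stated assumptions: `build_ocr_invocation` is a hypothetical helper, the file names are placeholders, and the tessdata path is simply the one reported earlier in this thread. It also checks whether ocrmypdf is on the PATH at all, which was the cause of the "ocrmypdf: not found" messages above.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Tuple


def build_ocr_invocation(input_pdf: str, output_pdf: str,
                         tessdata_dir: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the verbose ocrmypdf command from this thread plus its environment."""
    env = dict(os.environ)
    # Point tesseract at the directory containing eng.traineddata
    env["TESSDATA_PREFIX"] = tessdata_dir
    cmd = ["ocrmypdf", "-v", "--redo-ocr", input_pdf, output_pdf]
    return cmd, env


if __name__ == "__main__":
    # Placeholder file names; the path is the one found via `find` in this thread
    cmd, env = build_ocr_invocation("input.pdf", "output.pdf",
                                    "/usr/local/share/tessdata")
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf not found on PATH; install it on the server first")
    else:
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        print("exit code:", result.returncode)
        print(result.stderr)
```

Note that the web server or cron user may see a different PATH and environment than your login shell, so it can be worth running the test as that user.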

kwisatz (Author) commented Dec 9, 2021

@R0Wi As a Nextcloud user who can install apps through the Web UI, I would expect a mention of dependencies here:

[Screenshot: Apps-TenTwentyFour-Cloud-Storage]

R0Wi (Contributor) commented Dec 9, 2021

@R0Wi As a Nextcloud user who can install apps through the Web UI, I would expect a mention of dependencies here:

I see, you're right 👍 Will update the docs in the next release.

SKB-CGN commented Dec 9, 2021

@StefCGN this seems like a problem in your server setup, but I can't say what's wrong. The only thing I would test is executing the ocrmypdf command manually on the command line. You could also enable the -v switch to get additional output: ocrmypdf -v --redo-ocr input.pdf output.pdf. Maybe jbarlow83/OCRmyPDF#209 helps? Sorry, but this is not directly related to this app but rather to the ocrmypdf command-line tool 😞

Olé olé :)

I just downloaded the configs directory coming from Github into /tessdata/ and the PDF is created without any errors :)

Will open up an issue with tesseract!

Thanks for pointing me in the right direction.
