Workflow OCR destroys PDF files #79

kwisatz · 2021-12-06T16:48:16Z

After a postCreate workflow ran Workflow OCR on the PDF files on our instance, they all have a size of 0 KB. I've posted details here: nextcloud/server#30059

After deleting the flow, PDFs are no longer getting corrupted.

R0Wi · 2021-12-06T18:30:59Z

Hi @kwisatz, and thanks for reporting this. Unfortunately, you didn't attach any Nextcloud log files to the issue you mentioned, so it's really hard to say what's going wrong under the hood. What I could imagine is that the ocrmypdf command triggered by the app exits with code 0, making the app think everything went okay, but hits some internal error so that no output file is generated.

Please decrease your loglevel to 2, reproduce the issue, and paste some snippets of your `data/nextcloud.log` here. If you don't want to wait for the next cron tick, you can re-enable the postCreate workflow, upload a matching file, and then run `sudo -u www-data php cron.php` manually.
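To pull only the relevant entries out of the log, something like the following Python sketch may help. It assumes the log holds one JSON object per line with `app`, `level`, and `message` keys, as in the entries quoted in this thread; the function name `filter_ocr_entries` is made up for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_ocr_entries(lines: Iterable[str], app: str = "workflow_ocr",
                       min_level: int = 2) -> Iterator[dict]:
    """Yield parsed log entries for one app at or above a given log level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        level = entry.get("level", 0)
        if entry.get("app") == app and isinstance(level, int) and level >= min_level:
            yield entry
```

Passing an open file handle for `data/nextcloud.log` (the path depends on your data directory) as `lines` and printing each entry's `message` should give snippets that are short enough to paste here.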

Btw: the app always produces a new file version, so the original files aren't deleted. They can be rolled back using the file history; see README.md for further details.

Signed-off-by: Robin Windey <[email protected]>

SKB-CGN · 2021-12-08T07:26:27Z

Hi,
I have the same problem with NC 22.2.3. The following error is in the log file:

{"reqId":"zZMVtK3R4BSqDQiogDWW","level":2,"time":"2021-12-07T14:00:05+01:00","remoteAddr":"","user":"-greyed-out-","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): {stdErr}, {errorOutput}","userAgent":"--","version":"22.2.3.0","id":"61b05c5186932"}

The original PDF file can be restored via the history, but the current file is corrupt: it is empty.

Currently, every scan I upload needs to be restored via the history.

Furthermore, I receive lots of errors saying that Imagick is not able to create a preview of the file, which makes sense because the file has 0 bytes.

R0Wi · 2021-12-08T08:14:57Z

Thanks for your feedback. Unfortunately, it seems we have an error inside the logging function, so the contents of stdErr and errorOutput produced by ocrmypdf are currently not logged correctly. What can be said is that ocrmypdf printed something to stderr before exiting, which might mean that there was an error while processing the file. Unfortunately, ocrmypdf prints not only errors but also warnings to stderr, so a non-empty stderr might also mean that the process exited correctly with some warnings.
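For illustration only, here is a small Python sketch of how `{placeholder}`-style log messages are usually filled in from a context mapping. The app itself is written in PHP and its actual logging code is not shown in this thread, so the helper `interpolate` is hypothetical; it only demonstrates why placeholders can end up in the log verbatim when no matching values are supplied.

```python
from typing import Mapping


def interpolate(message: str, context: Mapping[str, object]) -> str:
    """Replace each {key} placeholder with the matching value from context."""
    for key, value in context.items():
        message = message.replace("{" + key + "}", str(value))
    return message


template = "OCRmyPDF succeeded with warning(s): {stdErr}, {errorOutput}"

# With an empty context the placeholders stay literal, as in the log entry above.
unfilled = interpolate(template, {})

# With a matching context the real values reach the log.
filled = interpolate(template, {"stdErr": "some warning", "errorOutput": ""})
```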

My proposal:

1. In the future, we won't accept empty results produced by ocrmypdf. If the result is empty, we'll log a warning and won't create a new file version (so the original file is not touched).
2. Log the values correctly so that the log contains the error message or warning printed by ocrmypdf.

Does that make sense to you?
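The proposal can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the app's PHP code: it assumes the OCR result arrives as bytes together with the exit code and the stderr text, and the names `OcrOutcome` and `evaluate_ocr_result` are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrOutcome:
    new_version: Optional[bytes]  # content to store, or None to leave the file alone
    log_level: str                # "info", "warning" or "error"
    log_message: str


def evaluate_ocr_result(exit_code: int, output: bytes, stderr: str) -> OcrOutcome:
    """Decide whether the OCR output may become a new file version."""
    if exit_code != 0:
        return OcrOutcome(None, "error",
                          f"OCRmyPDF exited with code {exit_code}: {stderr}")
    if not output:
        # Exit code 0 but nothing produced: never store an empty file version.
        return OcrOutcome(None, "warning",
                          f"OCRmyPDF produced an empty result: {stderr}")
    if stderr.strip():
        # Non-empty stderr with real output is treated as a warning only.
        return OcrOutcome(output, "warning",
                          f"OCRmyPDF succeeded with warning(s): {stderr}")
    return OcrOutcome(output, "info", "OCRmyPDF succeeded")
```

The key design choice is that an empty result never becomes a new version, regardless of the exit code, so the original file stays untouched.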

kwisatz · 2021-12-08T08:30:36Z

Sounds plausible to me @R0Wi. Sorry I haven't been able to provide any detailed logs so far, we're pretty busy at the moment. I'll try to supply some this afternoon in case I can actually add some value by logs different from those that @StefCGN supplied.


SKB-CGN · 2021-12-08T08:34:15Z

If I can help you somehow, please let me know.

I have just updated to Nextcloud 23.0.0 and wanted to re-enable the workflow, but this results in an error as well.

First, if you choose the "if" condition for the workflow, it doesn't display the correct values:

Then, when trying to save it, it throws the following error:
"Configuration is invalid" - "Regular expression is invalid"


R0Wi · 2021-12-08T08:38:49Z

@StefCGN I think this is a known bug in the workflow base and not directly related to this app, see #41 for details.

@StefCGN @kwisatz thanks for your support. I'll implement the proposed changes and let you know when a new version is available. I'd be happy if you could give me some feedback then 👍

Signed-off-by: Robin Windey <[email protected]>

R0Wi · 2021-12-08T08:58:03Z

The fix is now available in versions v1.22.4 and v1.23.1, see https://apps.nextcloud.com/apps/workflow_ocr. Looking forward to hearing your results 😎

SKB-CGN · 2021-12-09T07:12:40Z

Hi,
I installed the new version of the app.

The first test produces the following error:

`[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): A decompression bomb error was encountered while executing the pipeline. Use the argument --max-image-mpixels to raise the maximum image pixel limit.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 189, in exec_page_sync
ocr_image, preprocess_out = make_intermediate_images(
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 115, in make_intermediate_images
rasterize_out = rasterize(
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 453, in rasterize
page_context.plugin_manager.hook.rasterize_pdf_page(
File "/usr/local/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in call
return self._hookexec(self, self.get_hookimpls(), kwargs)
File "/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
return self._inner_hookexec(hook, methods, kwargs)
File "/usr/local/lib/python3.8/site-packages/pluggy/manager.py", line 84, in
self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
return outcome.get_result()
File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
raise ex[1].with_traceback(ex[2])
File "/usr/local/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
res = hook_impl.function(*args)
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 68, in rasterize_pdf_page
ghostscript.rasterize_pdf(
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_exec/ghostscript.py", line 124, in rasterize_pdf
with Image.open(BytesIO(p.stdout)) as im:
File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2953, in open
im = _open_core(fp, filename, prefix, formats)
File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2940, in _open_core
_decompression_bomb_check(im.size)
File "/usr/local/lib/python3.8/site-packages/PIL/Image.py", line 2849, in _decompression_bomb_check
raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
exec_concurrent(context, executor)
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
executor(
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 82, in call
self._execute(
File "/usr/local/lib/python3.8/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 134, in _execute
for result in results:
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
raise value
PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.,`

And this one:
[workflow_ocr] Fehler: OCR for file /[email protected]/files/Dokumente/Test/2020-03-23_LOW_TE (007).pdf not possible. Message: OCRmyPDF did not produce any output

SKB-CGN · 2021-12-09T07:23:28Z

Second Test:
[workflow_ocr] Fehler: OCR for file /-greyed-out-/files/Dokumente/Test/Fax.pdf not possible. Message: OCRmyPDF did not produce any output

[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] Error opening data file ./eng.traineddata SubprocessOutputError

R0Wi · 2021-12-09T08:19:09Z

Okay, so it seems OCRmyPDF has problems processing your specific file. The `OCRmyPDF did not produce any output` error message is always the last one written when the app doesn't receive any output from OCRmyPDF. The more interesting line to me seems to be:

PIL.Image.DecompressionBombError: Image size (278355200 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack

So it might be that the default limit is set too low for your file. Could you try running the ocrmypdf command manually on your system with the same file? Does it succeed? Did you try playing around with the --max-image-mpixels parameter mentioned in the message?
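As a rough guide for picking a value to experiment with, the pixel count from the error message can be converted to megapixels. The sketch below is plain arithmetic in Python; the helper name `required_mpixels` is made up for this example.

```python
def required_mpixels(pixel_count: int) -> int:
    """Round a pixel count up to whole megapixels (1 megapixel = 1,000,000 pixels)."""
    return (pixel_count + 999_999) // 1_000_000


reported_pixels = 278_355_200  # image size from the DecompressionBombError
reported_limit = 256_000_000   # limit from the same message

print(reported_pixels > reported_limit)   # True: the page exceeds the limit
print(required_mpixels(reported_pixels))  # 279
```

A manual run such as `ocrmypdf --max-image-mpixels 279 input.pdf output.pdf` would then be one thing to try. How the flag maps onto Pillow's internal limit is not covered in this thread, so treat 279 as a starting point rather than an exact threshold.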

I can also offer to inspect your PDF file if it's possible to send it over.

EDIT: could be related to ocrmypdf/OCRmyPDF#413

kwisatz · 2021-12-09T08:43:13Z

Here's the output from my tests with the latest release. I think the message is pretty obvious.
In any case, I think the move to not replace the PDF in case of an error was the smart thing to do here, regardless of what is causing the errors in the first place.

{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":2,"time":"2021-12-09T08:40:12+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":3,"time":"2021-12-09T08:40:12+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/form-experiments.pdf not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                                       
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/CITP88D.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                      
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AlD9sEoaxJK2iB2vd2xk","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/AFF_SECT.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}                     
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":2,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF succeeded with warning(s): sh: 1: ocrmypdf: not found, ","userAgent":"--","version":"22.2.2.0"}
{"reqId":"AifTKq4NuPN1ARrxKBAx","level":3,"time":"2021-12-09T08:40:13+00:00","remoteAddr":"","user":"08ece906-d7b8-1035-9c8d-97c38bc36d8d","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /08ece906-d7b8-1035-9c8d-97c38bc36d8d/files/Projects/someProject/CITP88FR.PDF not possible. Message: OCRmyPDF did not produce any output","userAgent":"--","version":"22.2.2.0"}

I know this is a little off-topic, but where or when should ocrmypdf get installed, or users be instructed to install it manually? I just installed it manually on my server, but the app details I checked don't say that this is necessary.

In any case, it seems that after installing ocrmypdf on the server, Workflow OCR works perfectly.
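A quick way to confirm whether the backend command is visible at all is to look it up on `PATH`. The following Python sketch uses only the standard library; the function name `check_backend` is hypothetical, and the check is only meaningful when run as the same user that executes the Nextcloud cron job (for example `www-data`), since `PATH` can differ between users.

```python
import shutil


def check_backend(name: str = "ocrmypdf") -> str:
    """Report whether an executable with the given name can be found on PATH."""
    path = shutil.which(name)
    if path is None:
        return f"{name}: not found on PATH; install it on the server first"
    return f"{name}: found at {path}"
```

A "not found" result here corresponds to the `sh: 1: ocrmypdf: not found` message in the log entries above.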

SKB-CGN · 2021-12-09T08:57:44Z

I think the document is "too big", because it's a house plan with drawings.
But I checked a smaller file, which was scanned from a letter.

The output is:
[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] Error opening data file ./eng.traineddata SubprocessOutputError,
[workflow_ocr] Fehler: OCR for file /-greyed-out-/files/Dokumente/Test/2020_11_30_Kellerdecke.pdf not possible. Message: OCRmyPDF did not produce any output

I don't know why this is occurring. I am working on a FreeBSD platform inside a TrueNAS system.
find / -name eng.traineddata
/usr/local/share/tessdata/eng.traineddata

The file is there, but I think it is looking in the wrong path.

Do you perhaps know how to correct that?
OK, so I managed to get the path working:
env TESSDATA_PREFIX=/usr/local/share/tessdata
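To double-check such a setting, a small Python sketch like the one below can verify that the language file is where the environment variable points. It assumes `TESSDATA_PREFIX` names the directory that directly contains the `.traineddata` files, as in the setting above; some Tesseract versions interpret the variable differently. The function name `find_traineddata` is made up for this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_traineddata(lang: str = "eng",
                     env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the path of <lang>.traineddata under TESSDATA_PREFIX, if it exists."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # variable not set for this process
    candidate = Path(prefix) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None
```

Calling `find_traineddata("eng")` in the environment of the web server user returns `None` if the variable is unset there or the file is missing, which would match the "Error opening data file" warning.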

Now a new PDF is generated, but the text is not selectable.

The following warning shows up in the Nextcloud log:
[workflow_ocr] Warnung: OCRmyPDF succeeded with warning(s): 2 [tesseract] read_params_file: Can't open pdf 2 [tesseract] read_params_file: Can't open txt 1 [tesseract] read_params_file: Can't open pdf 1 [tesseract] read_params_file: Can't open txt,

R0Wi · 2021-12-09T09:20:58Z

@kwisatz the README mentions that ocrmypdf is necessary in the backend. In which "app details" is this missing for you? Maybe I missed something and can improve the docs 😄

@StefCGN this seems like a problem with your server setup, but I can't say what's wrong. The only thing I would test is running the ocrmypdf command manually on the command line. You could also enable the -v switch to get additional output: ocrmypdf -v --redo-ocr input.pdf output.pdf. Maybe ocrmypdf/OCRmyPDF#209 helps? Sorry, but this is not directly related to this app but rather to the ocrmypdf command-line tool 😞

kwisatz · 2021-12-09T09:23:09Z

@R0Wi As a Nextcloud user who can install apps through the Web UI, I would expect a mention of dependencies here:

R0Wi · 2021-12-09T09:24:32Z

> @R0Wi As a Nextcloud user who can install apps through the Web UI, I would expect a mention of dependencies here:

I see, you're right 👍 Will update the docs in the next release.

SKB-CGN · 2021-12-09T09:45:55Z

> @StefCGN this seems like a problem with your server setup, but I can't say what's wrong. The only thing I would test is running the ocrmypdf command manually on the command line. You could also enable the -v switch to get additional output: ocrmypdf -v --redo-ocr input.pdf output.pdf. Maybe jbarlow83/OCRmyPDF#209 helps? Sorry, but this is not directly related to this app but rather to the ocrmypdf command-line tool 😞

Olé olé :)

I just downloaded the configs directory from GitHub into /tessdata/ and the PDF is created without any errors :)
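For anyone hitting the same `read_params_file: Can't open pdf` / `Can't open txt` warnings, a check along these lines may help. It is a Python sketch that assumes the config files are named `pdf` and `txt` (taken from the warnings) and live in a `configs` subdirectory of the tessdata directory (taken from the fix described here); the function name `missing_tesseract_configs` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence


def missing_tesseract_configs(tessdata_dir: str,
                              names: Sequence[str] = ("pdf", "txt")) -> List[str]:
    """List the config files that are absent from <tessdata_dir>/configs."""
    configs = Path(tessdata_dir) / "configs"
    return [name for name in names if not (configs / name).is_file()]
```

An empty list means both files are present; anything else names the configs that still need to be copied into place.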

Will open up an issue with tesseract!

Thanks for pointing me in the right direction.

R0Wi mentioned this issue Dec 6, 2021

All PDF files uploaded to Nextcloud downloaded with content-length: 0 nextcloud/server#30059

Closed

R0Wi self-assigned this Dec 6, 2021

R0Wi added the bug Something isn't working label Dec 6, 2021

R0Wi added a commit that referenced this issue Dec 6, 2021

Do not accept empty OCRmyPDF results #79

4ca5e0a

Signed-off-by: Robin Windey <[email protected]>

R0Wi linked a pull request Dec 6, 2021 that will close this issue

Do not accept empty OCRmyPDF results #79 #80

Merged

R0Wi added a commit that referenced this issue Dec 6, 2021

Do not accept empty OCRmyPDF results #79

d9b0132

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79

3db2caf

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79

3a6a4bf

Signed-off-by: Robin Windey <[email protected]>

R0Wi closed this as completed in #80 Dec 8, 2021

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79 (#80)

6b99bcb

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79 (#80)

b5b07d8

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79 (#80)

b54c44e

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79 (#80) (#81)

7e46236

Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Dec 8, 2021

Do not accept empty OCRmyPDF results and fix log #79 (#80) (#82)

a48b192

Signed-off-by: Robin Windey <[email protected]>

R0Wi added this to the v1.22.4 milestone Dec 8, 2021

R0Wi mentioned this issue Dec 9, 2021

Improve systemchecks #83

Closed
