Small difference in compressionratio #31

Open
rmast opened this issue Nov 29, 2021 · 9 comments

Comments

@rmast

rmast commented Nov 29, 2021

See my post after the closed #30

There is a small compression-ratio difference between your setup and mine. Could that signal a memory leak, or some other difference in setup?

@MerlijnWajer
Collaborator

Thanks for the detailed explanation in the other issue; I hope it wasn't too painful to get it all going. (On Ubuntu I got jbig2enc from https://notesalexp.org/packages/en/bionic/amd64/jbig2enc/download.html; on Gentoo, my main system, it's just in the package manager.) Going forward I will try to make a self-contained PyInstaller binary so that the program is easier to use on Linux (or even Windows/OS X).

Differences in compression ratio might not signal a problem. There could be a few reasons:

  • Different kakadu versions compress slightly differently (maybe)
  • Different jbig2enc versions compress slightly differently (I currently use 0.28, not the newer 0.29, I just haven't upgraded yet)
  • Different compression parameters (unlikely)
  • Different PDF metadata (again unlikely, you don't specify it)
  • Slightly different hOCR input / text data (again unlikely)

I just upgraded my jbig2enc to 0.29 and it doesn't make a difference. If you can share your output PDF with me I can compare. I used the files from your Documents.zip.

@MerlijnWajer
Collaborator

It is very unlikely that the problem is a memory leak, for what it's worth.

@MerlijnWajer
Collaborator

outfa.pdf

Here is the PDF I get, for what it is worth.

@rmast
Author

rmast commented Nov 30, 2021

The exact download of Kakadu I used is specified in my commands. I'll compare the contained jp2-images when I get the chance.

@rmast
Author

rmast commented Nov 30, 2021

The relevant part of Kakadu consists of kdu_compress, kdu_expand and one shared library, which you can find with

ldd `which kdu_expand`

@MerlijnWajer
Collaborator

I tried Kakadu 8.0.5 instead of the 8.0.3 I had, and the result is the same: I get the same PDF. The only differences are the Kakadu version encoded in the JPEG2000, the XMP metadata and the PDF IDs. The compression ratio is still 39.780225.

Maybe it's just a floating-point thing. If you can share the PDF, I can diff it, or look at the differences between the one I shared and yours. diff -a foo.pdf bar.pdf might help. You can also use the pdfimagesmrc tool that comes with this software, but by default it rounds the image sizes to two digits.

@rmast
Author

rmast commented Nov 30, 2021

This is the pdf from the build I made:
outfa.pdf

@rmast
Author

rmast commented Nov 30, 2021

There are differences between your file and mine in the size of the pictures:

oem@Robert:~/vergelijk$ pdfimages -all ../outfa-Merlijn.pdf Merlijn
oem@Robert:~/vergelijk$ pdfimages -all ../outfa.pdf mijn
oem@Robert:~/vergelijk$ ls -al
totaal 58736
drwxrwxr-x  2 oem oem     4096 nov 30 19:33 .
drwxr-xr-x 22 oem oem     4096 nov 30 19:30 ..
-rw-rw-r--  1 oem oem      457 nov 30 19:32 Merlijn-000.jp2
-rw-rw-r--  1 oem oem     6248 nov 30 19:32 Merlijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:32 Merlijn-002.jb2e
-rw-rw-r--  1 oem oem      538 nov 30 19:32 mijn-000.jp2
-rw-rw-r--  1 oem oem     6241 nov 30 19:33 mijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:33 mijn-002.jb2e
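
A comparison like the listing above can also be scripted. What follows is only a minimal Python sketch, not part of this software: the function names are made up for illustration, and it assumes both PDFs were extracted with pdfimages -all into one directory using the prefixes shown above. It pairs files by their suffix and reports both sizes plus whether the bytes are identical, since equal sizes (as with the two .jb2e files) do not by themselves prove equal content.

```python
from pathlib import Path


def index_by_suffix(directory, prefix):
    """Map '000.jp2' -> Path for every file named '<prefix>-000.jp2'."""
    found = {}
    for path in sorted(Path(directory).glob(f"{prefix}-*")):
        # Strip '<prefix>-' so files from both PDFs share the same key.
        found[path.name[len(prefix) + 1:]] = path
    return found


def compare_extracted(directory, prefix_a, prefix_b):
    """Return (suffix, size_a, size_b, identical) for each shared suffix."""
    files_a = index_by_suffix(directory, prefix_a)
    files_b = index_by_suffix(directory, prefix_b)
    rows = []
    for suffix in sorted(files_a.keys() & files_b.keys()):
        data_a = files_a[suffix].read_bytes()
        data_b = files_b[suffix].read_bytes()
        rows.append((suffix, len(data_a), len(data_b), data_a == data_b))
    return rows


if __name__ == "__main__":
    # Prefixes taken from the pdfimages commands above.
    for suffix, size_a, size_b, same in compare_extracted(".", "Merlijn", "mijn"):
        status = "identical" if same else "different"
        print(f"{suffix}: {size_a} vs {size_b} bytes ({status})")
```

If the .jb2e entries come out byte-identical, the masks are the same and the difference is limited to the two JPEG2000 layers.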

@MerlijnWajer
Collaborator

Well, they are clearly different. It is possible that the Sauvola binarisation code produces slightly different masks (since it's compiled with -Ofast), but then I would have expected the jb2e files to differ in size, which they do not. The other Cython code uses ints only.
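
As background on why -Ofast could in principle change a mask: in GCC, -Ofast enables -ffast-math, which allows the compiler to reorder floating-point operations, and floating-point addition is not associative. The Python sketch below is only an illustration of that effect, not the Sauvola code from this project; the function names and the tiny "window" are invented for the example. It sums the same three values in two orders and shows that a pixel sitting exactly on the threshold can land on either side.

```python
def sum_forward(values):
    """Accumulate left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_backward(values):
    """Accumulate right to left (a reordering a fast-math build may pick)."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


def at_or_above(pixel, threshold):
    """Hypothetical binarisation test: True if the pixel reaches the threshold."""
    return pixel >= threshold


window = [0.1, 0.2, 0.3]
threshold_a = sum_forward(window)
threshold_b = sum_backward(window)
pixel = 0.6

# The two thresholds differ only in the last bit, yet the decision flips.
print(threshold_a == threshold_b)
print(at_or_above(pixel, threshold_a), at_or_above(pixel, threshold_b))
```

A flipped pixel would change the mask itself, which is why an identical jb2e would point away from this explanation.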

The current code also has an option to dump these items to a directory when creating the PDF (--out-dir), but it only stores the compressed JPEG2000 files. That's not useful here, since we can get those from the PDF anyway. I will have to change the code to also dump the files as PNG (or similar) so that we can see whether the images already differ before being encoded to JPEG2000.
