Small difference in compressionratio #31

rmast · 2021-11-29T22:43:41Z

See my comment on the closed issue #30.

There is a small compression-ratio difference between your setup and mine. Could that signal a memory leak, or some other difference in setup?

MerlijnWajer · 2021-11-29T23:05:04Z

Thanks for the detailed explanation in the other issue; I hope it wasn't too painful to get it all going. (On Ubuntu I got jbig2enc from https://notesalexp.org/packages/en/bionic/amd64/jbig2enc/download.html; on Gentoo, my main system, it's just in the package manager.) Going forward I will try to make a self-contained PyInstaller binary so that the program is easier to use on Linux (or even Windows/OS X).

Differences in compression ratio might not signal a problem. There could be a few reasons:

Different kakadu versions compress slightly differently (maybe)
Different jbig2enc versions compress slightly differently (I currently use 0.28, not the newer 0.29, I just haven't upgraded yet)
Different compression parameters (unlikely)
Different PDF metadata (again unlikely, you don't specify it)
Slightly different hOCR input / text data (again unlikely)

I just upgraded my jbig2enc to 0.29 and it doesn't make a difference. If you can share your output PDF with me I can compare. I used the files from your Documents.zip.

MerlijnWajer · 2021-11-29T23:08:01Z

It is very unlikely that the problem is a memory leak, for what it's worth.

MerlijnWajer · 2021-11-29T23:08:21Z

outfa.pdf

Here is the PDF I get, for what it is worth.

rmast · 2021-11-30T05:53:43Z

The exact download of Kakadu I used is specified in my commands. I'll compare the contained jp2-images when I get the chance.

rmast · 2021-11-30T05:55:22Z

The relevant part of Kakadu consists of kdu_compress, kdu_extract and one shared library, which you can find with

ldd `which kdu_extract`

MerlijnWajer · 2021-11-30T10:49:42Z

I tried Kakadu 8.0.5 instead of the 8.0.3 that I had, and the result is the same: I get the same PDF. The only differences are the Kakadu version encoded in the JPEG2000 data, the XMP metadata and the PDF IDs. The compression ratio is still 39.780225.

Maybe it's just a floating point thing. If you can share the PDF I can diff, or look at the differences between the one I shared and yours. diff -a foo.pdf bar.pdf might help. You can also use the pdfimagesmrc tool that comes with this software, but by default it rounds off the image sizes to two digits.
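As a rough complement to diff -a, a byte-level comparison can show where two PDFs first diverge. The sketch below is only an illustration: the function names are hypothetical and not part of this software, and the paths in the usage note are the placeholder names from the comment above.

```python
def differing_offsets(a: bytes, b: bytes, limit: int = 10) -> list[int]:
    """Return up to `limit` byte offsets at which `a` and `b` differ."""
    offsets = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            offsets.append(i)
            if len(offsets) == limit:
                return offsets
    # A length mismatch counts as one more difference, located at the
    # end of the shorter input.
    if len(a) != len(b) and len(offsets) < limit:
        offsets.append(min(len(a), len(b)))
    return offsets


def compare_files(path_a: str, path_b: str, limit: int = 10) -> list[int]:
    """Read both files fully and report the first differing offsets."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return differing_offsets(fa.read(), fb.read(), limit)
```

Calling compare_files("foo.pdf", "bar.pdf") returns an empty list for identical files. Once an embedded stream changes length, every later byte shifts, so only the first reported offset is meaningful in that case.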

rmast · 2021-11-30T18:27:53Z

This is the pdf from the build I made:
outfa.pdf

rmast · 2021-11-30T18:36:53Z

There are differences between your file and mine in the size of the pictures:

oem@Robert:~/vergelijk$ pdfimages -all ../outfa-Merlijn.pdf Merlijn
oem@Robert:~/vergelijk$ pdfimages -all ../outfa.pdf mijn
oem@Robert:~/vergelijk$ ls -al
totaal 58736
drwxrwxr-x  2 oem oem     4096 nov 30 19:33 .
drwxr-xr-x 22 oem oem     4096 nov 30 19:30 ..
-rw-rw-r--  1 oem oem      457 nov 30 19:32 Merlijn-000.jp2
-rw-rw-r--  1 oem oem     6248 nov 30 19:32 Merlijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:32 Merlijn-002.jb2e
-rw-rw-r--  1 oem oem      538 nov 30 19:32 mijn-000.jp2
-rw-rw-r--  1 oem oem     6241 nov 30 19:33 mijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:33 mijn-002.jb2e
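
The same comparison can be scripted. This is a minimal sketch that pairs files by the part of the name after the prefix given to pdfimages and reports the size difference per pair; the helper names are hypothetical.

```python
import os


def sizes_in(directory: str) -> dict[str, int]:
    """Map each regular file name in `directory` to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def size_deltas(sizes: dict[str, int], prefix_a: str, prefix_b: str) -> dict[str, int]:
    """For files named prefix_a+X and prefix_b+X, return X -> size_b - size_a."""
    deltas = {}
    for name, size_a in sizes.items():
        if not name.startswith(prefix_a):
            continue
        suffix = name[len(prefix_a):]
        other = prefix_b + suffix
        if other in sizes:
            deltas[suffix] = sizes[other] - size_a
    return deltas
```

With the listing above, size_deltas(sizes_in("."), "Merlijn-", "mijn-") gives {"000.jp2": 81, "001.jp2": -7, "002.jb2e": 0}: both JPEG2000 layers differ in size, while the JBIG2 mask is the same size in both files.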

MerlijnWajer · 2021-11-30T18:43:10Z

Well, it looks like they are clearly different. It is possible that the Sauvola binarisation code produces slightly different masks (since it's compiled with -Ofast), but then I would have expected the jb2e to be of a different size, which it is not. The other Cython code uses ints only.
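
For background on why -Ofast is a suspect: in GCC it enables -ffast-math, which lets the compiler reorder floating-point operations, and floating-point addition is not associative. The sketch below only illustrates that general effect with a made-up threshold; it is not the binarisation code, and nothing in this thread shows that the masks actually differ.

```python
# The same three values summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Hypothetical threshold, chosen only to sit at the rounding boundary.
threshold = 0.6

# The two sums differ in the last bit, so a comparison against a value
# at the boundary can go either way depending on evaluation order.
above_a = left_to_right > threshold
above_b = right_to_left > threshold

print(left_to_right == right_to_left)
print(above_a, above_b)
```

A pixel whose local statistics land this close to the threshold could end up on either side of it depending on evaluation order, which is how two builds might produce masks that differ in a few pixels.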

The current code also has an option to dump these items to a directory when creating the PDF (--out-dir), but it only stores the compressed JPEG2000 files, so that's not useful, since we can just get those from the PDF. I will have to change the code to also dump the files as png (or similar) so that we can see if the files are different before being encoded to JPEG2000.
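
As a sketch of what such a dump could look like, the following writes raw 8-bit grayscale pixels as a PNG using only the Python standard library. It assumes the layer is available as a row-major byte buffer; the function names are hypothetical and this is not the project's code.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def gray_png_bytes(pixels: bytes, width: int, height: int) -> bytes:
    """Encode raw 8-bit grayscale pixels (row-major) as PNG file contents."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match width * height")
    # Each scanline is prefixed with filter type 0 (no filtering).
    rows = b"".join(
        b"\x00" + pixels[y * width:(y + 1) * width] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale),
    # compression 0, filter 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )
```

Because PNG is lossless, two dumps made on different machines can be decoded and compared pixel by pixel to see whether the layers already differ before JPEG2000 encoding.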
