TIFF with JPEG Compression not readable by Tesseract #540
Comments
I might be wrong, but could that be something like your source image has 4 channels, RGB and alpha, and the writer has some issues with the alpha channel when writing JPEG compression?
No, I checked that. The BufferedImage type is TYPE_3BYTE_BGR.
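A quick way to confirm this kind of check is to ask the image itself. The following is a minimal sketch using only standard `java.awt.image` calls; the class name `AlphaCheck` and its helper methods are hypothetical and exist only for illustration.

```java
import java.awt.image.BufferedImage;

class AlphaCheck {
    /** Returns true if the image's color model carries an alpha channel. */
    static boolean hasAlpha(BufferedImage image) {
        return image.getColorModel().hasAlpha();
    }

    /** Returns the number of bands (samples per pixel) in the backing raster. */
    static int bandCount(BufferedImage image) {
        return image.getRaster().getNumBands();
    }
}
```

For a `TYPE_3BYTE_BGR` image, `hasAlpha` is expected to be false and `bandCount` to be 3, so an alpha channel in the source would not explain the problem.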
Could you please attach some samples of the sources and a generated image?
Here it is. Hopefully this will help to find the issue. "img001.tif" and "img002.tif" are combined into "target.tiff" by the Java class. I run Tesseract 4.1.1 on target.tiff like this: tesseract target.tiff out.pdf pdf
Sorry, the file was too large. Here it is again.
The output file looks pretty normal. There's nothing unexpected there. But I can say that GIMP throws this error for every TIFF with JPEG compression I could find. The images are read normally, so I have no idea what it could interpret as an extra sample.
I've saved img001.tif with GIMP as a new TIFF file with JPEG compression, then reopened it. There is no warning, so this is not always the case.
Thanks guys for looking into this! @keinhaar Can you attach the same image (target.tiff, with both pages), but after re-saving with GIMP, so I can have a look at the differences? I don't understand the error message from GIMP either, as the TIFF structure has … It opens fine in all the tools I have available. But... there's always the chance that we have missed something.
target-gimp.zip
I think Schmidor did it already, but to be safe...
Okay... So GIMP is a bit more sophisticated than our writer, in that it writes JPEGTables and Strips (and stores a lot of "unnecessary" extra information, like document name, thumbnail, Exif and sRGB ICC profile). But the main differences are that it uses photometric RGB and stores 4 components, where the extra sample is (associated, aka premultiplied) alpha, even though the image is fully opaque. I don't know why GIMP does this, or why Tesseract likes this better though... Most software I have displays these images the same... We could probably add some options to force RGB mode for JPEGs... And I think you should get 4 components with associated alpha with the reader as-is, if you use … (Side note: Despite all the extra information, the GIMP file is about half the size of ours... Probably due to higher JPEG compression, but might be worth looking into...)
Okay, I think I found the bug in the Gimp code: file-tiff-load.c:262. It wrongly assumes (from the comment):
That is, it ignores YCbCr (like in our case), Separated (CMYK) and CIELab, which have multiple channels... It seems the only problem is the warning though; the files (as you mentioned) otherwise load just fine. Update: Filed GIMP issue 5081.
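To illustrate the point about multi-channel photometric interpretations, here is a rough sketch of how a reader could derive the number of extra samples from the PhotometricInterpretation and SamplesPerPixel tags. This is not GIMP's code or this library's code; the class `ExtraSamplesCheck` and its methods are hypothetical. The numeric tag values are the ones defined in the TIFF 6.0 specification, and the Separated case assumes a four-ink CMYK ink set.

```java
class ExtraSamplesCheck {
    // PhotometricInterpretation values as defined by TIFF 6.0
    static final int WHITE_IS_ZERO = 0;
    static final int BLACK_IS_ZERO = 1;
    static final int RGB = 2;
    static final int PALETTE = 3;
    static final int TRANSPARENCY_MASK = 4;
    static final int SEPARATED = 5; // usually CMYK
    static final int YCBCR = 6;
    static final int CIELAB = 8;

    /** Number of color samples implied by the photometric interpretation alone. */
    static int baseSamples(int photometric) {
        switch (photometric) {
            case WHITE_IS_ZERO:
            case BLACK_IS_ZERO:
            case PALETTE:
            case TRANSPARENCY_MASK:
                return 1;
            case RGB:
            case YCBCR:
            case CIELAB:
                return 3;
            case SEPARATED:
                return 4; // assumption: CMYK; other ink sets may use a different count
            default:
                throw new IllegalArgumentException("Unknown photometric interpretation: " + photometric);
        }
    }

    /** Samples beyond the base color samples; these should be described by the ExtraSamples tag. */
    static int extraSamples(int photometric, int samplesPerPixel) {
        return Math.max(0, samplesPerPixel - baseSamples(photometric));
    }
}
```

Under this logic a 3-sample YCbCr image has zero extra samples, so a warning about a missing ExtraSamples field would not be justified for it, while a 4-sample RGB image (as written by GIMP) has exactly one.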
Thanks for these deep insights. I tried to use another color model as mentioned, but it gives an error when writing the final TIFF:
Exception in thread "main" javax.imageio.IIOException: Invalid argument to native writeImage
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1067)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:363)
    at com.twelvemonkeys.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:162)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writePage(TIFFImageWriter.java:245)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writeToSequence(TIFFImageWriter.java:954)
    at de.exware.scan.TiffTool.concatTiffs(TiffTool.java:52)
    at de.exware.scan.TiffTool.main(TiffTool.java:61)
@keinhaar Thanks for trying that out. Maybe you could post your code as a failing test case, and I'll see if this is something that can be fixed? And yes, ultimately, JPEG read/write is handled by native code, which for any Oracle JVM is a modified libJPEG AFAIK. Usually, we can get around those issues by writing a raster instead of the full image, and just populating the metadata correctly ourselves (like I did for CMYK JPEG read/write).
The code is still the same as in sample.zip. I just created a new BufferedImage of the type you requested and drew the original image onto it with the g2d context.
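The conversion described here can be sketched with standard AWT calls as follows. This is an illustration only, not the code from sample.zip; the class `ImageTypeConverter` and the method `convert` are hypothetical names.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class ImageTypeConverter {
    /** Copies the source into a new image of the given BufferedImage.TYPE_* constant. */
    static BufferedImage convert(BufferedImage source, int targetType) {
        BufferedImage target = new BufferedImage(source.getWidth(), source.getHeight(), targetType);
        Graphics2D g2d = target.createGraphics();
        try {
            // Draw the original at the origin; no scaling is applied
            g2d.drawImage(source, 0, 0, null);
        } finally {
            // Always release the graphics context
            g2d.dispose();
        }
        return target;
    }
}
```

For example, `ImageTypeConverter.convert(image, BufferedImage.TYPE_4BYTE_ABGR_PRE)` would produce a 4-band image with premultiplied alpha. Whether a given writer accepts that type is a separate question, as the stack trace above shows.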
If I create a multipage TIFF with JPEG compression, it is not readable by Tesseract.
It gives this error:
"Error in pixReadFromTiffStream: bad tiff file: tiffbpl is too small"
Other compressions like LZW or Deflate work just fine.
GIMP also gives an error, but still opens the TIFF. I'll try to translate the message, because my GIMP is set to German. Something like "Incompatible TIFF: Additional Channels without Field ExtraSamples"
My code looks like this
Is there something wrong with my code?
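For readers following along, a multipage TIFF with an explicit compression type is typically written through the standard `javax.imageio` sequence API roughly as shown here. This is a sketch under assumptions, not the reporter's actual code (that is in the attached sample.zip): the class `MultipageTiffSketch` and method `write` are hypothetical, a TIFF `ImageWriter` is assumed to be installed (via this library or a JDK that bundles one), and the compression type names `"JPEG"`, `"LZW"` and `"Deflate"` are assumed to be offered by that writer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

class MultipageTiffSketch {
    /** Writes all pages into one TIFF file using the given compression type name. */
    static void write(List<BufferedImage> pages, File target, String compressionType) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("TIFF");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF ImageWriter installed");
        }
        ImageWriter writer = writers.next();

        // The output stream does not truncate an existing file, so remove it first
        Files.deleteIfExists(target.toPath());

        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(out);

            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType(compressionType);

            // Each writeToSequence call appends one page to the file
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                writer.writeToSequence(new IIOImage(page, null, null), param);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }
}
```

Calling `MultipageTiffSketch.write(pages, new File("target.tiff"), "JPEG")` would correspond to the failing case described above, and swapping in `"LZW"` or `"Deflate"` to the cases reported as working.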