Usefulness of MRC for decent quality compression of scanned book pages with illustrations #33
I took a look and I have a few thoughts. The damage to the photos comes mostly from parts of the photo being marked as background and others as foreground. Ultimately, MRC is not ideal for photos, but I think we can come up with something that is quite workable if we can figure out what parts are just images.
If we have a good idea of what is text and what is photo, we can attempt to use JPEG2000 Region of Interest encoding, and we will also have the mask exclude any/all parts of the photo. Then we can encode the photo as part of the background, and try to get higher quality at the regions where we think we have photos. openjpeg/grok has some form of ROI and kakadu also has So to summarise, I think we can make the software handle this better if it knows which regions are images. Ideally we get that from the hOCR file, but we can think of another way to provide that information through scantailor (or custom code). BTW: You can still get better compression currently than pdfbeads by just providing higher quality Useful links: |
Suggestion to perform roi like this:
I could give that a try later this week. |
I've not seen that working either. @trufanov-nok has some similar work on getting those split files in a .djvu, but I've not tried them yet. |
"Merging" them as layers is not possible in PDFs, but you can have images on top of each other with transparency. Or you can merge them before you insert them. But that wouldn't be necessary if we try use some of my above comments. I have never used scantailor but it looks cool, maybe we can support using scantailor to clean up documents some. |
I saw JPEG2000 also has a composite JPM format, meant for MRC. I don't know if that has more possibilities than PDF already has, but as JPEG2000 is part of PDF one would expect those JPM-possibilities to be usable in a PDF. |
I don't think it really matters, you'd still be encoding the JBIG2 and JPEG2000 images separately in the JPM (which is what we do in the PDFs too, at little overhead), but JPM support is nonexistent in almost all the tooling as far as I can tell, making it not a great thing to target. |
@rmast The option is available with Scantailor Advanced, which is still overall better than ScanTailor Universal. https://github.com/4lex4/scantailor-advanced/releases I used Scantailor Advanced's It also has a PDFbeads performs the splitting separately, though based on Scantailor mixed output (though IIRC it can do something on its own with some options). @MerlijnWajer Do you happen to be aware of any existing tooling that can do one of the things that pdfbeads does, namely make the JBIG2 a transparent layer in the PDF (I guess that's what happens, then the downscaled image gets underlaid), but from a specific JBIG2 tiff input? |
Not sure if I follow. The way archive-pdf-tools works, overly simplified:
This is actually visible if your computer is sufficiently slow: first the background image will finish decompressing, at which point you will see it, and only later the "text" (foreground) layer appears. So it sounds like it does what you're suggesting, right? |
Or rather, adding an image with JBIG2 as transparency layer is what it does already -- so we have code that can do it, iiuc. |
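The layering described in the two comments above can be illustrated with a small sketch. It is not the archive-pdf-tools code: the function name `composite_mrc` is hypothetical, and it assumes all three layers have already been scaled to the same size (in a real MRC PDF the background is usually stored at a lower resolution than the mask). It only shows the general MRC rule that a set mask pixel shows the foreground and an unset one shows the background.

```python
def composite_mrc(mask, foreground, background):
    """Recombine three same-sized MRC layers into one page, pixel by pixel.

    mask:        rows of 0/1 values (1 = text pixel, as in the JBIG2 layer)
    foreground:  rows of grey values giving the colour of the text
    background:  rows of grey values for everything behind the text
    """
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# Tiny 2x2 example: two text pixels on a light background.
mask = [[0, 1], [1, 0]]
foreground = [[10, 10], [10, 10]]
background = [[200, 210], [220, 230]]
page = composite_mrc(mask, foreground, background)
print(page)
```

A PDF viewer does the same thing when it draws the background image first and then paints the foreground image through the mask.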
This discussion popped up in my notifications, and I'm not sure if this is relevant, but I would like to note that ScanTailor is a kind of semi-automatic text-to-image segmenter. By default it outputs a single image. But it seems that 12 years ago the author decided to reserve pure white (0x??FFFFFF) and pure black (0x??000000) for the text parts, so those colors never occur in the illustration parts of the result. In other words, if ScanTailor treats something as an illustration, its pixels can take any color except those two, and you can't expect to find pure black or pure white pixels there. It seems the "export" functionality was introduced 7 years ago in ScanTailor Featured and was inherited as legacy by the currently active forks, Advanced and Featured. Basically it just reads the output image pixel by pixel and writes it to two different image files: one with only the black and white pixels, one with everything else. We call them "layers", but that is just a reference to the so-called "method of separate layers": an approach to assembling DjVu documents that works around the fact that open-source DjVu encoders lack text-to-image segmenters entirely and the commercial one can make mistakes. One of the output files gets a ".sep" suffix, and such a pair of images is designed to be used with the "DjVu Imager" application. The idea is to encode (with a commercial or open-source encoder) the bundled b/w DjVu document and later automatically insert the illustrations into it with DjVu Imager, by matching the ".sep" files to the corresponding pages by filename. So you don't need to rely on automatic segmentation at all, which gives you the best-looking illustrations. |
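The pixel-by-pixel split described above can be sketched in a few lines. This is an illustration, not ScanTailor's export code: the function name `split_mixed_output` is hypothetical, the image is modelled as nested lists of RGB tuples, and filling the removed pixels with white in each layer is an assumption.

```python
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)


def split_mixed_output(pixels):
    """Split a mixed page into a bitonal text layer and a picture layer.

    Pure black and pure white are treated as reserved for text, so every
    other colour is assumed to belong to an illustration.
    """
    text_layer, picture_layer = [], []
    for row in pixels:
        text_row, picture_row = [], []
        for rgb in row:
            if rgb == BLACK or rgb == WHITE:
                text_row.append(rgb)        # reserved colour: keep in the b/w layer
                picture_row.append(WHITE)   # and leave a blank in the picture layer
            else:
                text_row.append(WHITE)      # blank out the illustration in the b/w layer
                picture_row.append(rgb)     # keep the illustration pixel
        text_layer.append(text_row)
        picture_layer.append(picture_row)
    return text_layer, picture_layer


page = [[BLACK, WHITE, (120, 80, 40)],
        [(1, 1, 1), BLACK, WHITE]]
text, picture = split_mixed_output(page)
```

Note that a near-black pixel such as (1, 1, 1) ends up in the picture layer, which is exactly why the two reserved colours make the split unambiguous.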
So one of the issues with background pictures containing fuzz behind the foreground is not possible with this reserveBlackAndWhite output. I don't think the surrounding pixels of that reservedBlackAndWhite are cleaned up by ScanTailor as is documented with DjVu, by meeting vectors of gradients in the original picture only extending the background-vector. |
This is relevant to my interests. I'm also curious to see if ROI compressing a scan to JP2 using the mask If it helps, using a mask image with
|
@Redsandro - right, please feel free to try and toy around with |
I don’t think a 1:1 mask generated from a binarized picture would reveal regions of interest. Most of the page is just fuzzy black or fuzzy white. A page with a fuzzy signature could probably benefit from this ROI for the signature. Question is whether recognition of those ROI spots can be automated or needs a manual activity.
|
If you can get ROI encoding working in Kakadu, I can add support for the |
@rmast commented:
After toying around, I observe the
kdu_compress -i in.tif -o out.jp2 -no_weights -rate 0.5 Creversible=no Rweight=16 Rlevels=2 -roi mask.pgm,0.5
By default the ROI mask consists of 128x128 pixel patches. To make a more accurate mask, you need to set The thing to keep in mind though is that setting a lower If you want to know more about flags, this is helpful, although some of the defaults are different on my build/machine. |
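For scripted experiments, the Kakadu invocation quoted in the comment above can be assembled as an argument list. A minimal sketch that reuses only the flags appearing in that comment; the function name `build_kdu_roi_command` and its parameters are hypothetical, and exposing the number after the mask file as a tunable value is an assumption about what it controls.

```python
import shlex


def build_kdu_roi_command(src, dst, mask, rate=0.5, mask_value=0.5):
    """Return the kdu_compress argument list from the comment, parameterised."""
    return [
        "kdu_compress",
        "-i", src,
        "-o", dst,
        "-no_weights",
        "-rate", str(rate),
        "Creversible=no",
        "Rweight=16",
        "Rlevels=2",
        # The mask image and the value after the comma are passed together.
        "-roi", f"{mask},{mask_value}",
    ]


cmd = build_kdu_roi_command("in.tif", "out.jp2", "mask.pgm")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where Kakadu is installed; keeping it as a list avoids shell quoting problems with file names.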
Hey, looks like you actually got it to work. That's great. I'll try to look at how we can use/integrate this to compress better (accuracy / size). |
Thinking about how to compress the PostNL bill that I used before as a test subject, I could imagine using the square ocr_photo frame around the logo as the high-density ROI. The grey dithered drawings at the bottom are not detected as ocr_photo by tesseract, so they would just end up greyish outside the ROI. With the ROI applied to the background picture, I would expect the logo in the background to keep some quality. I'm curious whether this would give a better quality/compression trade-off. |
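Turning such photo frames into a mask image is straightforward. A minimal sketch, assuming the boxes are already available as `(x0, y0, x1, y1)` pixel coordinates in the order hOCR uses for `bbox`; parsing the hOCR file is left out, and the function name `roi_mask_pgm` is hypothetical. It writes a binary PGM (P5) with 255 inside every box and 0 elsewhere.

```python
def roi_mask_pgm(width, height, boxes):
    """Return the bytes of a binary PGM mask: 255 inside any box, 0 elsewhere."""
    rows = [bytearray(width) for _ in range(height)]  # all zeros to start
    for x0, y0, x1, y1 in boxes:
        # Clip the box to the page so stray coordinates cannot overflow a row.
        left, right = max(0, x0), min(width, x1)
        if right <= left:
            continue
        for y in range(max(0, y0), min(height, y1)):
            rows[y][left:right] = b"\xff" * (right - left)
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in rows)


# One hypothetical photo frame on a 4x3 page.
data = roi_mask_pgm(4, 3, [(1, 1, 3, 2)])
```

Writing `data` to a file such as `mask.pgm` gives a mask in the same format as the one passed to `-roi` earlier in the thread.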
@Redsandro - JFYI I still plan to work on this, I just had a long work trip and am only just coming back to this, and these kinds of improvements are more or less spare-time projects. Maybe in a few days I can make a branch with this integrated. One thing we'll need is some testing framework to do comparisons (PSNR, SSIM, etc). I think I made a start with that, so we could compare to see how well ROI helps with compression ratios and quality. |
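As a starting point for such comparisons, PSNR is simple enough to compute without extra dependencies. This is a sketch rather than the test framework mentioned above: the function name `psnr` is hypothetical, and images are modelled as nested lists of 8-bit grey values of equal size.

```python
import math


def psnr(reference, candidate, peak=255):
    """Peak signal-to-noise ratio in dB between two same-sized grey images."""
    ref = [p for row in reference for p in row]
    cand = [p for row in candidate for p in row]
    if len(ref) != len(cand) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, cand)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


original = [[100, 110], [120, 130]]
decoded = [[101, 109], [120, 132]]
print(round(psnr(original, decoded), 2))
```

Higher values mean the decoded image is closer to the original. To judge whether ROI helps, it would make sense to compute the score separately over the photo regions and over the rest of the page, at the same file size.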
No problem. I understand. You may want to manually try what you had in mind initially to see if it is roughly as useful as we hope. Because if the quality over compression ratio is really not close to interesting when keeping in mind the code block size limitations, it would be a waste to set up a lot of scaffolding. |
The grey ABN AMRO-text on top of the ABN-AMRO-letter is recognized by tesseract as text size 75 in a bounding box. The shield-logo to the left appears to be recognized as an apostrophe in a bounding box. |
I figured we'd still use the JBIG2, and just get more quality for the parts that we care about. We could see how it works without the mask, but I'm a bit sceptical. |
So all text will be masked by JBIG2 colored by a low res mask coloring picture and photo-elements will get ROI attention on the background picture. Usually that means text 300 dpi, background picture 100 dpi?
|
I have pushed some code here: https://github.com/internetarchive/archive-pdf-tools/tree/roi The wheels will end up here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276084863 I can run it like so (for testing purposes):
ROI mode is currently enabled when "Creversible=no" is found in the flags (literally) - and that is a hack, I know. The background seems to improve with the mask, the foreground not so much? (For the background, we use the inverted mask) - I hope I didn't swap the inverted-ness for background/foreground. With the above parameters the size is about the same as without roi and default slope values. Maybe give it a try? |
I have also pushed a commit where I swapped the inverted-ness, which will build separately as an action: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276139177 I also changed the rate from
vs the 'normal':
The background definitely looks less noisy. |
It might even also make sense to use different Cblk values for the background and the foreground, I can imagine. For the background we maybe don't need the regions to be that small, but for the foreground we likely do want that. |
I think in general this looks like it can offer an improvement, but I'll need to think how it can be integrated properly, maybe it's time to offer some encoding "profiles", so that people can pick without having to fiddle with the exact OpenJPEG flags, or kakadu rates, etc... |
I experimented with didjvu via c44 a while ago. jwilk/didjvu#19 With subsample ratios 3 to 5 for the background picture, c44 was able to almost clear the background (I guess by using the patented vector estimation to filter out the surrounding fuzz that results from partial pixels with a color between the foreground and background). The patent has expired and c44 is open source. So there might be another option for clearing the background fuzz.
|
@rmast - could you share some command lines to go from a tif/png/pgm/etc background image (before I "optimise" them, but after "removing" the foreground) to a djvu component which is then converted back to png? That would ease testing. |
The ROI encoding I think ought to be useful in any case (at least in theory). |
I looked into the working of didjvu calling c44 with masks. It appears to use mask.erode and mask.dilate before calling c44 for the fore- and background to clear the foreground and background-fuzz, so I was probably wrong in assuming c44 does the trick.
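The idea behind eroding and dilating the mask can be shown with a small sketch. It does not reproduce didjvu's code or its parameters: the 3x3 neighbourhood and the function names are assumptions. Dilating the text mask before cutting the text out of the background also removes the fuzzy edge pixels around each glyph, while eroding it before sampling the foreground keeps only solid ink pixels.

```python
def _neighbourhood(mask, y, x):
    """Values in the 3x3 window around (y, x), clipped at the image border."""
    height, width = len(mask), len(mask[0])
    return [
        mask[j][i]
        for j in range(max(0, y - 1), min(height, y + 2))
        for i in range(max(0, x - 1), min(width, x + 2))
    ]


def dilate(mask):
    """Grow the mask: a pixel is set if any pixel in its window is set."""
    return [
        [1 if any(_neighbourhood(mask, y, x)) else 0 for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def erode(mask):
    """Shrink the mask: a pixel stays set only if its whole window is set."""
    return [
        [1 if all(_neighbourhood(mask, y, x)) else 0 for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


# A single text pixel in the middle of a 5x5 page.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
grown = dilate(mask)    # 3x3 block: pixels to drop from the background
shrunk = erode(grown)   # back to the single centre pixel
```

With real scans, a library routine (for example from an image processing package) would be faster, but the logic is the same.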
|
Could you provide some literal steps I could try to reproduce what the didjvu stuff does for the background generation? |
BTW: this branch contains a bunch of test images (and hocr files to go along with them) in case you wanted to try the djvu stuff on other examples: https://github.com/internetarchive/archive-pdf-tools/tree/tests/tests |
It's as if you can read my mind about those test-cases, I looked up your above example to replay some scenarios. This is the default result of DjVuSolo3.1/DjVuToy, the text unfortunately isn't correctly thresholded: This unfortunately has no command-line replay steps. This is the result of didjvu followed by DjVuToy to make a djvu back into a PDF: The first step is done by In the original the text from the other side of the page shines through. The best thresholding-algorithm in my opinion would just wipe the other side, but leave the letters on top complete without dents, and still not glue letters together ending up in just a bitonal image instead of a MRC picture. Usually finding the best threshold appears to be a manual selection process, however I saw some attempts involving a GPU to do some heavy AI on it, already using knowledge of characters to do the thresholding. I am now distracted by getting a better bitonal picture from this example. It's quite a difficult example for thresholding to a clear bitonal picture of the intended print on one side of the paper. |
Yeah, there are some ways to improve the sim_english scenario, but they then again cause issues with other scenarios. The DjvuToy background looks good, but it does seem like it might mess up images a bit more. |
For example this binarizer is tempting to try, despite the somewhat open characters in the example-result: https://github.com/NVlabs/ocropus3-ocrobin
|
There actually is an ocropus 4 in the works which is likely going to be faster/better: https://github.com/ocropus/ocropus4 - I've talked about it with Tom in the past, but I haven't been able to dedicate my time to help out too much. At this point, should we create a separate issue for trying to figure out how the DjVu implementation can maybe help? Or maybe scantailor can? Maybe it makes sense to make an issue with an overview of the various other projects. |
I guess we should look for some good examples that will really benefit from MRC.
Your covid-manually-filled-health form is a good example.
I also tend to take letters with a logo and an autograph, however I should anonymize some.
Black and white examples are an invitation for other approaches.
I tried scantailor universal and advanced today on your english example, but wasn't satisfied with the offered binarization. Even a Gimp binarizing/smoothing filter from the diybookscanner.org/forum didn't satisfy me, although it was the first time it gave a result with default settings. When binarizing, I keep seeing thin strokes in characters disappear, which leaves dents in the characters.
|
I now tried |
If you look at the Google-scan of the same book then Google alternates bitonal and greyscale-pages where images are visible:
https://babel.hathitrust.org/cgi/pt?id=mdp.39015056059697&view=1up&seq=145&q1=gainsborough
|
Opening a new issue as requested.
Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA
128.tif & 188.tif - original cleaned up 600dpi scans
*-scantailor.tif - 600dpi mixed output with bitonal text and color photos, as autodetected
*-scantailor-pdfbeads.pdf - above .tif split into two layers, with the text layer JBIG2-encoded and the background layer JP2-encoded and downsampled to 150dpi, and everything assembled into a PDF using pdfbeads
*.jp2 - some compressed versions of the original, forgot the settings. Page 128 is almost half the size of the PDFs, so I assume PDF sizes can be slightly improved.
The folders have some residual files.
ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.) Can MRC output get to be anything comparable to the PDFs at the same or lower size? I'm also curious whether it can be achieved directly from the original cleaned up scan or whether the ScanTailor mixed output step is still advised.