correct ratio determination for noise estimation #53

rmast · 2022-06-25T17:46:44Z

I solved issue #52 myself.

This reverts commit 1bf9bce.

MerlijnWajer · 2022-06-26T10:29:35Z

Thanks -- I will review this tonight or tomorrow at the latest, I'm mostly on the road today.

rmast · 2022-06-26T11:03:06Z

The second commit is for solving this error:
#55 (comment)

MerlijnWajer · 2022-11-21T14:07:33Z

btw, I think I fixed this in 3c20a46 - can you confirm?

rmast · 2022-11-30T09:59:15Z

> btw, I think I fixed this in 3c20a46 - can you confirm?

I haven't set everything up again and retested, but I read through the issues to see what we were trying to solve.
In the text of #52, namely #52 (comment), I see an inline patch to mrc.py for the inversion that I don't see reflected in that commit. So I can imagine that not all inversion cases are handled correctly.

The issue with the double text (Array) is caused by a segmentation bug in Tesseract, which I tried to crack during my summer holiday. However, there is too little testing capacity and core knowledge at Tesseract to allow the core changes needed to repair this segmentation, which has let the superior EasyOCR segmentation emerge.

At the end of my summer holiday this year I tried to build a completely new inversion based on the EasyOCR segmentation, with an algorithm that compares the inner color and the outer color of the found segments to make the inversion choice. Unfortunately I didn't have the time to mold it into a working product.
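
To make the inner/outer color comparison concrete, here is a minimal sketch of the idea, not the unfinished code itself. It assumes a grayscale page as a NumPy array and a segment given as an `(x0, y0, x1, y1)` box; the function name `needs_inversion`, the box format and the `margin` parameter are all hypothetical and not taken from mrc.py or EasyOCR.

```python
import numpy as np


def needs_inversion(gray, box, margin=3):
    """Guess whether a segment holds light text on a dark background.

    gray   : 2-D array of grayscale values (0 = black, 255 = white)
    box    : (x0, y0, x1, y1) segment bounds, x1/y1 exclusive (assumed format)
    margin : width in pixels of the ring around the box used as "outer" sample
    """
    x0, y0, x1, y1 = box
    h, w = gray.shape

    # Expand the box by `margin`, clipped to the image bounds.
    ox0, oy0 = max(x0 - margin, 0), max(y0 - margin, 0)
    ox1, oy1 = min(x1 + margin, w), min(y1 + margin, h)

    inner = gray[y0:y1, x0:x1].astype(np.float64)
    outer = gray[oy0:oy1, ox0:ox1].astype(np.float64)

    # The ring is the expanded box minus the inner box.
    ring_count = outer.size - inner.size
    if inner.size == 0 or ring_count == 0:
        return False  # nothing to compare, leave the segment untouched

    inner_mean = inner.mean()
    ring_mean = (outer.sum() - inner.sum()) / ring_count

    # Text pixels pull the inner mean away from the background color.
    # If the inside is brighter than the surrounding ring, the text is
    # presumably light on dark, so the segment is a candidate for inversion.
    return inner_mean > ring_mean
```

A plain mean comparison is only one possible decision rule; a real inverter would probably need something more robust (medians, or a threshold on the difference) for noisy scans.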

rmast · 2022-12-30T14:30:02Z

This Christmas holiday my attention is distracted by the new AI programming capabilities of OpenAI Codex, riding on the ChatGPT hype. As I'm really bad at Cython programming, I'm trying to let Codex make my code for a new context-sensitive inverter consistent and improve it. I wonder whether there is another approach for interpreting and segmenting documents at a more intelligent level: https://x-decoder-vl.github.io/

MerlijnWajer and others added 10 commits May 24, 2022 17:13

wip: add pdf-metadata-json and pdf-to-imagestack tools

4480e6f

pdf-metadata-json: use estimated_ppi, not dpi

19e0a1b

requirements: pull in latest hocr for pdf-to-hocr

0f957d8

setup: add extra scripts

e67a05f

add bin/compress-pdf-images

50d0bb0

wip/removeme: default to pillow encoding for test purposes

1bf9bce

add bin/pdfcomp

224ebd1

Revert "wip/removeme: default to pillow encoding for test purposes"

5df0dc3

This reverts commit 1bf9bce.

pdf-metadata-json: some improvements

a1137fe

correct ratio determination for noise estimation

2069bc0

pdf-metadata-json allow 'array' as Filter-key typ

150b1cf

MerlijnWajer force-pushed the pdf-metadata-tooling branch from a1137fe to e45f84b Compare September 3, 2022 05:42
