This might be relevant for us since we're often managing customer documents in our apps.
I played around with the library and this is what I found:
- A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
- It recognized roughly 95% of the text correctly. It struggles with underlined text and tight table borders.
- When you feed it an image that contains objects other than text, it will detect the text, but also produce a lot of false positives. So be prepared to offer some additional UI to select relevant portions before or after OCR.
- If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
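To illustrate the "select relevant portions after OCR" idea, here is a minimal sketch that keeps only words inside a user-selected rectangle and above a confidence threshold. It assumes each recognized word looks like { text, confidence, bbox: { x0, y0, x1, y1 } } with confidence on a 0 to 100 scale; check the result format of the library version you use. The name filterWords, the threshold of 60 and the sample data are made up for this example.

```javascript
// Assumed word shape (verify against your Tesseract.js version):
// { text, confidence (0-100), bbox: { x0, y0, x1, y1 } }

// Keep only words that lie fully inside `region` and whose
// confidence is at least `minConfidence` (hypothetical default: 60).
function filterWords(words, region, minConfidence = 60) {
  return words.filter((word) => {
    const { x0, y0, x1, y1 } = word.bbox;
    const insideRegion =
      x0 >= region.x0 && y0 >= region.y0 &&
      x1 <= region.x1 && y1 <= region.y1;
    return insideRegion && word.confidence >= minConfidence;
  });
}

// Made-up sample data standing in for an OCR result.
const words = [
  { text: 'Invoice', confidence: 93, bbox: { x0: 40, y0: 30, x1: 180, y1: 60 } },
  { text: '~|', confidence: 21, bbox: { x0: 60, y0: 80, x1: 90, y1: 100 } },
  { text: 'Logo', confidence: 88, bbox: { x0: 600, y0: 20, x1: 700, y1: 70 } },
];

// Rectangle the user dragged over the image, in image pixels.
const selection = { x0: 0, y0: 0, x1: 400, y1: 300 };

console.log(filterWords(words, selection).map((w) => w.text).join(' '));
// prints "Invoice"
```

Filtering after OCR keeps the UI simple (one rectangle drag), but it still pays the full recognition cost. Cropping the image before OCR would be the faster alternative for large images.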
Tesseract can detect words from the following sources (from the README):
- an <img>, <video>, or <canvas> element
- a CanvasRenderingContext2D (returned by canvas.getContext('2d'))
- a File object (from a file <input> or drag-drop event)
- a Blob object
- an ImageData instance (an object containing width, height and data properties)
- a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
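Any of these sources can be passed to the recognition call. The following is a rough sketch, not the library's documented usage: it assumes a recent Tesseract.js release where recognize(image, 'eng') returns a promise that resolves to an object with the text under data.text. Older releases differ, so check the README of the version you install. The wrapper name recognizeText and its second parameter are invented for this example.

```javascript
// `ocr` is expected to behave like the Tesseract.js module object.
// Assumption: ocr.recognize(image, 'eng') resolves to { data: { text } }.
async function recognizeText(image, ocr = globalThis.Tesseract) {
  if (!ocr || typeof ocr.recognize !== 'function') {
    throw new Error('Tesseract.js is not loaded');
  }
  const result = await ocr.recognize(image, 'eng');
  // Trim the surrounding whitespace from the recognized text.
  return result.data.text.trim();
}
```

In the browser you would call it with one of the sources listed above, for example recognizeText(document.querySelector('img')) or recognizeText(fileInput.files[0]). Passing the library in as a parameter also makes the wrapper easy to exercise with a stub in tests.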
Posted by Henning Koch to makandra dev (2016-10-18 07:05)