This might be relevant for us since we're often managing customer documents in our apps.
I played around with the library and this is what I found:
- A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
- It detected maybe 95% of the text flawlessly. It has difficulties with underlined text or tight table borders.
- When you feed it an image that contains objects other than text, it will detect the text, but also produce a lot of false positives. So be prepared to offer some additional UI to select relevant portions before or after OCR.
- If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
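To illustrate the "select relevant portions after OCR" idea, here is a minimal sketch that keeps only the words inside a rectangle the user has drawn and drops low-confidence hits. The helper `wordsInSelection`, the `minConfidence` default of 60 and the sample data are all made up for illustration. I'm also assuming each recognized word looks like `{ text, confidence, bbox: { x0, y0, x1, y1 } }`, so check that against the result format of the library version you actually use.

```javascript
// Hypothetical helper: keep only the words whose bounding box lies fully
// inside the user's selection and whose confidence is high enough.
// Assumed word shape: { text, confidence, bbox: { x0, y0, x1, y1 } }.
function wordsInSelection(words, selection, minConfidence = 60) {
  return words.filter((word) => {
    const { x0, y0, x1, y1 } = word.bbox;
    const inside =
      x0 >= selection.x0 && y0 >= selection.y0 &&
      x1 <= selection.x1 && y1 <= selection.y1;
    return inside && word.confidence >= minConfidence;
  });
}

// Made-up sample data standing in for a real OCR result.
const sampleWords = [
  { text: 'Invoice', confidence: 92, bbox: { x0: 40, y0: 30, x1: 160, y1: 60 } },
  { text: '#@~', confidence: 21, bbox: { x0: 50, y0: 70, x1: 90, y1: 95 } },
  { text: 'Logo', confidence: 88, bbox: { x0: 500, y0: 20, x1: 580, y1: 60 } },
];

// Rectangle the user selected, in image pixel coordinates.
const selection = { x0: 0, y0: 0, x1: 300, y1: 200 };

const selectedText = wordsInSelection(sampleWords, selection)
  .map((word) => word.text)
  .join(' ');

console.log(selectedText); // prints "Invoice"
```

The confidence threshold only catches part of the noise, so the selection rectangle is what really removes false positives from non-text objects. In a real app you would probably let the user adjust both.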
Tesseract can detect words from the following sources (from the README):
- an `<img>`, `<video>`, or `<canvas>` element
- a `CanvasRenderingContext2D` (returned by `canvas.getContext('2d')`)
- a `File` object (from a file `<input>` or drag-drop event)
- a `Blob` object
- an `ImageData` instance (an object containing width, height and data properties)
- a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
Posted by Henning Koch to makandra dev (2016-10-18 07:05)