This might be relevant for us since we're often managing customer documents in our apps.
I played around with the library and this is what I found:
- A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked (see the sketch after this list).
- It detected maybe 95% of the text flawlessly. It had difficulties with underlined text and text inside tight table borders.
- When you feed it an image that contains objects other than text, it will detect the text, but also a lot of false positives. So be prepared to offer some additional UI to select the relevant portions before or after OCR.
- If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
 
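Here is a minimal sketch of the basic call. Note that method names and the shape of the result object have changed between Tesseract.js versions, so treat this as illustrative rather than exact; the `#scan` selector and the result handling are my own assumptions, not part of the library's README.

```ts
import Tesseract from 'tesseract.js';

// Minimal sketch: OCR an <img> element that is already in the DOM.
// Tesseract.js does the recognition in its own Web Worker, so the
// main (rendering) thread stays responsive while this runs.
const image = document.querySelector('#scan') as HTMLImageElement; // hypothetical element id

Tesseract.recognize(image, 'eng').then(({ data }) => {
  // data.text is the recognized plain text; data.words also carries
  // per-word bounding boxes and confidences you could use to filter
  // out the false positives mentioned above.
  console.log(data.text);
});
```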
Tesseract can detect words from the following sources (from the README):
- an `<img>`, `<video>`, or `<canvas>` element
- a `CanvasRenderingContext2D` (returned by `canvas.getContext('2d')`)
- a `File` object (from a file `<input>` or drag-drop event)
- a `Blob` object
- an `ImageData` instance (an object containing width, height and data properties)
- a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
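
As an example of combining two of these source types with the "select relevant portions" advice above, here is a sketch that copies a region of an uploaded file onto a canvas and runs OCR only on that region. The element id, the hard-coded crop rectangle and the result handling are assumptions for illustration; a real UI would let the user drag out the rectangle over a preview.

```ts
import Tesseract from 'tesseract.js';

// Sketch: OCR only a selected region of an uploaded image.
const input = document.querySelector('#upload') as HTMLInputElement; // hypothetical file input

input.addEventListener('change', async () => {
  const file = input.files?.[0];
  if (!file) return;

  // Load the uploaded File into an image element.
  const image = new Image();
  image.src = URL.createObjectURL(file);
  await image.decode();

  // Copy the region of interest onto an off-screen canvas.
  const crop = { x: 100, y: 200, width: 800, height: 300 }; // assumed user selection
  const canvas = document.createElement('canvas');
  canvas.width = crop.width;
  canvas.height = crop.height;
  const context = canvas.getContext('2d')!;
  context.drawImage(image, crop.x, crop.y, crop.width, crop.height, 0, 0, crop.width, crop.height);

  // The canvas is one of the accepted source types; the File itself
  // could also be passed directly if no cropping is needed.
  const { data } = await Tesseract.recognize(canvas, 'eng');
  console.log(data.text);
});
```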
 
Posted by Henning Koch to makandra dev (2016-10-18 07:05)