tesseract.js: Pure JavaScript OCR for 62 Languages

This might be relevant for us since we're often managing customer documents in our apps.

I played around with the library and this is what I found:

  • A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
  • It detected maybe 95% of the text flawlessly. It has difficulties with underlined text or tight table borders.
  • When you feed it an image that contains objects other than text, it will detect the text but also produce a lot of false positives. So be prepared to offer some additional UI to select relevant portions before or after OCR.
  • If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
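
As a rough sketch of the "select relevant portions after OCR" idea: assuming each recognized word arrives as an object with `text`, `confidence` (0 to 100) and a `bbox` of `x0`/`y0`/`x1`/`y1` pixel coordinates (this shape is an assumption, so check the result format of your tesseract.js version), you could drop low-confidence words and everything outside a rectangle the user selected. `selectWords`, `minConfidence` and `region` are hypothetical names made up for this example.

```javascript
// Hypothetical helper, not part of tesseract.js.
// Keeps words that are confident enough and lie fully inside an optional region.
// Assumed word shape: { text, confidence, bbox: { x0, y0, x1, y1 } }
function selectWords(words, { minConfidence = 60, region = null } = {}) {
  return words.filter((word) => {
    // Likely false positives tend to come with low confidence values.
    if (word.confidence < minConfidence) return false;
    // Without a region, the confidence check is the only filter.
    if (!region) return true;
    const { x0, y0, x1, y1 } = word.bbox;
    return x0 >= region.x0 && y0 >= region.y0 &&
           x1 <= region.x1 && y1 <= region.y1;
  });
}

// Usage with made-up data: only "Invoice" is confident and inside the region.
const words = [
  { text: 'Invoice', confidence: 93, bbox: { x0: 10, y0: 10, x1: 80, y1: 30 } },
  { text: '~|_',     confidence: 21, bbox: { x0: 12, y0: 40, x1: 40, y1: 55 } },
  { text: 'Logo',    confidence: 88, bbox: { x0: 300, y0: 5, x1: 360, y1: 25 } },
];
const kept = selectWords(words, { region: { x0: 0, y0: 0, x1: 200, y1: 200 } });
console.log(kept.map((word) => word.text));
```

The threshold of 60 is an arbitrary starting point; you would need to tune it against your own documents.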

Tesseract can detect words from the following sources (from the README):

  • an <img>, <video>, or <canvas> element
  • a CanvasRenderingContext2D (returned by canvas.getContext('2d'))
  • a File object (from a file or drag-drop event)
  • a Blob object
  • an ImageData instance (an object containing width, height and data properties)
  • a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
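
A minimal sketch of how any of these sources might be passed to the library. It assumes the promise-based `recognize(image, lang)` call of more recent tesseract.js releases, which resolves to an object with the text under `data.text`; the API of the version I tried back then may differ. `extractText` is a hypothetical wrapper, and the recognizer is passed in as an argument so the function can be tried without loading the library.

```javascript
// Hypothetical wrapper, not part of tesseract.js.
// `tesseract` is assumed to expose recognize(image, lang) returning a Promise
// that resolves to { data: { text } }.
// `image` may be any of the sources listed above (element, File, Blob, URL, ...).
async function extractText(tesseract, image, lang = 'eng') {
  const result = await tesseract.recognize(image, lang);
  // OCR output usually ends with trailing newlines, so trim it.
  return result.data.text.trim();
}

// Stand-in recognizer, only here to show the expected shape of the result.
const fakeTesseract = {
  recognize: async (image, lang) => ({ data: { text: `Dear customer (${lang})\n\n` } }),
};

extractText(fakeTesseract, 'scan.jpg').then((text) => console.log(text));
```

In an app you would pass the real `Tesseract` object instead of `fakeTesseract`, for example `extractText(Tesseract, fileInput.files[0])`.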
License
Source code in this card is licensed under the MIT License.
Posted by Henning Koch to makandra dev (2016-10-18 07:05)