tesseract.js: Pure JavaScript OCR for 62 Languages

This might be relevant for us since we're often managing customer documents in our apps.

I played around with the library and this is what I found:

  • A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
  • It recognized roughly 95% of the text correctly. It struggles with underlined text and tight table borders.
  • When you feed it an image that contains objects other than text, it will detect the text, but also produce a lot of false positives. So be prepared to offer some additional UI to select relevant portions before or after OCR.
  • If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
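
To illustrate the "select relevant portions after OCR" idea, here is a minimal sketch of how false positives could be filtered once the user has drawn a selection rectangle. It is not part of tesseract.js. The function names `isInside` and `selectWords`, the confidence threshold of 60 and the sample data are all made up for illustration. The word shape (`text`, `confidence`, `bbox` with `x0`, `y0`, `x1`, `y1`) is an assumption based on what recent tesseract.js versions report, so check it against the version you use.

```javascript
// Assumed word shape: { text, confidence (0-100), bbox: { x0, y0, x1, y1 } }.
// This is an assumption about the OCR result, not a guaranteed API.

// True if the word's bounding box lies completely inside the selected region.
function isInside(bbox, region) {
  return bbox.x0 >= region.x0 && bbox.y0 >= region.y0 &&
         bbox.x1 <= region.x1 && bbox.y1 <= region.y1;
}

// Keep only words that are confident enough and inside the user's selection.
function selectWords(words, region, minConfidence = 60) {
  return words.filter(
    (word) => word.confidence >= minConfidence && isInside(word.bbox, region)
  );
}

// Hypothetical sample data standing in for a real OCR result.
const sampleWords = [
  { text: 'Invoice', confidence: 93, bbox: { x0: 40, y0: 30, x1: 160, y1: 60 } },
  { text: '~|_', confidence: 21, bbox: { x0: 50, y0: 80, x1: 90, y1: 100 } },
  { text: 'Logo', confidence: 88, bbox: { x0: 500, y0: 20, x1: 580, y1: 60 } },
];

// Rectangle the user selected in the UI, in image pixel coordinates.
const selection = { x0: 0, y0: 0, x1: 400, y1: 300 };

const selectedText = selectWords(sampleWords, selection)
  .map((word) => word.text)
  .join(' ');
console.log(selectedText);
```

With this sample data only "Invoice" survives: the garbage token is dropped for low confidence and "Logo" lies outside the selection. A sensible threshold probably depends on scan quality, so it may be worth exposing it in the UI as well.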

Tesseract can detect words from the following sources (from the README):

  • an <img>, <video>, or <canvas> element
  • a CanvasRenderingContext2D (returned by canvas.getContext('2d'))
  • a File object (from a file or drag-drop event)
  • a Blob object
  • an ImageData instance (an object containing width, height and data properties)
  • a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
Henning Koch Over 7 years ago