tesseract.js: Pure Javascript OCR for 62 Languages

Henning Koch
October 18, 2016
Software engineer at makandra GmbH

This might be relevant for us since we're often managing customer documents in our apps.

I played around with the library and this is what I found:

  • A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
  • It recognized roughly 95% of the text flawlessly. It has difficulties with underlined text and tight table borders.
  • When you feed it an image that contains objects other than text, it will detect the text, but also produce a lot of false positives. So be prepared to offer some additional UI to select relevant portions before or after OCR.
  • If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
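
The "select relevant portions after OCR" idea could look roughly like the sketch below. It is only an illustration: `filterWords`, the `region` rectangle and the confidence threshold of 60 are hypothetical, and the word shape (`text`, `confidence`, `bbox` with `x0`/`y0`/`x1`/`y1`) is an assumption about what the OCR result contains, so check it against the library version you use.

```javascript
// Hypothetical helper: keep only OCR words that lie fully inside a
// user-selected rectangle and reach a minimum confidence.
// Assumed word shape: { text, confidence, bbox: { x0, y0, x1, y1 } }.
function filterWords(words, region, minConfidence = 60) {
  return words.filter((word) => {
    const { x0, y0, x1, y1 } = word.bbox;
    const inside =
      x0 >= region.x0 && y0 >= region.y0 &&
      x1 <= region.x1 && y1 <= region.y1;
    return inside && word.confidence >= minConfidence;
  });
}

// Made-up sample data, not real OCR output.
const words = [
  { text: 'Invoice', confidence: 93, bbox: { x0: 40, y0: 30, x1: 160, y1: 60 } },
  { text: '~|', confidence: 21, bbox: { x0: 50, y0: 35, x1: 70, y1: 55 } },
  { text: 'Logo', confidence: 88, bbox: { x0: 500, y0: 10, x1: 580, y1: 50 } },
];

// Rectangle the user dragged over the image (pixel coordinates).
const selection = { x0: 0, y0: 0, x1: 300, y1: 200 };

console.log(filterWords(words, selection).map((w) => w.text).join(' '));
// prints "Invoice"
```

The low-confidence fragment and the word outside the selection are both dropped. A real UI would probably let the user adjust the threshold.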

tesseract.js can detect words from the following sources (from the README):

  • an <img>, <video>, or <canvas> element
  • a CanvasRenderingContext2D (returned by canvas.getContext('2d'))
  • a File object (from a file or drag-drop event)
  • a Blob object
  • an ImageData instance (an object containing width, height and data properties)
  • a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
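
Any of the sources above can be handed to the recognizer. The following is a minimal sketch, not the library's documented usage: `extractText` and its `recognize` parameter are hypothetical names, and the recognizer is passed in so the snippet does not depend on a particular tesseract.js release. It assumes the result exposes the recognized string either as `result.text` or as `result.data.text`, which may not match the version you use.

```javascript
// Sketch: run an injected recognizer on an image source and return plain text.
// `recognize` is expected to return a promise (or thenable) for the OCR result.
async function extractText(recognize, source, lang = 'eng') {
  const result = await recognize(source, lang);
  // Assumption: the text lives in `result.data.text` or in `result.text`.
  const data = (result && result.data) || result || {};
  return String(data.text || '').trim();
}

// Stand-in recognizer so the sketch runs without the library.
const fakeRecognize = async (source, lang) => ({
  data: { text: ' Hello world \n' },
});

extractText(fakeRecognize, 'letter.jpg').then((text) => console.log(text));
// prints "Hello world"
```

In a browser with tesseract.js loaded, a call might look like `extractText((src, lang) => Tesseract.recognize(src, lang), fileInput.files[0])`, where `fileInput` is a hypothetical `<input type="file">` element.
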
Posted by Henning Koch to makandra dev (2016-10-18 09:05)