tesseract.js: Pure Javascript OCR for 62 Languages

Henning Koch, Software engineer at makandra GmbH
October 18, 2016

This might be relevant for us since we're often managing customer documents in our apps.

I played around with the library and this is what I found:

  • A 200 DPI scan of an English letter (500 KB JPEG) was processed in ~6 seconds on my desktop PC. It does the heavy lifting in a Web Worker, so your rendering thread isn't blocked.
  • It recognized roughly 95% of the text flawlessly. It has difficulties with underlined text or tight table borders.
  • When you feed it an image that contains objects other than text, it will detect the text, but also produce a lot of false positives. So be prepared to offer some additional UI to select the relevant portions before or after OCR.
  • If the image is distorted by camera perspective, detection is horrible. So be prepared to offer some additional UI to dewarp the image before OCR (maybe using something like imgwarp-js).
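
Selecting relevant portions after OCR could be as simple as throwing away every word outside a rectangle the user drew. This is only a rough sketch: it assumes each recognized word comes with a `text`, a `confidence` (0 to 100) and a `bbox` with `x0`, `y0`, `x1`, `y1` pixel coordinates, which is how I understand tesseract.js reports words. `filterWords`, `region` and `minConfidence` are made-up names, not part of the library.

```javascript
// Hypothetical helper, not part of tesseract.js.
// Keeps only words that lie fully inside `region` and reach `minConfidence`.
// Assumes words look like { text, confidence, bbox: { x0, y0, x1, y1 } }.
function filterWords(words, region, minConfidence = 60) {
  return words.filter((word) => {
    const { x0, y0, x1, y1 } = word.bbox;
    const inside =
      x0 >= region.x0 && y0 >= region.y0 &&
      x1 <= region.x1 && y1 <= region.y1;
    return inside && word.confidence >= minConfidence;
  });
}
```

The default threshold of 60 is an arbitrary starting point; you would tune it against your own scans.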

Tesseract can detect words from the following sources (from the README):

  • an <img>, <video>, or <canvas> element
  • a CanvasRenderingContext2D (returned by canvas.getContext('2d'))
  • a File object (from a file or drag-drop event)
  • a Blob object
  • an ImageData instance (an object containing width, height and data properties)
  • a path or URL to an accessible image (the image must either be hosted locally or accessible by CORS)
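
Whatever source you pass in, the call looks the same. Here is a minimal sketch of a wrapper, with the caveat that the library's call shape has changed between releases, so the recognizer is injected instead of hard-coded. `extractText` is a made-up name, and the two result shapes it handles (`{ text }` and `{ data: { text } }`) are my assumption about older and newer releases; check the README of the version you install.

```javascript
// Hypothetical wrapper, not part of tesseract.js.
// `recognize` is any function that takes an image source and resolves
// with an OCR result, e.g. (src) => Tesseract.recognize(src).
async function extractText(source, recognize) {
  const result = await recognize(source);
  // Assumption: older releases resolve with { text },
  // newer ones with { data: { text } }.
  const text = result.data ? result.data.text : result.text;
  return text.trim();
}
```

In a browser this might be called as `extractText(document.querySelector('img'), (src) => Tesseract.recognize(src))`, assuming the library's global `Tesseract` object is loaded.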
Posted by Henning Koch to makandra dev (2016-10-18 09:05)