Read more

What's so hard about PDF text extraction? ​

Henning Koch
September 28, 2020Software engineer at makandra GmbH

There is a common view that extracting text from a PDF document should not be too difficult. After all, the text is right there in front of our eyes and humans consume PDF content all the time with great success. Why would it be difficult to automatically extract the text data?

Turns out, much how working with human names is difficult due to numerous edge cases and incorrect assumptions, working with PDFs is difficult due to the extreme flexibility given by the PDF format.

Illustration online protection

Rails Long Term Support

Rails LTS provides security patches for old versions of Ruby on Rails (2.3, 3.2, 4.2 and 5.2)

  • Prevents you from data breaches and liability risks
  • Upgrade at your own pace
  • Works with modern Rubies
Read more Show archive.org snapshot
Posted by Henning Koch to makandra dev (2020-09-28 10:14)