I recently ran into this issue when processing a massive backlog of documents. The server completely stalled, sometimes taking up to 30 minutes for a single page. It turned out Tika was trying to calculate skew angles and perform OCR on thousands of microscopic layout artifacts.
When extracting text from hybrid PDFs (documents containing both digital text and embedded images like scanned tables), two common issues occur with Apache Tika:
- Duplicate text: Tika extracts the digital text and subsequently sends a screenshot of the entire page to Tesseract (OCR). The resulting output contains the text twice.
- Extreme performance drops: When enabling extractInlineImages="true", Tika extracts every tiny pixel, border, or logo snippet as a separate image and sends it to the OCR engine. In combination with detectAngles="true", this can skyrocket the processing time per page to several minutes (up to 30 mins).
Solution
The fix is an adapted tika-config.xml that prioritizes digital text and still extracts embedded images, but keeps tiny layout elements away from OCR by means of a minimum file size threshold (minFileSizeToOcr).
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude the default PDF and OCR parsers so the configured instances below are used instead. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- PDF parser: prefer digital text, but hand embedded images to OCR. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrTimeout" type="int">120</param>
        <param name="ocrResponseType" type="string">text</param>
        <param name="detectAngles" type="bool">true</param>
      </params>
    </parser>
    <!-- OCR parser: images smaller than 10,000 bytes are skipped. -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">deu+eng</param>
        <param name="enableImageProcessing" type="bool">true</param>
        <param name="minFileSizeToOcr" type="long">10000</param>
      </params>
    </parser>
  </parsers>
  <service-loader initializableProblemHandler="ignore"/>
</properties>
- minFileSizeToOcr="10000": The "gatekeeper" for Tesseract. Layout artifacts, colored bullet points, or tiny logos (under 10,000 bytes, roughly 10 KB) are immediately discarded. Only genuine content (embedded screenshots, scanned tables) reaches Tesseract.
- detectAngles="true": Because the 10 KB filter removes the vast majority of the image junk, the deskew algorithm can safely remain enabled without bringing the server to its knees. It straightens skewed scans and preserves the OCR quality.
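To picture what the threshold does, here is a minimal sketch of a size gate in plain Java. It is a simplified model, not Tika's actual implementation: the class OcrSizeGate and its methods are hypothetical names, and it assumes the threshold is compared against the size of each embedded image in bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a minimum-size gate in front of an OCR engine (hypothetical, not Tika code). */
class OcrSizeGate {

    private final long minFileSizeToOcr;

    OcrSizeGate(long minFileSizeToOcr) {
        this.minFileSizeToOcr = minFileSizeToOcr;
    }

    /** Returns true if an image of the given size (in bytes) should be sent to OCR. */
    boolean shouldOcr(long imageSizeBytes) {
        return imageSizeBytes >= minFileSizeToOcr;
    }

    /** Keeps only the image sizes that pass the gate. */
    List<Long> filter(List<Long> imageSizesBytes) {
        List<Long> accepted = new ArrayList<>();
        for (long size : imageSizesBytes) {
            if (shouldOcr(size)) {
                accepted.add(size);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        OcrSizeGate gate = new OcrSizeGate(10_000L);
        // Made-up sizes in bytes: a bullet icon, a border snippet, a small logo, a scanned table.
        List<Long> sizes = List.of(312L, 1_480L, 8_900L, 245_000L);
        // Only the scanned table passes the gate.
        System.out.println(gate.filter(sizes));
    }
}
```

With a page that carries hundreds of small decorative images and one scanned table, a gate like this reduces the OCR workload to a single call, which is why the expensive per-image steps become affordable again.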
ocrStrategy: auto vs. ocr_and_text_extraction
- auto: If Tika finds at least 10 characters of digital text on the page, it will not create a full-page OCR screenshot. This reliably prevents duplicate text outputs while still allowing embedded images (like tables) to be processed via extractInlineImages.
- ocr_and_text_extraction: This strategy forces Tika to extract the digital text AND perform a full-page OCR scan, regardless of how much text is already present. This can be useful as a fallback if the digital text layer is completely corrupted (e.g., bad font encoding), but it also causes the duplicate text on regular hybrid PDFs. Unless you explicitly need both layers, stick to auto.
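The difference between the two strategies can be sketched as a small decision function. This is a simplified model of the behavior described above, not Tika's code: the class OcrStrategyModel, the enum, and the constant are hypothetical names, and the 10-character threshold is taken from the description above rather than read from Tika.

```java
/** Simplified model of the per-page full-page OCR decision (hypothetical, not Tika code). */
class OcrStrategyModel {

    enum Strategy { AUTO, OCR_AND_TEXT_EXTRACTION }

    /** Assumed threshold: under AUTO, pages with at least this many digital characters skip full-page OCR. */
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /** Returns true if the whole page would be rendered and sent to OCR. */
    static boolean rendersFullPageForOcr(Strategy strategy, int digitalCharsOnPage) {
        switch (strategy) {
            case AUTO:
                // Enough digital text on the page: rely on it and skip the full-page scan.
                return digitalCharsOnPage < MIN_CHARS_TO_SKIP_OCR;
            case OCR_AND_TEXT_EXTRACTION:
                // Always scan the full page, no matter how much digital text exists.
                return true;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        int charsOnHybridPage = 1_800; // made-up character count for a typical hybrid page
        for (Strategy strategy : Strategy.values()) {
            System.out.println(strategy + " -> full-page OCR: "
                    + rendersFullPageForOcr(strategy, charsOnHybridPage));
        }
    }
}
```

For a page with plenty of digital text, only ocr_and_text_extraction triggers the full-page scan in this model, which is the case where the text ends up in the output twice.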