Apache Tika: Performance-optimized configuration for hybrid PDFs (Text + OCR)

I recently ran into both of the issues described below while processing a large backlog of documents. The server stalled completely, sometimes taking up to 30 minutes for a single page. It turned out that Tika was calculating skew angles and running OCR on thousands of microscopic layout artifacts.

When extracting text from hybrid PDFs (documents containing both digital text and embedded images like scanned tables), two common issues occur with Apache Tika:

Duplicate text: Tika extracts the digital text and then sends a rendered image of the entire page to Tesseract (OCR). The resulting output contains the text twice.
Extreme performance drops: With extractInlineImages="true", Tika extracts every tiny pixel, border, or logo snippet as a separate image and sends it to the OCR engine. In combination with detectAngles="true", this can push the processing time per page to several minutes (up to 30 minutes in my case).
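
For comparison, the combination behind the second issue can be sketched as the fragment below. It is an illustration rather than a recommended config: it assumes that TesseractOCRParser is left at its defaults, so no minimum size filter applies and every extracted image is passed to OCR.

```xml
<!-- Sketch of the problematic setup: inline image extraction plus angle detection, -->
<!-- with no size filter on the OCR side (assumption: TesseractOCRParser at defaults). -->
<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <!-- Every embedded image, however small, becomes an OCR candidate -->
        <param name="extractInlineImages" type="bool">true</param>
        <param name="detectAngles" type="bool">true</param>
    </params>
</parser>
```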

Solution

The solution is an adapted tika-config.xml that prioritizes digital text and still extracts embedded images, but keeps tiny layout elements away from OCR with a minimum file size threshold (minFileSizeToOcr).

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>

        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
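                <!-- Extract embedded images (e.g. scanned tables) so they can be sent to OCR -->
                <param name="extractInlineImages" type="bool">true</param>

                <!-- auto: skip the full-page OCR rendering when the page already contains digital text -->
                <param name="ocrStrategy" type="string">auto</param>
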
                <param name="ocrTimeout" type="int">120</param>
                <param name="ocrResponseType" type="string">text</param>
                <param name="detectAngles" type="bool">true</param>
            </params>
        </parser>

        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="language" type="string">deu+eng</param>
                <param name="enableImageProcessing" type="bool">true</param>

                <!-- Skip OCR for images smaller than 10,000 bytes (layout artifacts, tiny logos) -->
                <param name="minFileSizeToOcr" type="long">10000</param>
            </params>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore"/>
</properties>

minFileSizeToOcr="10000": The "gatekeeper" for Tesseract. Images below 10,000 bytes (roughly 10 KB), such as layout artifacts, colored bullet points, or tiny logos, are skipped instead of being sent to OCR. Only genuine content (embedded screenshots, scanned tables) reaches Tesseract.
detectAngles="true": Because the 10 KB filter removes the vast majority of the image junk, the deskew algorithm can remain enabled without bringing the server to its knees. It straightens skewed scans and preserves OCR quality.
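
If processing is still too slow even with the size filter in place, one option to try is switching detectAngles off. The fragment below is a sketch of that fallback, not something tested on the backlog described above. It only shows the changed parameter, and the other PDFParser params from the full config are assumed to stay as they are.

```xml
<!-- Hypothetical fallback: keep inline image extraction, disable angle detection -->
<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="ocrStrategy" type="string">auto</param>
        <!-- Trade-off: skewed scans are no longer straightened before OCR -->
        <param name="detectAngles" type="bool">false</param>
    </params>
</parser>
```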

ocrStrategy: `auto` vs. `ocr_and_text_extraction`

auto: If Tika finds at least 10 characters of digital text on the page, it will not create a full-page OCR screenshot. This reliably prevents duplicate text outputs while still allowing embedded images (like tables) to be processed via extractInlineImages.
ocr_and_text_extraction: This strategy forces Tika to extract the digital text AND perform a full-page OCR scan, regardless of how much text is already present. This can be useful as a fallback if the digital text layer is completely corrupted (e.g., bad font encoding), but it also causes the duplicate text on regular hybrid PDFs. Unless you explicitly need both layers, stick to auto.
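
For the corrupted-text-layer case, switching strategies is a one-line change in the PDFParser block. A minimal sketch, assuming the rest of the config stays as shown in the solution:

```xml
<!-- Sketch: force digital text extraction plus a full-page OCR pass -->
<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <!-- Expect duplicate text on pages whose digital text layer is intact -->
        <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
        <!-- Remaining params (extractInlineImages, detectAngles, ...) as in the full config -->
    </params>
</parser>
```

Because this strategy applies to every page, it is probably best kept in a separate config file that is only used for documents known to have a broken text layer.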

Florian Leinsinger

makandra.de

Last edit

2026-04-14

Florian Leinsinger

License

Source code in this card is licensed under the MIT License.

Posted by Florian Leinsinger to makandra dev (2026-04-14 10:25)
