Optimizing images for LLM inference

When an LLM has vision capabilities, you can attach Base64-encoded images to chat messages, and it will load them into its context for analysis.

Note

For scanned pages that may be sideways or upside down, auto-rotate them first. Autocrop assumes upright images.

Text tokens vs. image tokens

Modern LLMs process text and images very differently. Text is split into tokens, individual words or word fragments that each take up little context space. A page of text might produce a few hundred tokens.

Images, on the other hand, are divided into a grid of patches (e.g. 14x14 pixels), where each patch counts as its own token. A typical 1000x1000 pixel image easily produces several thousand tokens. The token count for images scales with the pixel count. Larger images mean more tokens, longer processing time, and higher cost.
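
As a rough illustration of that scaling, the sketch below estimates the token count under the simplifying assumption of one token per 14x14 patch. The helper name estimated_image_tokens and the default patch size are assumptions made for this example. Real models often resize images first or merge neighbouring patches, so treat the result as an order of magnitude only.

```ruby
# Rough estimate: one token per patch, partially filled patches count fully.
# patch_size: 14 follows the example in the text; actual models differ.
def estimated_image_tokens(width, height, patch_size: 14)
  columns = (width.to_f / patch_size).ceil
  rows    = (height.to_f / patch_size).ceil
  columns * rows
end

estimated_image_tokens(1000, 1000) # => 5184 (72 x 72 patches)
estimated_image_tokens(500, 500)   # => 1296 (36 x 36 patches)
```

Halving both dimensions cuts the estimate to a quarter, which is why scaling down is the first optimization to try.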

Scale images down

Scaling images to a maximum dimension is almost always a good idea. Below a certain size quality drops, though, and where that point lies depends on the task:

For classification tasks (e.g. "Does this image contain text?"), medium-sized thumbnails are usually enough.
For OCR or information extraction, aim for 1200-2000px so text remains readable.

To convert PDF pages to scaled-down images (-scale-to fits each page into a 1500x1500 pixel box):

pdftoppm -png -scale-to 1500 input.pdf output_prefix

Convert images to thumbnails:

vipsthumbnail input.png --size 1500x1500 -o output.png
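
If you resize in application code rather than on the command line, the target size can be computed up front. This is a minimal sketch, assuming you never want to upscale. fit_within is a hypothetical helper, and the 2480x3508 input is only an example size (roughly an A4 page scanned at 300 dpi).

```ruby
# Scale (width, height) so the longer side is at most max_dimension.
# Images that are already small enough are returned unchanged.
def fit_within(width, height, max_dimension)
  scale = [max_dimension.to_f / [width, height].max, 1.0].min
  [(width * scale).round, (height * scale).round]
end

fit_within(2480, 3508, 1500) # => [1060, 1500]
fit_within(800, 600, 1500)   # => [800, 600], no upscaling
```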

Trim whitespace

Most document pages have generous margins. Removing this whitespace saves tokens without losing information.

One-liner with libvips (for ImageMagick see this card)

vips crop input.png output.png $(vips find_trim input.png --background 255 --threshold 10)

This finds the bounding box of non-white content and crops to it. --threshold 10 tolerates slight color deviations from the background. To add padding after trimming (so content doesn't stick to the edge), run vips embed on the result or handle it programmatically like the Ruby solution below. Note that with --background 255 nothing is trimmed if the page background is darker than white.

Note

Vision models process images in a fixed grid of patches. If only a few pixels are removed at the edges, the token count may barely change, because partially filled patches are padded internally to fill the grid. But since the token count scales with width x height, autocropping is usually still worth it.
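
The effect of patch padding can be made concrete with a little arithmetic. The sketch below assumes 14x14 patches and that a partially filled patch costs as much as a full one. patch_grid and patch_count are hypothetical helpers, and the page sizes are made up for illustration.

```ruby
# Number of patch columns and rows needed to cover the image.
def patch_grid(width, height, patch_size: 14)
  [(width.to_f / patch_size).ceil, (height.to_f / patch_size).ceil]
end

# Total patches, i.e. the assumed token count.
def patch_count(width, height, patch_size: 14)
  columns, rows = patch_grid(width, height, patch_size: patch_size)
  columns * rows
end

patch_count(1000, 1400) # => 7200 (72 x 100), uncropped page
patch_count(996, 1396)  # => 7200, trimming 4px per axis changes nothing
patch_count(800, 1100)  # => 4582 (58 x 79), margins removed
```

Only a crop that removes whole rows or columns of patches lowers the count, which is what trimming real page margins typically does.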

Crop PDF pages without their page numbers

PDF pages often have large whitespace areas, except for a small page number in a corner. Naive trimming algorithms like the one above see this page number as content and thus barely crop the page.

A more advanced solution splits the page into vertical content blocks and identifies isolated small artifacts (headers, footers, page numbers). Here is an approach we use in production with ruby-vips.

Proposed solution

The core idea: a naive find_trim crops to the outermost non-white pixel — which may be a page number in the corner, preventing a useful crop. This algorithm works around it.

First, find_trim gets the tightest possible bounding box and we crop to it. Then, within that box, we project all ink pixels onto the Y-axis to get a 1D "ink density" array — one value per row, representing how much content that row contains. Rows of zero density are whitespace; stretches of non-zero density are content.

We scan this array and group rows into vertical content blocks, split by any whitespace gap taller than 10% of the page height. If the first or last block is shorter than 10% of the page height, we treat it as an artifact (page number, running header) and drop it. The exception: if the first find_trim already removed 15% or more of the page height on that side, the block is kept, since it could be real content. Finally, a second find_trim pass on the remaining blocks tightens up any whitespace newly exposed by the removal.
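
To make the projection and grouping steps tangible without libvips, here is a toy sketch on a plain boolean matrix. All names (row_ink_densities, content_blocks, drop_edge_artifacts) and the tiny "page" are assumptions for illustration. It uses absolute row counts instead of percentages and leaves out the 15% guard.

```ruby
# Project onto the Y-axis: one ink count per row (true = ink pixel).
def row_ink_densities(page)
  page.map { |row| row.count(true) }
end

# Group inked rows into blocks, splitting wherever more than `min_gap`
# empty rows separate two inked rows.
def content_blocks(densities, min_gap:)
  inked_rows = densities.each_index.select { |y| densities[y] > 0 }
  inked_rows
    .slice_when { |a, b| b - a - 1 > min_gap }
    .map { |rows| { start_y: rows.first, height: rows.last - rows.first + 1 } }
end

# Drop a small first or last block, but never the only remaining block.
def drop_edge_artifacts(blocks, max_height:)
  blocks = blocks.dup
  blocks.shift if blocks.size > 1 && blocks.first[:height] < max_height
  blocks.pop   if blocks.size > 1 && blocks.last[:height] < max_height
  blocks
end

page = [
  [false, true,  true,  false], # row 0: body text
  [false, true,  false, false], # row 1: body text
  [false, false, false, false], # row 2: whitespace
  [false, false, false, false], # row 3: whitespace
  [false, false, false, true]   # row 4: "page number"
]

densities = row_ink_densities(page)               # => [2, 1, 0, 0, 1]
blocks    = content_blocks(densities, min_gap: 1) # two blocks: rows 0-1 and row 4
drop_edge_artifacts(blocks, max_height: 2)        # => [{ start_y: 0, height: 2 }]
```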

Implementation example

require 'vips'

module Image
  class Autocrop
    def self.perform(image, threshold: 40, padding: 20)
      # Step 1: Native VIPS find_trim for the tightest bounding box
      begin
        content_left, content_top, content_width, content_height = image.find_trim(threshold:)
      rescue Vips::Error
        return # Blank page
      end

      return if content_width < 10 || content_height < 10

      trimmed_image = image.crop(content_left, content_top, content_width, content_height)

      # Step 2: Flatten to B&W mask and project onto Y-axis.
      # Yields a 1D array of "ink density" per row.
      ink_pixels_mask = trimmed_image.colourspace('b-w').extract_band(0) < (255 - threshold)
      _, y_axis_projection = ink_pixels_mask.project
      row_ink_densities = y_axis_projection.to_a.flatten

      # Step 3: Group rows into content blocks separated by significant whitespace
      minimum_whitespace_gap_height = (image.height * 0.10).to_i # 10% of total page height
      maximum_artifact_block_height = (image.height * 0.10).to_i # 10% of total page height
      significant_trim_threshold = (image.height * 0.15).to_i # if we trim 15% whitespace on one side, we no longer search for page numbers etc. as it could be content

      content_blocks = find_content_blocks(row_ink_densities, minimum_whitespace_gap_height, content_height)

      if content_blocks.empty?
        return apply_padding(image, content_left, content_top, content_width, content_height, padding)
      end

      # Step 4: Remove small isolated blocks at start/end (headers/footers/page numbers)
      blocks_removed = false

      # Remove Header if it's an isolated, small block AND we haven't already trimmed a lot of top whitespace
      if content_top < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.first[:height] < maximum_artifact_block_height
        content_blocks.shift
        blocks_removed = true
      end

      # Remove Footer if it's an isolated, small block AND we haven't already trimmed a lot of bottom whitespace
      bottom_whitespace = image.height - (content_top + content_height)
      if bottom_whitespace < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.last[:height] < maximum_artifact_block_height
        content_blocks.pop
        blocks_removed = true
      end

      if blocks_removed
        # Step 5: Second find_trim pass on remaining content
        content_start_y = content_blocks.first[:start_y]
        content_end_y = content_blocks.last[:end_y]
        main_content = trimmed_image.crop(0, content_start_y, content_width, content_end_y - content_start_y + 1)

        begin
          # Run a second pass of find_trim to tighten up any newly exposed whitespace
          inner_left, inner_top, inner_width, inner_height = main_content.find_trim(threshold:)
        rescue Vips::Error
          return # Now it's a blank page
        end

        # Map the inner coordinates back to absolute coordinates on the original image
        final_left = content_left + inner_left
        final_top = content_top + content_start_y + inner_top
      else
        # Skip the second trim entirely if we didn't remove any artifact blocks
        final_left = content_left
        final_top = content_top
        inner_width = content_width
        inner_height = content_height
      end

      apply_padding(image, final_left, final_top, inner_width, inner_height, padding)
    end

    # `private` has no effect on `def self.` methods; mark each helper explicitly.
    private_class_method def self.find_content_blocks(row_ink_densities, gap_height, total_height)
      blocks = []
      block_start = nil
      gap_start = nil

      row_ink_densities.each_with_index do |density, y|
        if density > 0
          if block_start.nil?
            block_start = y
          elsif gap_start && (y - gap_start) > gap_height
            blocks << { start_y: block_start, end_y: gap_start - 1, height: gap_start - block_start }
            block_start = y
          end
          gap_start = nil
        else
          gap_start ||= y
        end
      end

      if block_start
        end_y = gap_start ? gap_start - 1 : total_height - 1
        blocks << { start_y: block_start, end_y: end_y, height: end_y - block_start + 1 }
      end

      blocks
    end

    private_class_method def self.apply_padding(original_image, left, top, width, height, padding)
      pad_left = [0, left - padding].max
      pad_top = [0, top - padding].max
      pad_width = [original_image.width - pad_left, width + (left - pad_left) + padding].min
      pad_height = [original_image.height - pad_top, height + (top - pad_top) + padding].min

      original_image.crop(pad_left, pad_top, pad_width, pad_height)
    end
  end
end

Usage example

require 'vips'
image = Vips::Image.new_from_file('input.png')
image = image.thumbnail_image(1500, height: 1500, size: :down)
cropped = Image::Autocrop.perform(image) # returns nil for blank pages
cropped&.write_to_file('output.png')

Use at your own risk

This works well for typical document pages with body text. Pages with scattered content (infographics, posters) may lose relevant information. Test with representative samples.

Alternative: OCR-driven strip segmentation (split instead of trim)

The autocrop above reduces a page to one tight bounding box - ideal when you send the whole page to the model. Sometimes you want the opposite: cut a page into several crops and run inference per crop. This helps when one page holds many distinct items: a single full-page call forces the model to track all of them at once, while a per-strip call keeps each prompt small and the effective resolution high.

The core idea is the same as the autocrop - group content into vertical blocks separated by whitespace gaps - but the goal flips from trimming to slicing, and we drive it from OCR word boxes instead of an ink-density projection.

How it differs

	Autocrop (above)	OCR-driven slicing
Goal	1 page → 1 tight box	1 page → N strips
Block source	ink-density projected onto the Y axis (OCR-free)	Tesseract word boxes → lines → gaps
Gap threshold	relative (~10% of page height) - only real section breaks	small absolute (~20px) - we want many cuts
Output	cropped page	per-strip crops fed to the model individually

Pipeline

OCR for layout, not text. Run Tesseract (--psm 6) and keep only the word boxes (x, y, w, h, conf). Drop left-margin artifacts (short / low-confidence words near the page edge).
Cluster words into lines by their Y-center, with a tolerance of ~0.55 × median word height.
Pick cuts in the gaps. Compute N evenly-spaced "ideal" cut positions, then snap each to the nearest blank gap between two lines (gap ≥ 20px). Cuts always land in whitespace, so a logical unit is never split mid-line.
Crop each strip with a constant horizontal content box (min/max word X + padding) and the strip's Y range. The horizontal box is effectively a smarter find_trim that ignores margin noise.
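
The line clustering from the second step could be sketched as follows. This is an assumption-laden illustration, not the production code: Word, Line, median and cluster_into_lines are hypothetical names, word boxes are expected as x/y/w/h with y at the top edge, and a word joins the current line if its Y-center is within the tolerance of the previous word's center (a simple chaining rule).

```ruby
Word = Struct.new(:x, :y, :w, :h, :conf, keyword_init: true) do
  def y_center
    y + h / 2.0
  end
end

# A line exposes top/bottom/left/right, the interface the cut logic relies on.
Line = Struct.new(:words) do
  def top
    words.map(&:y).min
  end

  def bottom
    words.map { |word| word.y + word.h }.max
  end

  def left
    words.map(&:x).min
  end

  def right
    words.map { |word| word.x + word.w }.max
  end
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Sort words top to bottom, then start a new line whenever the Y-center
# jumps by more than tolerance_factor x median word height.
def cluster_into_lines(words, tolerance_factor: 0.55)
  return [] if words.empty?

  tolerance = tolerance_factor * median(words.map(&:h))
  words
    .sort_by { |word| [word.y_center, word.x] }
    .slice_when { |a, b| b.y_center - a.y_center > tolerance }
    .map { |line_words| Line.new(line_words.sort_by(&:x)) }
end
```

The resulting Line objects respond to top, bottom, left and right, so they can be passed to the cut and crop helpers shown next.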

# Snap evenly-spaced ideal cuts onto real whitespace gaps.
def pick_cuts(lines, target_strips: 5, min_gap_px: 20)
  gaps = lines.each_cons(2).filter_map do |a, b|
    (a.bottom + b.top) / 2 if (b.top - a.bottom) >= min_gap_px
  end
  return [] if gaps.empty? || target_strips <= 1

  top, height = lines.first.top, lines.last.bottom - lines.first.top
  ideal = (1...target_strips).map { |i| top + height * i / target_strips }
  ideal.map { |y| gaps.min_by { |g| (g - y).abs } }.uniq.sort
end

pick_cuts only returns the Y positions of the cuts. Turning those into actual image crops is two more steps: build the strip boundaries, then crop each strip out of the source PNG. The boundaries are anchored on the first/last line (not the page edges), so the top and bottom margins never end up in a crop:

require "vips"

# A page is a list of OCR'd `lines`, each responding to top/bottom and
# left/right (the min/max X of its words). `image` is the rendered page.
def strips_for(image, lines, padding: 12)
  cuts       = pick_cuts(lines)
  boundaries = [lines.first.top, *cuts, lines.last.bottom]

  # One constant horizontal box for the whole page: tightest column that
  # holds all text, ignoring margin noise. This is the "smart find_trim".
  # Clamp to the image so padding never pushes the crop out of bounds.
  x_left  = [lines.map(&:left).min - padding, 0].max
  x_right = [lines.map(&:right).max + padding, image.width].min

  boundaries.each_cons(2).filter_map do |y0, y1|
    next if lines.none? { |l| l.top >= y0 && l.bottom <= y1 } # skip empty strips

    top    = [y0 - padding, 0].max
    bottom = [y1 + padding, image.height].min
    image.crop(x_left, top, x_right - x_left, bottom - top)
  end
end

Each returned strip is an independent image you feed to the model on its own:

strips_for(image, lines).each do |strip|
  png = strip.write_to_buffer(".png")
  result = run_inference(png)   # one small, high-resolution call per strip
  # …collect results, then stitch them back together in reading order
end

Trade-offs to keep in mind

Depends on OCR. Anything Tesseract can't read - handwriting, stamps, faint or colored ink - is invisible to both the cut logic and the horizontal crop bounds, so it can be clipped. If that content matters, fall back to the ink-density projection from the autocrop as a second source for the bounds; it sees ink the OCR misses.
Header/footer/page numbers are not removed by this slicing step (they end up in the first/last strip). The autocrop's "small leading/trailing block = artifact" rule is worth porting if they pollute your prompts.
Resolution is decoupled from cost. Because each call only sees its strip, you can render the page higher than you would for a single full-page call without blowing up per-call token count.

Michael Leimstädtner

Last edit

2026-06-29

Florian Leinsinger

License

Source code in this card is licensed under the MIT License.

Posted by Michael Leimstädtner to makandra dev (2026-04-17 09:02)