Optimizing images for LLM inference


When an LLM has vision capabilities, you can attach Base64-encoded images to chat messages, and the model loads them into its context for analysis.
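
As a minimal sketch of what attaching an image can look like, the following Ruby snippet wraps image bytes in a Base64 data URL. The helper name `image_data_url` is hypothetical, and where such a string goes in a chat message differs between APIs, so treat this as an illustration rather than a specific provider's format.

```ruby
# Hypothetical helper (not part of any LLM client library):
# wraps raw image bytes in a Base64 data URL.
def image_data_url(bytes, mime_type: 'image/png')
  # pack('m0') is Ruby's built-in Base64 encoding without line breaks,
  # equivalent to Base64.strict_encode64.
  encoded = [bytes].pack('m0')
  "data:#{mime_type};base64,#{encoded}"
end

# With a real file you would pass File.binread('output.png').
# The two bytes "hi" are used here only to show the shape of the result.
puts image_data_url('hi')
```

Base64 inflates the payload by about a third compared to the raw bytes, but what counts against the context is the image's token count, not the length of the encoded string.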

Text tokens vs. image tokens

Modern LLMs process text and images very differently. Text is split into tokens, individual words or word fragments that each take up little context space. A page of text might produce a few hundred tokens.

Images, on the other hand, are divided into a grid of patches (e.g. 14x14 pixels), where each patch counts as its own token. A typical 1000x1000 pixel image easily produces several thousand tokens: at 14x14 pixels per patch that is about 72 x 72 ≈ 5,200 patches, although the exact count depends on the model. The token count scales with the pixel count, so larger images mean more tokens, longer processing time, and higher cost.
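
The patch arithmetic can be sketched in a few lines of Ruby. This is a rough estimate under stated assumptions: one token per patch, a 14x14 patch size, and no model-side resizing, tiling, or patch merging. Real models may do any of these, so use it for relative comparisons only. The method name is made up for this example.

```ruby
# Rough estimate only: assumes one token per patch and no model-side
# resizing, tiling, or patch merging (all of which real models may do).
def estimated_image_tokens(width, height, patch_size: 14)
  columns = (width.to_f / patch_size).ceil # partial patches count as full ones
  rows = (height.to_f / patch_size).ceil
  columns * rows
end

puts estimated_image_tokens(1000, 1000) # 72 * 72 = 5184
puts estimated_image_tokens(500, 500)   # 36 * 36 = 1296
```

Halving both dimensions cuts the estimate to a quarter, which is why scaling down is the most effective lever.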

Scale images down

Scaling images to a maximum dimension is almost always a good idea. Below a certain size, though, result quality drops, and where that happens depends on the task:

  • For classification tasks (e.g. "Does this image contain text?"), medium thumbnails are usually enough.
  • For OCR or information extraction, aim for 1200-2000px so text remains readable.

To convert PDF pages to scaled-down images (-scale-to fits each page into a square box of that many pixels):

pdftoppm -png -scale-to 1500 input.pdf output_prefix

Convert images to thumbnails:

vipsthumbnail input.png --size 1500x1500 -o output.png
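
To see what fitting into a maximum dimension means numerically, here is a small Ruby sketch. The helper `scaled_dimensions` is hypothetical and only mirrors the idea of box fitting. The tools above do this internally, and their rounding may differ by a pixel.

```ruby
# Hypothetical helper: fits width x height into a max_dimension box,
# keeping the aspect ratio and never upscaling.
def scaled_dimensions(width, height, max_dimension)
  scale = [max_dimension.to_f / [width, height].max, 1.0].min
  [(width * scale).round, (height * scale).round]
end

p scaled_dimensions(2480, 3508, 1500) # A4 page rendered at 300 dpi
p scaled_dimensions(800, 600, 1500)   # already small enough, stays unchanged
```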

Trim whitespace

Most document pages have generous margins. Removing this whitespace saves tokens without losing information.

One-liner with libvips (for ImageMagick, see this card):

vips crop input.png output.png $(vips find_trim input.png --background 255 --threshold 10)

This finds the bounding box of non-white content and crops to it. --threshold 10 tolerates slight color deviations from the background. To add padding after trimming (so content doesn't stick to the edge), pipe through vips embed or handle it programmatically like the Ruby solution below. Note that with --background 255 nothing is trimmed if the page background is darker than near-white.

Note

Vision models process images in a fixed grid of patches. If only a few pixels are removed at the edges, the effect on token count can be minimal, as patches are padded internally to fill the grid. But since pixel count scales with width x height, autocrop is usually still worth it.
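
The effect can be illustrated with a small Ruby sketch. It assumes a 14x14 patch size and that partial patches at the edges are rounded up to full ones. Both are assumptions for illustration, and the method names are made up.

```ruby
PATCH_SIZE = 14 # assumed patch size; varies by model

# Number of patches along one side, rounding partial patches up.
def patches_per_side(pixels)
  (pixels.to_f / PATCH_SIZE).ceil
end

def patch_count(width, height)
  patches_per_side(width) * patches_per_side(height)
end

puts patch_count(1000, 1000) # baseline
puts patch_count(995, 995)   # 5 px trimmed per axis: still the same grid
puts patch_count(800, 900)   # real margins trimmed: clearly smaller grid
```

Under these assumptions the first two calls yield the same count (5184), while the third drops to 58 x 65 = 3770 patches.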

Crop PDF pages without their page numbers

PDF pages often have large whitespace areas, except for a small page number in a corner. Naive trimming algorithms like the one above see this page number as content and thus barely crop the page.

A more advanced solution splits the page into vertical content blocks and identifies isolated small artifacts (headers, footers, page numbers). Here is an approach we use in production with ruby-vips.

Proposed solution

The core idea: a naive find_trim crops to the outermost non-white pixel — which may be a page number in the corner, preventing a useful crop. This algorithm works around it.

First, find_trim gets the tightest possible bounding box and we crop to it. Then, within that box, we project all ink pixels onto the Y-axis to get a 1D "ink density" array — one value per row, representing how much content that row contains. Rows of zero density are whitespace; stretches of non-zero density are content.
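
The projection step can be pictured without libvips. The following toy sketch uses a plain Ruby array as the ink mask, and the method name is hypothetical. In the real implementation the mask holds 0/255 values, so the sums are scaled, but only zero versus non-zero matters.

```ruby
# Projecting onto the Y-axis means summing each row of the ink mask.
def project_onto_y_axis(ink_mask)
  ink_mask.map(&:sum)
end

# Toy 6-row "page": 1 = ink pixel, 0 = background.
ink_mask = [
  [0, 1, 1, 1, 0], # body text
  [0, 1, 1, 0, 0], # body text
  [0, 0, 0, 0, 0], # whitespace
  [0, 0, 0, 0, 0], # whitespace
  [0, 0, 0, 0, 0], # whitespace
  [0, 0, 0, 0, 1]  # page number
]

p project_onto_y_axis(ink_mask) # one density value per row
```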

We scan this array and group rows into vertical content blocks, split by any whitespace gap taller than 10% of the page height. If the first or last block is shorter than 10% of the page height, we treat it as an artifact (page number, running header) and drop it, unless the first find_trim already removed 15% or more of the page height on that side, in which case the block could be real content. Finally, a second find_trim pass on the remaining blocks tightens up any whitespace newly exposed by the removal.
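
Grouping and artifact removal can be tried on a toy density array first. This is a simplified, libvips-free sketch with hypothetical helper names. It leaves out the check for how much whitespace the first trim already removed, and it is not the production code.

```ruby
# Groups inked rows into blocks; gaps of up to max_gap rows stay inside a block.
def group_into_blocks(row_ink_densities, max_gap)
  row_ink_densities.each_with_index.with_object([]) do |(density, y), blocks|
    next if density.zero?

    if blocks.any? && y - blocks.last[:end_y] - 1 <= max_gap
      blocks.last[:end_y] = y # small gap: extend the current block
    else
      blocks << { start_y: y, end_y: y } # first ink or large gap: new block
    end
  end
end

# Drops a first or last block shorter than max_artifact_height,
# but never the only remaining block.
def drop_edge_artifacts(blocks, max_artifact_height)
  height = ->(block) { block[:end_y] - block[:start_y] + 1 }
  blocks = blocks.dup
  blocks.shift if blocks.size > 1 && height.(blocks.first) < max_artifact_height
  blocks.pop if blocks.size > 1 && height.(blocks.last) < max_artifact_height
  blocks
end

# 100-row page: 60 rows of body text, a 37-row gap, a 3-row page number.
densities = [5] * 60 + [0] * 37 + [2] * 3
blocks = group_into_blocks(densities, 10) # 10% of the page height
p drop_edge_artifacts(blocks, 10)
```

On this toy page the body text (rows 0 to 59) survives and the 3-row page number at the bottom is dropped.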

Implementation example

require 'vips'

module Image
  class Autocrop
    def self.perform(image, threshold: 40, padding: 20)
      # Step 1: Native VIPS find_trim for the tightest bounding box
      begin
        content_left, content_top, content_width, content_height = image.find_trim(threshold:)
      rescue Vips::Error
        return # Blank page
      end

      return if content_width < 10 || content_height < 10

      trimmed_image = image.crop(content_left, content_top, content_width, content_height)

      # Step 2: Flatten to B&W mask and project onto Y-axis.
      # Yields a 1D array of "ink density" per row.
      ink_pixels_mask = trimmed_image.colourspace('b-w').extract_band(0) < (255 - threshold)
      _, y_axis_projection = ink_pixels_mask.project
      row_ink_densities = y_axis_projection.to_a.flatten

      # Step 3: Group rows into content blocks separated by significant whitespace
      minimum_whitespace_gap_height = (image.height * 0.10).to_i # 10% of total page height
      maximum_artifact_block_height = (image.height * 0.10).to_i # 10% of total page height
      # If find_trim already removed 15% of the page height on one side, we no longer
      # search for page numbers etc. on that side, as the edge block could be content.
      significant_trim_threshold = (image.height * 0.15).to_i

      content_blocks = find_content_blocks(row_ink_densities, minimum_whitespace_gap_height, content_height)

      if content_blocks.empty?
        return apply_padding(image, content_left, content_top, content_width, content_height, padding)
      end

      # Step 4: Remove small isolated blocks at start/end (headers/footers/page numbers)
      blocks_removed = false

      # Remove Header if it's an isolated, small block AND we haven't already trimmed a lot of top whitespace
      if content_top < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.first[:height] < maximum_artifact_block_height
        content_blocks.shift
        blocks_removed = true
      end

      # Remove Footer if it's an isolated, small block AND we haven't already trimmed a lot of bottom whitespace
      bottom_whitespace = image.height - (content_top + content_height)
      if bottom_whitespace < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.last[:height] < maximum_artifact_block_height
        content_blocks.pop
        blocks_removed = true
      end

      if blocks_removed
        # Step 5: Second find_trim pass on remaining content
        content_start_y = content_blocks.first[:start_y]
        content_end_y = content_blocks.last[:end_y]
        main_content = trimmed_image.crop(0, content_start_y, content_width, content_end_y - content_start_y + 1)

        begin
          # Run a second pass of find_trim to tighten up any newly exposed whitespace
          inner_left, inner_top, inner_width, inner_height = main_content.find_trim(threshold:)
        rescue Vips::Error
          return # Now it's a blank page
        end

        # Map the inner coordinates back to absolute coordinates on the original image
        final_left = content_left + inner_left
        final_top = content_top + content_start_y + inner_top
      else
        # Skip the second trim entirely if we didn't remove any artifact blocks
        final_left = content_left
        final_top = content_top
        inner_width = content_width
        inner_height = content_height
      end

      apply_padding(image, final_left, final_top, inner_width, inner_height, padding)
    end

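    # A bare `private` does not apply to `def self.` methods,
    # so the helpers are hidden with private_class_method instead.
    private_class_method def self.find_content_blocks(row_ink_densities, gap_height, total_height)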
      blocks = []
      block_start = nil
      gap_start = nil

      row_ink_densities.each_with_index do |density, y|
        if density > 0
          if block_start.nil?
            block_start = y
          elsif gap_start && (y - gap_start) > gap_height
            blocks << { start_y: block_start, end_y: gap_start - 1, height: gap_start - block_start }
            block_start = y
          end
          gap_start = nil
        else
          gap_start ||= y
        end
      end

      if block_start
        end_y = gap_start ? gap_start - 1 : total_height - 1
        blocks << { start_y: block_start, end_y: end_y, height: end_y - block_start + 1 }
      end

      blocks
    end

    private_class_method def self.apply_padding(original_image, left, top, width, height, padding)
      pad_left = [0, left - padding].max
      pad_top = [0, top - padding].max
      pad_width = [original_image.width - pad_left, width + (left - pad_left) + padding].min
      pad_height = [original_image.height - pad_top, height + (top - pad_top) + padding].min

      original_image.crop(pad_left, pad_top, pad_width, pad_height)
    end
  end
end

Usage example

require 'vips'
image = Vips::Image.new_from_file('input.png')
image = image.thumbnail_image(1500, height: 1500, size: :down)
cropped = Image::Autocrop.perform(image) # returns nil for blank pages
cropped&.write_to_file('output.png')

Use at your own risk

This works well for typical document pages with body text. Pages with scattered content (infographics, posters) may lose relevant information. Test with representative samples.

License
Source code in this card is licensed under the MIT License.
Posted by Michael Leimstädtner to makandra dev (2026-04-17 09:02)