When an LLM has vision capabilities, you can attach Base64-encoded images to chat messages and it will load them into its context for analysis.
Text tokens vs. image tokens
Modern LLMs process text and images very differently. Text is split into tokens: individual words or word fragments, each taking up little context space. A page of text might produce a few hundred tokens.
Images, on the other hand, are divided into a grid of patches (e.g. 14x14 pixels), where each patch counts as its own token. A typical 1000x1000 pixel image easily produces several thousand tokens. The token count for images scales with the pixel count. Larger images mean more tokens, longer processing time, and higher cost.
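As a rough sketch of that scaling, you can estimate an image's token count from its dimensions. The 14×14 patch size below is an illustrative assumption; real models use different patch sizes and may tile or downscale images first.

```ruby
# Rough estimate of image tokens for a patch-based vision model.
# PATCH_SIZE is an assumption for illustration; actual values vary by model.
PATCH_SIZE = 14

def estimated_image_tokens(width, height, patch_size: PATCH_SIZE)
  patches_x = (width.to_f / patch_size).ceil
  patches_y = (height.to_f / patch_size).ceil
  patches_x * patches_y
end

estimated_image_tokens(1000, 1000) # => 5184 (a 72 x 72 patch grid)
```

Compare that to the few hundred tokens a page of plain text produces: a single unoptimized image can dominate your context window.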
Scale images down
Scaling images down to a maximum dimension is almost always a good idea. Scale down too far, though, and quality will suffer, depending on the task:
- For classification tasks (e.g. "Does this image contain text?"), medium-sized thumbnails are usually enough.
- For OCR or information extraction, aim for 1200-2000px so text remains readable.
To convert PDF pages to scaled-down images:
pdftoppm -png -scale-to 1500 input.pdf output_prefix
Convert images to thumbnails:
vipsthumbnail input.png --size 1500x1500 -o output.png
Trim whitespace
Most document pages have generous margins. Removing this whitespace saves tokens without losing information.
One-liner with libvips (for ImageMagick see this card):
vips crop input.png output.png $(vips find_trim input.png --background 255 --threshold 10)
This finds the bounding box of non-white content and crops to it. --threshold 10 tolerates slight color deviations from the background. To add padding after trimming (so content doesn't stick to the edge), pipe through vips embed or handle it programmatically like the Ruby solution below. Note that this will not trim anything if the background has a darker color.
Note
Vision models process images in a fixed grid of patches. If only a few pixels are removed at the edges, the effect on token count can be minimal, as patches are padded internally to fill the grid. But since pixel count scales with width x height, autocrop is usually still worth it.
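To see why, assume a 14×14 px patch grid (an illustrative number; real models differ):

```ruby
# Patch counts under an assumed 14x14 patch grid (illustrative only)
patches = ->(px) { (px / 14.0).ceil }

patches.call(1008) * patches.call(1008) # => 5184 patches
patches.call(1000) * patches.call(1000) # => 5184 - trimming 8px changed nothing
patches.call(504)  * patches.call(504)  # => 1296 - halving dimensions quarters it
```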
Crop PDF pages without their page numbers
PDF pages often have large whitespace areas, except for a small page number in a corner. Naive trimming algorithms like the one above treat this page number as content and thus barely crop the page.
A more advanced solution splits the page into vertical content blocks and identifies isolated small artifacts (headers, footers, page numbers). Here is an approach we use in production with ruby-vips.
Proposed solution
The core idea: a naive find_trim crops to the outermost non-white pixel — which may be a page number in the corner, preventing a useful crop. This algorithm works around it.
First, find_trim gets the tightest possible bounding box and we crop to it. Then, within that box, we project all ink pixels onto the Y-axis to get a 1D "ink density" array — one value per row, representing how much content that row contains. Rows of zero density are whitespace; stretches of non-zero density are content.
We scan this array and group rows into vertical content blocks, split by any whitespace gap taller than 10% of the page. If the first or last block is smaller than 10% of the page height, we treat it as an artifact (page number, running header) and drop it. Finally, a second find_trim pass on the remaining blocks tightens up any whitespace newly exposed by the removal.
Implementation example
require 'vips'

module Image
  class Autocrop
    def self.perform(image, threshold: 40, padding: 20)
      # Step 1: Native VIPS find_trim for the tightest bounding box
      begin
        content_left, content_top, content_width, content_height = image.find_trim(threshold:)
      rescue Vips::Error
        return # Blank page
      end
      return if content_width < 10 || content_height < 10

      trimmed_image = image.crop(content_left, content_top, content_width, content_height)

      # Step 2: Flatten to B&W mask and project onto Y-axis.
      # Yields a 1D array of "ink density" per row.
      ink_pixels_mask = trimmed_image.colourspace('b-w').extract_band(0) < (255 - threshold)
      _, y_axis_projection = ink_pixels_mask.project
      row_ink_densities = y_axis_projection.to_a.flatten

      # Step 3: Group rows into content blocks separated by significant whitespace
      minimum_whitespace_gap_height = (image.height * 0.10).to_i # 10% of total page height
      maximum_artifact_block_height = (image.height * 0.10).to_i # 10% of total page height
      significant_trim_threshold = (image.height * 0.15).to_i # if we already trimmed 15% whitespace on one side, we stop looking for page numbers etc. as it could be content

      content_blocks = find_content_blocks(row_ink_densities, minimum_whitespace_gap_height, content_height)
      if content_blocks.empty?
        return apply_padding(image, content_left, content_top, content_width, content_height, padding)
      end

      # Step 4: Remove small isolated blocks at start/end (headers/footers/page numbers)
      blocks_removed = false

      # Remove header if it's an isolated, small block AND we haven't already trimmed a lot of top whitespace
      if content_top < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.first[:height] < maximum_artifact_block_height
        content_blocks.shift
        blocks_removed = true
      end

      # Remove footer if it's an isolated, small block AND we haven't already trimmed a lot of bottom whitespace
      bottom_whitespace = image.height - (content_top + content_height)
      if bottom_whitespace < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.last[:height] < maximum_artifact_block_height
        content_blocks.pop
        blocks_removed = true
      end

      if blocks_removed
        # Step 5: Second find_trim pass on remaining content
        content_start_y = content_blocks.first[:start_y]
        content_end_y = content_blocks.last[:end_y]
        main_content = trimmed_image.crop(0, content_start_y, content_width, content_end_y - content_start_y + 1)
        begin
          # Run a second pass of find_trim to tighten up any newly exposed whitespace
          inner_left, inner_top, inner_width, inner_height = main_content.find_trim(threshold:)
        rescue Vips::Error
          return # Now it's a blank page
        end
        # Map the inner coordinates back to absolute coordinates on the original image
        final_left = content_left + inner_left
        final_top = content_top + content_start_y + inner_top
      else
        # Skip the second trim entirely if we didn't remove any artifact blocks
        final_left = content_left
        final_top = content_top
        inner_width = content_width
        inner_height = content_height
      end

      apply_padding(image, final_left, final_top, inner_width, inner_height, padding)
    end

    def self.find_content_blocks(row_ink_densities, gap_height, total_height)
      blocks = []
      block_start = nil
      gap_start = nil
      row_ink_densities.each_with_index do |density, y|
        if density > 0
          if block_start.nil?
            block_start = y
          elsif gap_start && (y - gap_start) > gap_height
            blocks << { start_y: block_start, end_y: gap_start - 1, height: gap_start - block_start }
            block_start = y
          end
          gap_start = nil
        else
          gap_start ||= y
        end
      end
      if block_start
        end_y = gap_start ? gap_start - 1 : total_height - 1
        blocks << { start_y: block_start, end_y: end_y, height: end_y - block_start + 1 }
      end
      blocks
    end

    def self.apply_padding(original_image, left, top, width, height, padding)
      pad_left = [0, left - padding].max
      pad_top = [0, top - padding].max
      pad_width = [original_image.width - pad_left, width + (left - pad_left) + padding].min
      pad_height = [original_image.height - pad_top, height + (top - pad_top) + padding].min
      original_image.crop(pad_left, pad_top, pad_width, pad_height)
    end

    # Note: a bare `private` would not affect `def self.` methods
    private_class_method :find_content_blocks, :apply_padding
  end
end
Usage example
require 'vips'
image = Vips::Image.new_from_file('input.png')
image = image.thumbnail_image(1500, height: 1500, size: :down)
cropped = Image::Autocrop.perform(image)
cropped&.write_to_file('output.png')
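To tie the pieces together for a whole PDF, a sketch (assumes pdftoppm is installed and the Image::Autocrop class from above is loaded; filenames are placeholders):

```ruby
require 'vips'

# Convert each PDF page to a PNG no larger than 1500px, then autocrop every page
system('pdftoppm', '-png', '-scale-to', '1500', 'input.pdf', 'page') or raise 'pdftoppm failed'

Dir.glob('page-*.png').sort.each do |path|
  image = Vips::Image.new_from_file(path)
  cropped = Image::Autocrop.perform(image)
  cropped&.write_to_file(path.sub('.png', '.cropped.png')) # nil for blank pages
end
```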
Use at your own risk
This works well for typical document pages with body text. Pages with scattered content (infographics, posters) may lose relevant information. Test with representative samples.