When an LLM has vision capabilities, you can attach Base64-encoded images to chat messages, and the model will load them into its context for analysis.
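As a minimal sketch of the encoding step: the helper below turns raw image bytes into a Base64 data URL. The `image_message` hash is a hypothetical payload shape for illustration only; the real field names depend on your provider's API.

```ruby
# Encode raw image bytes as a Base64 data URL.
# pack("m0") is Ruby's built-in strict Base64 encoding (no line breaks).
def image_data_url(bytes, mime_type: "image/png")
  "data:#{mime_type};base64,#{[bytes].pack('m0')}"
end

# Hypothetical message shape (an assumption, not a real API):
# check your provider's documentation for the actual structure.
def image_message(bytes, prompt)
  {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image", data_url: image_data_url(bytes) },
    ],
  }
end

# Usage:
# image_message(File.binread("page.png"), "Does this image contain text?")
```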
Note
For scanned pages that may be sideways or upside down, auto-rotate them first. Autocrop assumes upright images.
Text tokens vs. image tokens
Modern LLMs process text and images very differently. Text is split into tokens, individual words or word fragments that each take up little context space. A page of text might produce a few hundred tokens.
Images, on the other hand, are divided into a grid of patches (e.g. 14x14 pixels), where each patch counts as its own token. A typical 1000x1000 pixel image easily produces several thousand tokens. The token count for images scales with the pixel count. Larger images mean more tokens, longer processing time, and higher cost.
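To get a feeling for the numbers, here is a rough estimate in Ruby. It assumes one token per 14x14 patch and no internal resizing or tiling; real models often deviate from this, so treat the result as an order of magnitude, not a bill.

```ruby
# Rough token estimate: one token per patch, partial patches rounded up.
# Assumption: the model neither resizes nor tiles the image internally.
def estimated_image_tokens(width, height, patch_size: 14)
  columns = (width.to_f / patch_size).ceil
  rows = (height.to_f / patch_size).ceil
  columns * rows
end

estimated_image_tokens(1000, 1000) # 72 * 72 = 5184
estimated_image_tokens(500, 500)   # 36 * 36 = 1296
```

Halving both dimensions cuts the estimate to a quarter, which is why downscaling pays off so quickly.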
Scale images down
Scaling images to a maximum dimension is almost always a good idea. Below a certain size, result quality drops, and how small you can go depends on the task:
- For classification tasks (e.g. "Does this image contain text?"), medium thumbnails are usually enough.
- For OCR or information extraction, aim for 1200-2000px so text remains readable.
To convert PDF pages to scaled-down images:
pdftoppm -png -scale-to 1500 input.pdf output_prefix
Convert images to thumbnails:
vipsthumbnail input.png --size 1500x1500 -o output.png
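If you want to predict the output size before converting, the arithmetic is simple. This sketch assumes downscale-only behavior (smaller images are left untouched) and preserves the aspect ratio; `fit_within` is a hypothetical helper name.

```ruby
# Dimensions after shrinking an image so its longest side fits max_dimension.
# Assumption: images that already fit are not enlarged.
def fit_within(width, height, max_dimension)
  scale = [max_dimension.to_f / [width, height].max, 1.0].min
  [(width * scale).round, (height * scale).round]
end

fit_within(2480, 3508, 1500) # A4 page at 300 dpi => [1060, 1500]
fit_within(800, 600, 1500)   # already small enough => [800, 600]
```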
Trim whitespace
Most document pages have generous margins. Removing this whitespace saves tokens without losing information.
One-liner with libvips (for ImageMagick, see this card):
vips crop input.png output.png $(vips find_trim input.png --background 255 --threshold 10)
This finds the bounding box of non-white content and crops to it. --threshold 10 tolerates slight color deviations from the background. To add padding after trimming (so content doesn't stick to the edge), run the result through vips embed or handle it programmatically like the Ruby solution below. Note that --background 255 assumes a white page: if the background is darker, nothing will be trimmed.
Note
Vision models process images in a fixed grid of patches. If only a few pixels are removed at the edges, the effect on token count can be minimal, as the image is padded internally to fill whole patches. But since the token count scales with width x height, autocrop is usually still worth it.
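The following sketch illustrates that effect. It assumes 14x14 patches and that partial patches at the edges are padded to full ones; the page sizes are made up for the example.

```ruby
# Number of patch columns and rows, with edge patches padded to full size.
def patch_grid(width, height, patch_size: 14)
  [(width.to_f / patch_size).ceil, (height.to_f / patch_size).ceil]
end

def patch_count(width, height, patch_size: 14)
  patch_grid(width, height, patch_size: patch_size).reduce(:*)
end

patch_count(1500, 1060) # 108 * 76 = 8208, full page
patch_count(1499, 1052) # 108 * 76 = 8208, a few pixels trimmed: same grid
patch_count(1200, 900)  # 86 * 65 = 5590, margins trimmed
```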
Crop PDF pages without their page numbers
PDF pages often have large whitespace areas, except for a small page number in a corner. Naive trimming algorithms like the one above see this page number as content and thus barely crop the page.
A more advanced solution splits the page into vertical content blocks and identifies isolated small artifacts (headers, footers, page numbers). Here is an approach we use in production with ruby-vips.
Proposed solution
The core idea: a naive find_trim crops to the outermost non-white pixel — which may be a page number in the corner, preventing a useful crop. This algorithm works around it.
First, find_trim gets the tightest possible bounding box and we crop to it. Then, within that box, we project all ink pixels onto the Y-axis to get a 1D "ink density" array — one value per row, representing how much content that row contains. Rows of zero density are whitespace; stretches of non-zero density are content.
We scan this array and group rows into vertical content blocks, split by any whitespace gap taller than 10% of the page. If the first or last block is smaller than 10% of the page height, we treat it as an artifact (page number, running header) and drop it. Finally, a second find_trim pass on the remaining blocks tightens up any whitespace newly exposed by the removal.
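Before the full implementation, the grouping idea can be shown on a toy array without any image library. This is a simplified sketch, not the production code: the helper names are hypothetical, and the "page" is just 30 rows of made-up ink densities.

```ruby
# Group rows with ink into blocks. A run of more than `min_gap` empty rows
# starts a new block; shorter gaps (e.g. line spacing) stay inside the block.
def content_blocks(densities, min_gap:)
  blocks = []
  densities.each_with_index do |density, y|
    next if density.zero?

    if blocks.any? && y - blocks.last[:end_y] - 1 <= min_gap
      blocks.last[:end_y] = y # small gap: extend the current block
    else
      blocks << { start_y: y, end_y: y }
    end
  end
  blocks
end

# Drop a small first or last block, treating it as a header or page number.
def drop_edge_artifacts(blocks, max_artifact_height:)
  height = ->(block) { block[:end_y] - block[:start_y] + 1 }
  blocks = blocks.dup
  blocks.shift if blocks.size > 1 && height.call(blocks.first) < max_artifact_height
  blocks.pop if blocks.size > 1 && height.call(blocks.last) < max_artifact_height
  blocks
end

# Toy page of 30 rows: body text in rows 0..19, page number in rows 27..28.
densities = [5] * 20 + [0] * 7 + [3] * 2 + [0]
blocks = content_blocks(densities, min_gap: 3) # 10% of 30 rows
main = drop_edge_artifacts(blocks, max_artifact_height: 3)
# blocks => [{ start_y: 0, end_y: 19 }, { start_y: 27, end_y: 28 }]
# main   => [{ start_y: 0, end_y: 19 }]
```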
Implementation example
require 'vips'

module Image
  class Autocrop
    def self.perform(image, threshold: 40, padding: 20)
      # Step 1: Native VIPS find_trim for the tightest bounding box
      begin
        content_left, content_top, content_width, content_height = image.find_trim(threshold:)
      rescue Vips::Error
        return # Blank page
      end
      return if content_width < 10 || content_height < 10

      trimmed_image = image.crop(content_left, content_top, content_width, content_height)

      # Step 2: Flatten to B&W mask and project onto Y-axis.
      # Yields a 1D array of "ink density" per row.
      ink_pixels_mask = trimmed_image.colourspace('b-w').extract_band(0) < (255 - threshold)
      _, y_axis_projection = ink_pixels_mask.project
      row_ink_densities = y_axis_projection.to_a.flatten

      # Step 3: Group rows into content blocks separated by significant whitespace
      minimum_whitespace_gap_height = (image.height * 0.10).to_i # 10% of total page height
      maximum_artifact_block_height = (image.height * 0.10).to_i # 10% of total page height
      # If find_trim already removed 15% whitespace on one side, we no longer search
      # for page numbers etc. on that side, as the small block could be real content.
      significant_trim_threshold = (image.height * 0.15).to_i

      content_blocks = find_content_blocks(row_ink_densities, minimum_whitespace_gap_height, content_height)
      if content_blocks.empty?
        return apply_padding(image, content_left, content_top, content_width, content_height, padding)
      end

      # Step 4: Remove small isolated blocks at start/end (headers/footers/page numbers)
      blocks_removed = false

      # Remove header if it's an isolated, small block AND we haven't already trimmed a lot of top whitespace
      if content_top < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.first[:height] < maximum_artifact_block_height
        content_blocks.shift
        blocks_removed = true
      end

      # Remove footer if it's an isolated, small block AND we haven't already trimmed a lot of bottom whitespace
      bottom_whitespace = image.height - (content_top + content_height)
      if bottom_whitespace < significant_trim_threshold &&
         content_blocks.size > 1 &&
         content_blocks.last[:height] < maximum_artifact_block_height
        content_blocks.pop
        blocks_removed = true
      end

      if blocks_removed
        # Step 5: Second find_trim pass on remaining content
        content_start_y = content_blocks.first[:start_y]
        content_end_y = content_blocks.last[:end_y]
        main_content = trimmed_image.crop(0, content_start_y, content_width, content_end_y - content_start_y + 1)
        begin
          # Run a second pass of find_trim to tighten up any newly exposed whitespace
          inner_left, inner_top, inner_width, inner_height = main_content.find_trim(threshold:)
        rescue Vips::Error
          return # Now it's a blank page
        end
        # Map the inner coordinates back to absolute coordinates on the original image
        final_left = content_left + inner_left
        final_top = content_top + content_start_y + inner_top
      else
        # Skip the second trim entirely if we didn't remove any artifact blocks
        final_left = content_left
        final_top = content_top
        inner_width = content_width
        inner_height = content_height
      end

      apply_padding(image, final_left, final_top, inner_width, inner_height, padding)
    end

    def self.find_content_blocks(row_ink_densities, gap_height, total_height)
      blocks = []
      block_start = nil
      gap_start = nil

      row_ink_densities.each_with_index do |density, y|
        if density > 0
          if block_start.nil?
            block_start = y
          elsif gap_start && (y - gap_start) > gap_height
            # The gap was large enough: close the previous block and start a new one
            blocks << { start_y: block_start, end_y: gap_start - 1, height: gap_start - block_start }
            block_start = y
          end
          gap_start = nil
        else
          gap_start ||= y
        end
      end

      if block_start
        end_y = gap_start ? gap_start - 1 : total_height - 1
        blocks << { start_y: block_start, end_y: end_y, height: end_y - block_start + 1 }
      end

      blocks
    end

    def self.apply_padding(original_image, left, top, width, height, padding)
      # Grow the box by `padding` on each side, clamped to the image bounds
      pad_left = [0, left - padding].max
      pad_top = [0, top - padding].max
      pad_width = [original_image.width - pad_left, width + (left - pad_left) + padding].min
      pad_height = [original_image.height - pad_top, height + (top - pad_top) + padding].min
      original_image.crop(pad_left, pad_top, pad_width, pad_height)
    end

    # A bare `private` does not apply to `def self.` methods, so hide the helpers explicitly
    private_class_method :find_content_blocks, :apply_padding
  end
end
Usage example
require 'vips'
image = Vips::Image.new_from_file('input.png')
image = image.thumbnail_image(1500, height: 1500, size: :down)
cropped = Image::Autocrop.perform(image) # returns nil for blank pages
cropped&.write_to_file('output.png')
Use at your own risk
This works well for typical document pages with body text. Pages with scattered content (infographics, posters) may lose relevant information. Test with representative samples.
Alternative: OCR-driven strip segmentation (split instead of trim)
The autocrop above reduces a page to one tight bounding box - ideal when you send the whole page to the model. Sometimes you want the opposite: cut a page into several crops and run inference per crop. This helps when one page holds many distinct items: a single full-page call forces the model to track all of them at once, while a per-strip call keeps each prompt small and the effective resolution high.
The core idea is the same as the autocrop - group content into vertical blocks separated by whitespace gaps - but the goal flips from trimming to slicing, and we drive it from OCR word boxes instead of an ink-density projection.
How it differs
| | Autocrop (above) | OCR-driven slicing |
|---|---|---|
| Goal | 1 page → 1 tight box | 1 page → N strips |
| Block source | ink-density projected onto the Y axis (OCR-free) | Tesseract word boxes → lines → gaps |
| Gap threshold | relative (~10% of page height) - only real section breaks | small absolute (~20px) - we want many cuts |
| Output | cropped page | per-strip crops fed to the model individually |
Pipeline
- OCR for layout, not text. Run Tesseract (--psm 6) and keep only the word boxes (x, y, w, h, conf). Drop left-margin artifacts (short / low-confidence words near the page edge).
- Cluster words into lines by their Y-center, with a tolerance of ~0.55 × median word height.
- Pick cuts in the gaps. Compute N evenly-spaced "ideal" cut positions, then snap each to the nearest blank gap between two lines (gap ≥ 20px). Cuts always land in whitespace, so a logical unit is never split mid-line.
- Crop each strip with a constant horizontal content box (min/max word X + padding) and the strip's Y range. The horizontal box is effectively a smarter find_trim that ignores margin noise.
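The line clustering from the second step could be sketched as follows. This is an illustration under assumptions, not our production code: words are plain hashes with `:x, :y, :w, :h` (top-left origin), and a word joins the current line when its Y-center lies within the tolerance of that line's mean Y-center.

```ruby
# Vertical center of a word box given as { x:, y:, w:, h: }.
def y_center(word)
  word[:y] + word[:h] / 2.0
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Cluster word boxes into lines, top to bottom. Returns an array of lines,
# each an array of words sorted left to right.
def cluster_into_lines(words, tolerance_factor: 0.55)
  return [] if words.empty?

  tolerance = tolerance_factor * median(words.map { |word| word[:h] })
  lines = []
  words.sort_by { |word| y_center(word) }.each do |word|
    current = lines.last
    line_center = current && current.sum { |w| y_center(w) } / current.size
    if current && (y_center(word) - line_center).abs <= tolerance
      current << word
    else
      lines << [word]
    end
  end
  lines.map { |line| line.sort_by { |word| word[:x] } }
end

words = [
  { x: 70, y: 102, w: 40, h: 20 },
  { x: 10, y: 100, w: 50, h: 20 },
  { x: 10, y: 140, w: 30, h: 20 },
]
lines = cluster_into_lines(words) # two lines: the first with two words, the second with one
```

From each cluster you would then derive the line's top/bottom and left/right as the min/max of its word boxes.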
# Snap evenly-spaced ideal cuts onto real whitespace gaps.
def pick_cuts(lines, target_strips: 5, min_gap_px: 20)
  gaps = lines.each_cons(2).filter_map do |a, b|
    (a.bottom + b.top) / 2 if (b.top - a.bottom) >= min_gap_px
  end
  return [] if gaps.empty? || target_strips <= 1

  top, height = lines.first.top, lines.last.bottom - lines.first.top
  ideal = (1...target_strips).map { |i| top + height * i / target_strips }
  ideal.map { |y| gaps.min_by { |g| (g - y).abs } }.uniq.sort
end
pick_cuts only returns the Y positions of the cuts. Turning those into actual image crops is two more steps: build the strip boundaries, then crop each strip out of the source PNG. The boundaries are anchored on the first/last line (not the page edges), so the top and bottom margins never end up in a crop:
require "vips"

# A page is a list of OCR'd `lines`, each responding to top/bottom and
# left/right (the min/max X of its words). `image` is the rendered page.
def strips_for(image, lines, padding: 12)
  return [] if lines.empty?

  cuts = pick_cuts(lines)
  boundaries = [lines.first.top, *cuts, lines.last.bottom]

  # One constant horizontal box for the whole page: tightest column that
  # holds all text, ignoring margin noise. This is the "smart find_trim".
  # Clamped to the image so the padded crop never leaves its bounds.
  x_left = [lines.map(&:left).min - padding, 0].max
  x_right = [lines.map(&:right).max + padding, image.width].min

  boundaries.each_cons(2).filter_map do |y0, y1|
    next if lines.none? { |l| l.top >= y0 && l.bottom <= y1 } # skip empty strips

    top = [y0 - padding, 0].max
    bottom = [y1 + padding, image.height].min
    image.crop(x_left, top, x_right - x_left, bottom - top)
  end
end
Each returned strip is an independent image you feed to the model on its own:
strips_for(image, lines).each do |strip|
  png = strip.write_to_buffer(".png")
  result = run_inference(png) # one small, high-resolution call per strip
  # …collect results, then stitch them back together in reading order
end
Trade-offs to keep in mind
- Depends on OCR. Anything Tesseract can't read - handwriting, stamps, faint or colored ink - is invisible to both the cut logic and the horizontal crop bounds, so it can be clipped. If that content matters, fall back to the ink-density projection from the autocrop as a second source for the bounds; it sees ink the OCR misses.
- Header/footer/page numbers are not removed by this slicing step (they end up in the first/last strip). The autocrop's "small leading/trailing block = artifact" rule is worth porting if they pollute your prompts.
- Resolution is decoupled from cost. Because each call only sees its strip, you can render the page higher than you would for a single full-page call without blowing up per-call token count.