A page scanned upside down or sideways can confuse OCR engines and vision LLMs. Both often cope with such input, but extraction quality tends to be better when every image we pass in has correctly oriented text.
Detecting and correcting the image orientation does not require extra hardware on our web servers; it only adds a bit of complexity to the overall pipeline.
Approach
Tesseract ships with an Orientation and Script Detection (OSD) mode (--psm 0) that returns a rotation angle and a confidence score. We run it on a small sector of the image, read the angle, and rotate the image with libvips.
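To make the OSD result concrete, here is a minimal sketch of reading the two fields we care about from OSD-style text. The sample block only mimics the shape of Tesseract's OSD report; its numbers are made up for illustration and do not come from a real run. parse_osd is a hypothetical helper name, not part of Tesseract or libvips.

```ruby
# Illustrative OSD-style report. The values are invented, not from a real scan.
SAMPLE_OSD = <<~OSD
  Page number: 0
  Orientation in degrees: 270
  Rotate: 90
  Orientation confidence: 4.12
  Script: Latin
  Script confidence: 1.85
OSD

# Hypothetical helper: returns { angle:, confidence: } or nil when the
# text contains no usable orientation result.
def parse_osd(text)
  angle = text[/^Rotate: (\d+)/, 1]
  confidence = text[/^Orientation confidence: ([\d.]+)/, 1]
  return nil unless angle && confidence

  { angle: angle.to_i, confidence: confidence.to_f }
end

result = parse_osd(SAMPLE_OSD)
puts "rotate #{result[:angle]} degrees clockwise, confidence #{result[:confidence]}"
puts 'no result' if parse_osd('').nil?
```

Note that the sketch reads "Rotate", the correction to apply, rather than "Orientation in degrees", which describes the current state of the page.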
Two caveats drive the design:
- Tesseract OSD needs a minimum amount of text to be confident. Running it on a full page often picks up noise (borders, stamps, tiny footer text) and returns low-confidence garbage.
- OSD takes ~1s per A4 page, so we probe small crops instead of the full page to keep that cost down.
So we crop a few candidate sectors (center first, then corners), probe each with OSD until one clears a confidence threshold, and rotate the image accordingly. If nothing clears the threshold, we leave the image alone.
If the image is smaller than a single sector in either dimension, sector scanning makes no sense. We skip the loop and run OSD once on the whole image instead, trusting whatever confidence Tesseract reports since there is no better alternative.
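As a worked example of the sector layout, the sketch below computes where the five candidate sectors land on an A4 page rendered at 150 DPI (roughly 1240 × 1754 px), using the 650 px sector size and 100 px padding from the implementation that follows. sector_origins is a hypothetical name used only for this illustration.

```ruby
SECTOR_SIZE = 650 # px, edge length of one sector
PADDING     = 100 # px, inset of the corner sectors from the page edge

# Returns [name, x, y] for each candidate sector in probing order,
# or nil when the image is too small for sector scanning.
def sector_origins(width, height)
  return nil if width < SECTOR_SIZE || height < SECTOR_SIZE

  far_x = width - SECTOR_SIZE - PADDING
  far_y = height - SECTOR_SIZE - PADDING
  [
    ['Center', (width - SECTOR_SIZE) / 2, (height - SECTOR_SIZE) / 2],
    ['Top-Left', PADDING, PADDING],
    ['Bottom-Left', PADDING, far_y],
    ['Top-Right', far_x, PADDING],
    ['Bottom-Right', far_x, far_y],
  ]
end

sector_origins(1240, 1754).each do |name, x, y|
  puts format('%-12s x=%4d y=%4d', name, x, y)
end
# Center lands at (295, 552), Bottom-Right at (490, 1004).

puts 'whole-image fallback' if sector_origins(600, 800).nil?
```

On this page size the corner sectors overlap the center sector. That is harmless, since each sector is probed independently and the loop stops at the first confident result.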
Implementation
This class takes a Vips::Image and returns a rotated version as another Vips::Image. This matches the interface of Image::Autocrop, so the two can easily be chained.
require 'open3'
require 'vips'

class Autorotate
  Result = Data.define(:angle, :confidence, :source)

  MIN_CONFIDENCE = 2.0
  SECTOR_SIZE = 650
  PADDING = 100
  TESSERACT_CMD = "tesseract stdin stdout --psm 0 -c min_characters_to_try=20".freeze

  # Tesseract "Rotate: N" means "rotate N° clockwise to upright".
  # libvips rot :d90 / :d180 / :d270 are clockwise rotations, so mapping is 1:1.
  VIPS_ROTATIONS = {
    0 => nil,
    90 => :d90,
    180 => :d180,
    270 => :d270,
  }.freeze

  def self.perform(image)
    result = detect(image)
    rotation = VIPS_ROTATIONS[result.angle]
    rotation ? image.rot(rotation) : image
  end

  def self.detect(image)
    if image.width < SECTOR_SIZE || image.height < SECTOR_SIZE
      angle, confidence = scan_whole(image)
      return Result.new(angle: angle, confidence: confidence, source: 'Whole image')
    end

    sectors = define_sectors(image.width, image.height)
    sectors.each do |sector|
      angle, confidence = scan_sector(image, sector[:x], sector[:y])
      if confidence >= MIN_CONFIDENCE
        return Result.new(angle: angle, confidence: confidence, source: sector[:name])
      end
    end

    Result.new(angle: 0, confidence: 0.0, source: 'No confident text')
  end

  def self.define_sectors(width, height)
    [
      { name: 'Center', x: (width - SECTOR_SIZE) / 2, y: (height - SECTOR_SIZE) / 2 },
      { name: 'Top-Left', x: PADDING, y: PADDING },
      { name: 'Bottom-Left', x: PADDING, y: height - SECTOR_SIZE - PADDING },
      { name: 'Top-Right', x: width - SECTOR_SIZE - PADDING, y: PADDING },
      { name: 'Bottom-Right', x: width - SECTOR_SIZE - PADDING, y: height - SECTOR_SIZE - PADDING },
    ]
  end

  def self.scan_whole(image)
    run_osd(image.write_to_buffer('.png'))
  end

  def self.scan_sector(image, x, y)
    safe_x = [0, x].max
    safe_y = [0, y].max
    safe_width = [SECTOR_SIZE, image.width - safe_x].min
    safe_height = [SECTOR_SIZE, image.height - safe_y].min
    return [0, 0.0] if safe_width < 100 || safe_height < 100

    run_osd(image.crop(safe_x, safe_y, safe_width, safe_height).write_to_buffer('.png'))
  end

  def self.run_osd(png_bytes)
    stdout, _stderr, status = Open3.capture3(
      TESSERACT_CMD,
      stdin_data: png_bytes, binmode: true,
    )
    return [0, 0.0] unless status.success?

    angle = stdout[/Rotate: (\d+)/, 1].to_i
    confidence = stdout[/Orientation confidence: ([\d\.]+)/, 1].to_f
    [angle, confidence]
  end
end
There are a few variables that can be tuned:
- SECTOR_SIZE = 650 is a good default for images rendered at 150-200 DPI. Large enough to contain multiple text lines, small enough that Tesseract finishes quickly. Scale it up for much larger inputs.
- Sector priority: center first (body text is the cleanest signal), corners as fallback for pages with wide margins or figures in the middle.
- MIN_CONFIDENCE = 2.0 is Tesseract's confidence score, not a percentage. Below this, the result is essentially a guess. Well-scanned pages typically land between 2 and 5.
- min_characters_to_try=20 makes OSD bail out quickly on near-empty sectors.
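One possible way to "scale it up" is to grow the sector size linearly with the render resolution. The sketch below is an assumption, not something the implementation above does: it treats 200 DPI as the reference point for the 650 px default and never goes below that default. sector_size_for, BASE_SECTOR_SIZE and BASE_DPI are hypothetical names.

```ruby
BASE_SECTOR_SIZE = 650 # px, the default sector size from above
BASE_DPI         = 200 # assumption: upper end of the 150-200 DPI range

# Hypothetical helper: sector edge length in px for a given render DPI.
# Scales linearly above BASE_DPI and keeps the default at or below it.
def sector_size_for(dpi)
  scaled = (BASE_SECTOR_SIZE * dpi / BASE_DPI.to_f).round
  [scaled, BASE_SECTOR_SIZE].max
end

[150, 200, 300, 400].each do |dpi|
  puts "#{dpi} DPI: #{sector_size_for(dpi)} px"
end
```

Linear scaling keeps the physical area covered by a sector constant, so a sector at 400 DPI contains about the same number of text lines as one at 200 DPI. Whether that is the right trade-off against OSD runtime is worth measuring on real scans.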
Usage
require 'vips'
image = Vips::Image.new_from_file('sideways.png')
rotated = Autorotate.perform(image)
rotated.write_to_file('rotated.png')
For multi-page PDFs, render each page to an image first (pdftoppm -png -r 200 input.pdf out), or use libvips' PDF loader, then call the detector per page.
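A minimal sketch of the per-page loop, assuming the pages were already rendered to files named "<prefix>-<number>.png" and that ruby-vips and the Autorotate class above are loaded. page_files and autorotate_pages are hypothetical helper names. Pages are sorted numerically so the order is correct whether or not the renderer zero-pads page numbers.

```ruby
# Assumes ruby-vips and Autorotate are already loaded, and that pages were
# rendered as "<prefix>-<number>.png" (the naming is an assumption).
PAGE_FILE = /-(\d+)\.png\z/

# Hypothetical helper: rendered page files for a prefix, in page order.
def page_files(dir, prefix)
  Dir.glob(File.join(dir, "#{prefix}-*.png"))
     .select { |path| path.match?(PAGE_FILE) }
     .sort_by { |path| path[PAGE_FILE, 1].to_i }
end

# Hypothetical helper: rotate every page and write the result next to the
# original as "<prefix>-<number>.rotated.png". Returns the written paths.
def autorotate_pages(dir, prefix)
  page_files(dir, prefix).map do |path|
    target = path.sub(/\.png\z/, '.rotated.png')
    Autorotate.perform(Vips::Image.new_from_file(path)).write_to_file(target)
    target
  end
end
```

Each page is detected independently, so a document that mixes portrait and sideways pages is handled page by page. A second run skips the ".rotated.png" files because they do not match PAGE_FILE.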
Chaining with autocrop
Orientation detection and autocrop are complementary: both clean up the image before OCR or a vision LLM sees it. Rotate first, then crop: cropping a sideways page would trim the wrong edges, and the autocrop header/footer heuristic assumes the image is already upright.
original_image = Vips::Image.new_from_file('large_sideways_image.png')
thumbnail = original_image.thumbnail_image(1500, height: 1500, size: :down)
rotated = Autorotate.perform(thumbnail)
cropped = Autocrop.perform(rotated)
cropped&.write_to_file('ready_for_llm_analysis.png')
Caveats
- Pages without enough readable text in any sector fall through to angle: 0, confidence: 0.0. That is the intended behavior: better to leave a page alone than to rotate it based on noise. Figure-only pages, near-blank separator pages, and heavily handwritten pages will pass through unchanged.
- OSD only detects rotations in 90° increments (0, 90, 180, 270). Skewed scans (a few degrees off) need a separate deskew step. Tika handles this with detectAngles="true" (see the Tika configuration card).