When working with file uploads, we sometimes need to process intrinsic properties like the page count or page dimensions of PDF files. Retrieving those properties requires us to download (from S3 or GlusterFS) and parse the file, which is slow and resource-intensive.
Active Storage provides the metadata column on ActiveStorage::Blob to cache these values. You can populate this column either with ad-hoc metadata caching or with custom Analyzers.
Attachments vs. Blobs
Let's recap how Active Storage is structured:
- ActiveStorage::Attachments are the join records connecting your application's models to the underlying files.
- ActiveStorage::Blobs represent the actual file along with its metadata.
Because a single Blob can be copied and shared across multiple Attachments, the metadata stored on a Blob should only contain intrinsic properties of the file itself (like page count, duration, dimensions etc.). Do not store domain-specific or record-specific state on a blob, as it could leak across different contexts.
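The leak described above can be illustrated with a minimal plain-Ruby sketch. `FakeBlob` and `FakeAttachment` are hypothetical stand-ins, not the Active Storage classes; they only model the fact that several attachments may point at the same blob.

```ruby
# Hypothetical stand-ins for ActiveStorage::Blob and ActiveStorage::Attachment.
FakeBlob = Struct.new(:filename, :metadata)
FakeAttachment = Struct.new(:record_name, :blob)

blob = FakeBlob.new("resume.pdf", {})

# Two different records reference the same underlying file.
user_resume = FakeAttachment.new("User#1", blob)
application_document = FakeAttachment.new("JobApplication#7", blob)

# Intrinsic property: true for the file no matter which record references it.
user_resume.blob.metadata[:page_count] = 3

# Record-specific state: written through one attachment ...
user_resume.blob.metadata[:reviewed] = true

# ... but visible through every other attachment that shares the blob.
puts application_document.blob.metadata[:page_count] # 3
puts application_document.blob.metadata[:reviewed]   # true
```

The page count is harmless to share, while the `reviewed` flag now appears on a record that was never reviewed.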
Ad-hoc Metadata Caching
For isolated, single-use requirements, you can write directly to the blob's metadata hash. This evaluates the property the first time it is requested and persists it for subsequent calls.
require "open3"

class User < ApplicationRecord
  has_one_attached :pdf_resume

  def page_count
    return unless pdf_resume.attached?

    blob = pdf_resume.blob
    return blob.metadata[:page_count] if blob.metadata[:page_count].present?

    # Download the file to a tempfile and let pdfinfo inspect that local copy.
    blob.open do |tempfile|
      pdf_info, stderr, status = Open3.capture3("pdfinfo", tempfile.path)
      raise "pdfinfo execution failed: #{stderr}" unless status.success?

      if (match = pdf_info.match(/^Pages:\s*(\d+)/))
        page_count = match[1].to_i
        # Persist the value so later calls skip the download.
        blob.update!(metadata: blob.metadata.merge(page_count:))
      end
    end

    blob.metadata[:page_count]
  end
end
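The parsing step can be tried in isolation, without Rails or a `pdfinfo` binary. The following is a minimal sketch: `parse_page_count` is a hypothetical helper name, and the sample text is made up to resemble the "Key: value" lines that the regular expression above expects.

```ruby
# Extracts the page count from the text output of `pdfinfo`.
# Returns nil when no "Pages:" line is present.
def parse_page_count(pdfinfo_output)
  match = pdfinfo_output.match(/^Pages:\s*(\d+)/)
  match && match[1].to_i
end

# Made-up sample input, not captured from a real file.
sample_output = <<~TEXT
  Title:          Resume
  Pages:          3
  Encrypted:      no
TEXT

puts parse_page_count(sample_output).inspect      # 3
puts parse_page_count("Title: Resume\n").inspect  # nil
```

Because `^` matches at the start of any line in Ruby, the "Pages:" line is found wherever it appears in the output.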
Custom Active Storage Analyzers
For properties needed across multiple models, a custom Analyzer can enrich all newly uploaded files of a certain MIME type.
All Analyzers must inherit from ActiveStorage::Analyzer and implement at least these two methods:
- self.accept?(blob): Determines if this analyzer should run for a given blob.
- metadata: Returns a Hash of data to be merged into the blob's metadata column.
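The contract between these two methods can be sketched in plain Ruby. This is a simplified model, not the Active Storage implementation: all `Sketch*` names, the `ANALYZERS` list and the `analyze` helper are hypothetical, and the sketch assumes that the first registered analyzer whose `accept?` returns true is the one that runs.

```ruby
# Hypothetical stand-ins; real analyzers inherit from ActiveStorage::Analyzer.
SketchBlob = Struct.new(:content_type, :metadata)

class SketchAnalyzer
  attr_reader :blob

  def initialize(blob)
    @blob = blob
  end
end

class SketchTextAnalyzer < SketchAnalyzer
  def self.accept?(blob)
    blob.content_type.start_with?("text/")
  end

  def metadata
    { line_count: 10 } # placeholder value
  end
end

class SketchPdfAnalyzer < SketchAnalyzer
  def self.accept?(blob)
    blob.content_type == "application/pdf"
  end

  def metadata
    { page_count: 3 } # placeholder; a real analyzer would parse the file
  end
end

ANALYZERS = [SketchTextAnalyzer, SketchPdfAnalyzer].freeze

# Picks the first analyzer that accepts the blob and merges its result
# into the blob's metadata, flagging the blob as analyzed.
def analyze(blob)
  analyzer_class = ANALYZERS.detect { |klass| klass.accept?(blob) }
  extracted = analyzer_class ? analyzer_class.new(blob).metadata : {}
  blob.metadata = blob.metadata.merge(extracted).merge(analyzed: true)
end

pdf = SketchBlob.new("application/pdf", {})
png = SketchBlob.new("image/png", {})
analyze(pdf)
analyze(png)

puts pdf.metadata.inspect
puts png.metadata.inspect
```

In this model the PNG blob ends up flagged as analyzed without any extra keys, because no registered analyzer accepts it.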
Active Storage usually queues an ActiveStorage::AnalyzeJob automatically when a matching file is attached. The above example could be refactored like so:
# app/util/pdf_analyzer.rb
require "open3"

class PdfAnalyzer < ActiveStorage::Analyzer
  class Error < StandardError; end

  def self.accept?(blob)
    blob.content_type == "application/pdf"
  end

  def metadata
    download_blob_to_tempfile do |file|
      pdf_info, stderr, status = Open3.capture3("pdfinfo", file.path.to_s)
      raise Error, "pdfinfo execution failed: #{stderr}" unless status.success?

      if (match = pdf_info.match(/^Pages:\s*(\d+)/))
        { page_count: match[1].to_i }
      else
        {}
      end
    end
  rescue StandardError => e
    # A failed analysis is logged and yields no metadata instead of raising.
    Rails.logger.error("PdfAnalyzer failed to parse metadata: #{e.message}")
    {}
  end

  # Run the analysis in a background job rather than inline.
  def self.analyze_later?
    true # default
  end
end
# config/initializers/active_storage.rb
Rails.application.config.active_storage.analyzers.append PdfAnalyzer
# db/migrate/xxxxx_add_pdf_analyzer.rb
class AddPdfAnalyzer < ActiveRecord::Migration[8.0]
  def up
    pdf_blobs = ActiveStorage::Blob.where(content_type: "application/pdf")

    # Mark all existing PDF blobs as to-be-analyzed-again
    pdf_blobs.find_each do |blob|
      updated_metadata = blob.metadata.except("analyzed", :analyzed)
      blob.update_column(:metadata, updated_metadata)
    end
  end
end
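# app/models/user.rb
def page_count
  return unless pdf_resume.attached?

  blob = pdf_resume.blob
  # Analyze on demand if the background job has not run yet.
  blob.analyze unless blob.analyzed?
  # Raises KeyError if the analyzer could not determine a page count.
  blob.metadata.fetch(:page_count)
end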
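Why the migration clears the analyzed flag can be shown with a small plain-Ruby model. `SketchLazyBlob` and `page_count_for` are hypothetical names and the page count of 3 is a placeholder; the sketch only mirrors the "analyze unless analyzed?" check from the method above.

```ruby
# Hypothetical model of a blob that remembers whether it was analyzed.
class SketchLazyBlob
  attr_reader :metadata, :analysis_runs

  def initialize(metadata)
    @metadata = metadata
    @analysis_runs = 0
  end

  def analyzed?
    metadata["analyzed"] == true
  end

  # Stands in for the real analysis; 3 is a placeholder page count.
  def analyze
    @analysis_runs += 1
    @metadata = metadata.merge("page_count" => 3, "analyzed" => true)
  end
end

def page_count_for(blob)
  blob.analyze unless blob.analyzed?
  blob.metadata["page_count"]
end

# A blob that was analyzed before the PDF analyzer existed.
legacy_blob = SketchLazyBlob.new({ "analyzed" => true })
puts page_count_for(legacy_blob).inspect # nil, no analysis is triggered

# What the migration does: drop the flag so the blob counts as unanalyzed.
reset_blob = SketchLazyBlob.new(legacy_blob.metadata.except("analyzed"))
puts page_count_for(reset_blob)   # 3
puts page_count_for(reset_blob)   # 3
puts reset_blob.analysis_runs     # 1
```

Without the reset, blobs uploaded before the analyzer was registered would keep their flag and never receive a page count. After the reset, the analysis runs once and later calls read the cached value.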
Note
Active Storage comes with built-in, autoloaded analyzers, including ActiveStorage::Analyzer::ImageAnalyzer and ActiveStorage::Analyzer::VideoAnalyzer.