When an LLM is part of a request, you want to keep track of how much of the runtime was spent waiting on it, how many calls it took, and the ratio of input to output tokens, without opening a debugger every time. Just scrolling the log then catches prompt bloat, runaway tool-call loops, and regressions from model changes. The same pattern applies to any slow external API you want visibility into.
Hooking into Rails' controller instrumentation lets us append this information to every request's completion log line:
Completed 200 OK in 12929ms (Views: 32.0ms | ActiveRecord: 6.7ms (6 queries, 0 cached) | Ollama: 11443.1ms (queries: 8, input: 1218, output: 158) | GC: 87.7ms)
This builds on the same notification mechanism as Rails 7's Server Timing support, so in development the numbers also show up in the browser DevTools' "Server Timing" panel (via the ActionDispatch::ServerTiming middleware).
Collecting the metric
Aggregate per-request Ollama stats in a CurrentAttributes singleton and override log_process_action to append them to the controller's completion log line:
# app/models/ollama_runtime.rb
class OllamaRuntime < ActiveSupport::CurrentAttributes
  attribute :runtime, :query_count, :input_tokens, :output_tokens

  def self.register_response_timing(runtime:, input_tokens:, output_tokens:)
    self.query_count = query_count.to_i + 1
    self.runtime = self.runtime.to_f + runtime
    self.input_tokens = self.input_tokens.to_i + input_tokens.to_i
    self.output_tokens = self.output_tokens.to_i + output_tokens.to_i
  end

  def self.to_log_line
    format(
      "Ollama: %<runtime>.1fms (queries: %<queries>d, input: %<input>d, output: %<output>d)",
      runtime: runtime,
      queries: query_count,
      input: input_tokens,
      output: output_tokens,
    )
  end

  # Appends our stats to the "Completed 200 OK in ..." line, mirroring how
  # ActiveRecord::Railties::ControllerRuntime reports its query runtime.
  module ControllerExtension
    extend ActiveSupport::Concern

    module ClassMethods
      def log_process_action(payload)
        messages = super
        messages << OllamaRuntime.to_log_line if OllamaRuntime.query_count.to_i.positive?
        messages
      end
    end
  end
end
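The accumulation and formatting above are plain Ruby, so they can be sanity-checked without booting Rails. A minimal stand-alone sketch of what two registered responses add up to (the lambda and all numbers are made up for illustration):

```ruby
# Plain-Ruby stand-in for OllamaRuntime's accumulation logic (no Rails needed).
totals = { runtime: 0.0, query_count: 0, input_tokens: 0, output_tokens: 0 }

register = lambda do |runtime:, input_tokens:, output_tokens:|
  totals[:query_count] += 1
  totals[:runtime] += runtime
  totals[:input_tokens] += input_tokens.to_i
  totals[:output_tokens] += output_tokens.to_i
end

# Two hypothetical Ollama responses within one request:
register.call(runtime: 320.25, input_tokens: 900, output_tokens: 120)
register.call(runtime: 181.25, input_tokens: 300, output_tokens: 40)

line = format(
  "Ollama: %<runtime>.1fms (queries: %<queries>d, input: %<input>d, output: %<output>d)",
  runtime: totals[:runtime],
  queries: totals[:query_count],
  input: totals[:input_tokens],
  output: totals[:output_tokens],
)
puts line # => Ollama: 501.5ms (queries: 2, input: 1200, output: 160)
```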
Subscribe to a custom request.ollama event and accumulate runtime and token counts into the CurrentAttributes:
# config/initializers/ollama_instrumentation.rb
ActiveSupport::Notifications.subscribe('request.ollama') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  OllamaRuntime.register_response_timing(
    runtime: event.duration,
    input_tokens: event.payload[:input_tokens],
    output_tokens: event.payload[:output_tokens],
  )
end

ActiveSupport.on_load(:action_controller) do
  include OllamaRuntime::ControllerExtension
end
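If the instrument/subscribe round trip is unfamiliar: instrument times a block, lets it fill a payload hash, and then hands timing plus payload to every subscriber of that event name. A dependency-free sketch of that data flow (MiniNotifications is a made-up miniature, not the real ActiveSupport API):

```ruby
# Made-up miniature of ActiveSupport::Notifications to show the data flow.
module MiniNotifications
  @subscribers = Hash.new { |hash, key| hash[key] = [] }

  def self.subscribe(name, &block)
    @subscribers[name] << block
  end

  def self.instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield payload
    duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
    @subscribers[name].each { |subscriber| subscriber.call(duration_ms, payload) }
    result
  end
end

captured = nil
MiniNotifications.subscribe('request.ollama') do |duration_ms, payload|
  captured = { duration_ms: duration_ms, payload: payload }
end

# The instrumented block fills the payload, just like the HTTP wrapper does:
MiniNotifications.instrument('request.ollama') do |payload|
  payload[:input_tokens] = 42
end

puts captured[:payload][:input_tokens] # => 42
```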
Finally, wrap all actual Ollama HTTP calls in ActiveSupport::Notifications.instrument and populate the payload with the desired metric from Ollama's response:
# app/util/ollama.rb
class Ollama
  # Illustrative wrapper; route all actual Ollama HTTP calls through it.
  def self.chat(body)
    ActiveSupport::Notifications.instrument('request.ollama') do |payload|
      response = HTTP.post('https://my-ollama-instance.com/api/chat', json: body)

      if response.status.success?
        begin
          parsed = response.parse
          payload[:input_tokens] = parsed['prompt_eval_count']
          payload[:output_tokens] = parsed['eval_count']
        rescue HTTP::Error, JSON::ParserError => e
          # Token counts are best effort; don't let their extraction fail the request.
          Rails.logger.error("[Ollama] Failed to parse response for token counts: #{e.message}")
        end
      end

      response
    end
  end
end
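For reference, prompt_eval_count and eval_count come straight from Ollama's non-streaming chat response. An abbreviated example body (field values invented, field names are Ollama's), parsed the same way:

```ruby
require 'json'

# Abbreviated non-streaming Ollama /api/chat response; the values are made up,
# but prompt_eval_count / eval_count are the token-count fields we read above.
raw = <<~JSON
  {
    "model": "llama3",
    "message": { "role": "assistant", "content": "Hello!" },
    "done": true,
    "prompt_eval_count": 152,
    "eval_count": 19
  }
JSON

parsed = JSON.parse(raw)
puts parsed['prompt_eval_count'] # => 152 (input tokens)
puts parsed['eval_count']        # => 19 (output tokens)
```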
A note on background jobs
LLM calls tend to be slow, so apart from maybe embedding generation they should usually happen in a background job. The ControllerExtension above has no effect there.
CurrentAttributes are reset per request in Rails, and since Rails 6.1 ActiveJob also resets them around each perform, so no manual teardown is needed. You only need to log the totals yourself:
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    block.call
    next unless OllamaRuntime.query_count.to_i.positive?

    Rails.logger.info("[#{job.class.name}] #{OllamaRuntime.to_log_line}")
  end
end