Ollama: Log generation runtime

When an LLM is part of a request, you want to keep track of how much of the runtime was spent waiting on it, how many calls it took, and the ratio of input to output tokens without opening a debugger every time. Passively watching these numbers catches prompt bloat, runaway tool-call loops, and regressions from model changes just by scrolling the log. The same pattern applies to any slow external API you want visibility on.

Hooking into Rails' controller instrumentation lets us append information like this to every request's completion log line:

Completed 200 OK in 12929ms (Views: 32.0ms | ActiveRecord: 6.7ms (6 queries, 0 cached) | Ollama: 11443.1ms (queries: 8, input: 1218, output: 158) | GC: 87.7ms)

This integrates with Rails 7's Server Timing middleware, so the same numbers also show up in the browser's DevTools "Server Timing" panel.
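
Since the metric below is emitted as an ActiveSupport::Notifications event, the Server Timing middleware reports its duration in the Server-Timing response header. Roughly like this (values illustrative):

```
Server-Timing: sql.active_record;dur=6.7, request.ollama;dur=11443.1
```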

Collecting the metric

Aggregate per-request Ollama stats in an ActiveSupport::CurrentAttributes subclass and override log_process_action to append them to the controller's log line:

# app/models/ollama_runtime.rb
class OllamaRuntime < ActiveSupport::CurrentAttributes

  attribute :runtime, :query_count, :input_tokens, :output_tokens

  # Note: the keyword arguments shadow the attribute readers of the same
  # name, so reads of those attributes need an explicit self receiver.
  def self.register_response_timing(runtime:, input_tokens:, output_tokens:)
    self.query_count = query_count.to_i + 1
    self.runtime = self.runtime.to_f + runtime
    self.input_tokens = self.input_tokens.to_i + input_tokens.to_i
    self.output_tokens = self.output_tokens.to_i + output_tokens.to_i
  end

  def self.to_log_line
    format(
      "Ollama: %<runtime>.1fms (queries: %<queries>d, input: %<input>d, output: %<output>d)",
      runtime: runtime,
      queries: query_count,
      input: input_tokens,
      output: output_tokens,
    )
  end

  module ControllerExtension
    extend ActiveSupport::Concern

    module ClassMethods
      def log_process_action(payload)
        messages = super
        messages << OllamaRuntime.to_log_line if OllamaRuntime.query_count.to_i.positive?
        messages
      end
    end
  end

end
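
The accumulation and formatting above are plain Ruby, so they are easy to sanity-check in isolation. A framework-free sketch, with a hypothetical `Stats` struct standing in for the CurrentAttributes object:

```ruby
# Framework-free sketch of the accumulation + formatting logic above.
# `Stats` is a hypothetical stand-in for the CurrentAttributes object.
Stats = Struct.new(:runtime, :query_count, :input_tokens, :output_tokens) do
  def register(runtime:, input_tokens:, output_tokens:)
    self.query_count = query_count.to_i + 1
    self.runtime = self.runtime.to_f + runtime
    self.input_tokens = self.input_tokens.to_i + input_tokens.to_i
    self.output_tokens = self.output_tokens.to_i + output_tokens.to_i
  end

  def to_log_line
    format(
      "Ollama: %<runtime>.1fms (queries: %<queries>d, input: %<input>d, output: %<output>d)",
      runtime: runtime, queries: query_count, input: input_tokens, output: output_tokens,
    )
  end
end

stats = Stats.new
stats.register(runtime: 10_000.6, input_tokens: 1000, output_tokens: 100)
stats.register(runtime: 1442.5, input_tokens: 218, output_tokens: 58)
stats.to_log_line
# => "Ollama: 11443.1ms (queries: 2, input: 1218, output: 158)"
```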

Subscribe to a custom request.ollama event and accumulate runtime and token counts into the CurrentAttributes:

# config/initializers/ollama_instrumentation.rb
ActiveSupport::Notifications.subscribe('request.ollama') do |event|
  OllamaRuntime.register_response_timing(
    runtime: event.duration, # milliseconds
    input_tokens: event.payload[:input_tokens],
    output_tokens: event.payload[:output_tokens],
  )
end

ActiveSupport.on_load(:action_controller) do
  include OllamaRuntime::ControllerExtension
end

Finally, wrap all actual Ollama HTTP calls in ActiveSupport::Notifications.instrument and populate the payload with the desired metrics from Ollama's response:

# app/util/ollama.rb
ActiveSupport::Notifications.instrument('request.ollama') do |payload|
  response = HTTP.post('https://my-ollama-instance.com/api/chat', json: body)

  if response.status.success?
    begin
      parsed = response.parse
      payload[:input_tokens] = parsed['prompt_eval_count']
      payload[:output_tokens] = parsed['eval_count']
    rescue HTTP::Error, JSON::ParserError => e
      # Missing token counts should never fail the actual request.
      Rails.logger.error("[Ollama] Failed to parse response for token counts: #{e.message}")
    end
  end

  response
end
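
Ollama's /api/chat response reports token usage in prompt_eval_count (input tokens) and eval_count (output tokens). A minimal extraction sketch, using an abridged, made-up response body:

```ruby
require 'json'

# Abridged, made-up /api/chat response body. Ollama reports token usage
# in prompt_eval_count (input tokens) and eval_count (output tokens).
raw = <<~JSON
  {
    "model": "llama3.1",
    "message": { "role": "assistant", "content": "Hello!" },
    "done": true,
    "prompt_eval_count": 1218,
    "eval_count": 158
  }
JSON

parsed = JSON.parse(raw)
payload = {
  input_tokens: parsed['prompt_eval_count'],
  output_tokens: parsed['eval_count'],
}
# payload => { input_tokens: 1218, output_tokens: 158 }
```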

A note on background jobs

LLM calls tend to be slow, so outside of quick operations like embedding generation they usually belong in a background job. The ControllerExtension above has no effect there, since jobs write no controller log line.

CurrentAttributes are reset per request in Rails, and since Rails 6.1 ActiveJob also resets them around each perform, so no manual teardown is needed. You only need to log the totals yourself:

# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base

  around_perform do |job, block|
    block.call
    next unless OllamaRuntime.query_count.to_i.positive?

    Rails.logger.info("[#{job.class.name}] #{OllamaRuntime.to_log_line}")
  end

end
License
Source code in this card is licensed under the MIT License.
Posted by Michael Leimstädtner to makandra dev (2026-04-22 10:53)