
OpenAI TTS: How to generate audio samples with more than 4096 characters

Michael Leimstädtner (Software engineer at makandra GmbH)
December 11, 2023

OpenAI currently limits its audio generation API endpoint to input texts of at most 4096 characters.
You can work around that limit by splitting the text into smaller fragments, synthesizing each fragment separately, and stitching the resulting mp3 files together with a CLI tool like mp3wrap or ffmpeg.
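
In a nutshell, the approach looks like this (a minimal sketch; long_text is a placeholder, the 3800-character fragment size just leaves some headroom below the limit, and sentence boundaries are ignored here, unlike in the full implementation below):

# Cut the text into fragments that stay safely below the 4096-character limit,
# synthesize each fragment on its own, then concatenate the resulting mp3 files.
fragments = long_text.chars.each_slice(3800).map(&:join)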

Example Ruby Implementation

Usage

input_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Mi eget mauris pharetra et ultrices neque."
output_mp3_path = Rails.root.join("tts/ipsum.mp3")
Synthesizer.new(input_text, output_mp3_path).to_mp3

Code


require "ruby-openai"
require "active_support/all"

##
# This class synthesizes speech for smaller text chunks (roughly 3800 characters each)
# and combines the resulting mp3 files afterwards.
# To avoid artifacts like cut-off words between chunks, each text fragment is extended
# to the end of the sentence it would otherwise cut off.
#
class Synthesizer
  OPENAI_CHARACTER_LIMIT = 4096
  OPENAI_MODEL = "tts-1-hd"
  OPENAI_VOICE = "alloy"
  TEXT_FRAGMENT_BUFFER = 300 # headroom so the rest of a cut-off sentence still fits below the limit
  SENTENCE_REGEX = /(?<sentence>.*?\.)/m # the rest of the current sentence, up to and including the next period

  def initialize(input_text, output_mp3_path)
    @input_text = input_text
    @output_mp3_path = output_mp3_path
  end

  def to_mp3
    mp3_fragment_paths = synthesize_fragments
    combine_files(mp3_fragment_paths)
    FileUtils.rm(mp3_fragment_paths)
    @output_mp3_path
  end

  private

  def synthesize_fragments
    text = @input_text.dup
    mp3_fragment_paths = []
    while text.present?
      text_fragment = text[0, OPENAI_CHARACTER_LIMIT - TEXT_FRAGMENT_BUFFER]
      text.delete_prefix!(text_fragment)
      if SENTENCE_REGEX.match?(text)
        sentence_rest = SENTENCE_REGEX.match(text)[:sentence]
        raise ArgumentError, "Adding the remaining sentence would exceed OpenAI's TTS limit" if sentence_rest.size > TEXT_FRAGMENT_BUFFER
        text.delete_prefix!(sentence_rest)
        text_fragment << sentence_rest
      end
      puts "Generating audio fragment with #{text_fragment.size} characters (remaining: #{text.size})"
      tts_response = openai_client.audio.speech(
        parameters: {
          model: OPENAI_MODEL,
          input: text_fragment,
          voice: OPENAI_VOICE,
        }
      )
      mp3_fragment_path = @output_mp3_path.sub(/\.mp3$/, "#{Time.now.to_i}_#{mp3_fragment_paths.count}.mp3")
      File.binwrite(mp3_fragment_path, tts_response)
      mp3_fragment_paths << mp3_fragment_path
    end
    mp3_fragment_paths
  end

  def combine_files(mp3_fragment_paths)
    cli_command = "mp3wrap #{@output_mp3_path} #{mp3_fragment_paths.join(' ')}"
    _stdout_string, stderr_string, status = Open3.capture3(cli_command)
    raise "Could not combine mp3s: #{stderr_string}" unless status.exitstatus == 0
  end

  def openai_client
    OpenAI::Client.new(access_token: ENV["API_TOKEN"])
  end
end
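
If you prefer ffmpeg over mp3wrap, combine_files could use ffmpeg's concat protocol instead. This is only a sketch; it assumes all fragments share the same codec settings (which they do when they come from the same TTS model), so the audio streams can be copied without re-encoding:

  def combine_files(mp3_fragment_paths)
    cli_command = %(ffmpeg -y -i "concat:#{mp3_fragment_paths.join('|')}" -acodec copy #{@output_mp3_path})
    _stdout_string, stderr_string, status = Open3.capture3(cli_command)
    raise "Could not combine mp3s: #{stderr_string}" unless status.exitstatus == 0
  end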

Note

Use at your own risk. If you plan to use this in production, please add some tests first; the code above was written as part of a hackathon.
The tts-1-hd model used above is quite expensive. Be careful when calling it in a loop like the one above.
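
If you want a rough idea of the cost upfront, you can estimate it from the character count before synthesizing. This is only a sketch; the price constant is a placeholder, so look up OpenAI's current per-character pricing for tts-1-hd:

PRICE_PER_1000_CHARACTERS = 0.03 # placeholder, check OpenAI's current tts-1-hd pricing
estimated_cost = input_text.size / 1000.0 * PRICE_PER_1000_CHARACTERS
puts "Synthesizing #{input_text.size} characters will cost roughly $#{estimated_cost.round(2)}"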

Posted by Michael Leimstädtner to makandra dev (2023-12-11 13:41)