OpenAI TTS: How to generate audio samples with more than 4096 characters


OpenAI currently limits the input of its text-to-speech (audio generation) API endpoint to 4096 characters per request.
You can work around that limit by splitting the text into smaller fragments and stitching the resulting mp3 files together with a CLI tool like mp3wrap or ffmpeg.
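
The splitting step on its own can be sketched as a small pure function. This is a minimal sketch, not the implementation used further down: `split_for_tts` is a hypothetical helper name, and it cuts back to the last full stop inside the limit (falling back to a hard cut), whereas the class below reads ahead to the end of the current sentence.

```ruby
# Hypothetical helper, not part of any library: splits text into chunks of at
# most `limit` characters, preferring to cut right after the last period.
def split_for_tts(text, limit: 4096)
  chunks = []
  rest = text.dup
  until rest.empty?
    # The remainder fits into a single request, so we are done.
    if rest.length <= limit
      chunks << rest
      break
    end
    window = rest[0, limit]
    # Index of the last period inside the window, or nil if there is none.
    cut = window.rindex(".")
    # Fall back to a hard cut at the limit when no sentence end is found.
    length = cut ? cut + 1 : limit
    chunks << rest.slice!(0, length)
  end
  chunks
end
```

Joining the returned chunks yields the original text again, so nothing is dropped between fragments.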

Example Ruby Implementation

Usage

input_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Mi eget mauris pharetra et ultrices neque."
output_mp3_path = Rails.root.join("tts/ipsum.mp3")
Synthesizer.new(input_text, output_mp3_path).to_mp3

Code


require "fileutils"
require "open3"
require "openai" # provided by the ruby-openai gem
require "active_support/all"

##
# This class synthesizes speech for smaller text chunks (roughly 3800 characters each)
# and combines them afterwards.
# To avoid artifacts like cut-off words between the chunks we try to find the end of the current
# sentence of each text fragment.
#
class Synthesizer
  OPENAI_CHARACTER_LIMIT = 4096
  OPENAI_MODEL = "tts-1-hd"
  OPENAI_VOICE = "alloy"
  TEXT_FRAGMENT_BUFFER = 300
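  # Matches everything from the start of the string up to and including the first period,
  # i.e. the rest of the sentence that was cut off by the fragment boundary.
  SENTENCE_REGEX = /\A(?<sentence>[^.]*\.)/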

  def initialize(input_text, output_mp3_path)
    @input_text = input_text
    @output_mp3_path = output_mp3_path
  end

  def to_mp3
    mp3_fragment_paths = synthesize_fragments
    combine_files(mp3_fragment_paths)
    FileUtils.rm(mp3_fragment_paths)
    @output_mp3_path
  end

  private

  def synthesize_fragments
    text = @input_text.dup
    mp3_fragment_paths = []
    while text.present? do
      text_fragment = text[0, OPENAI_CHARACTER_LIMIT - TEXT_FRAGMENT_BUFFER]
      text.delete_prefix!(text_fragment)
      if SENTENCE_REGEX.match?(text)
        sentence_rest = SENTENCE_REGEX.match(text)[:sentence]
        raise ArgumentError, "Adding the remaining sentence would exceed OpenAI's TTS limit" if sentence_rest.size > TEXT_FRAGMENT_BUFFER
        text.delete_prefix!(sentence_rest)
        text_fragment << sentence_rest
      end
      puts "Generating audio fragment with #{text_fragment.size} characters (remaining: #{text.size})"
      tts_response = openapi_client.audio.speech(
        parameters: {
          model: OPENAI_MODEL,
          input: text_fragment,
          voice: OPENAI_VOICE,
        }
      )
      mp3_fragment_path = @output_mp3_path.sub(/\.mp3$/, "#{Time.now.to_i}_#{mp3_fragment_paths.count}.mp3")
      File.binwrite(mp3_fragment_path, tts_response)
      mp3_fragment_paths << mp3_fragment_path
    end
    mp3_fragment_paths
  end

  def combine_files(mp3_fragment_paths)
    # Pass arguments separately so paths containing spaces are not split by a shell.
    _stdout_string, stderr_string, status = Open3.capture3("mp3wrap", @output_mp3_path.to_s, *mp3_fragment_paths.map(&:to_s))
    raise "Could not combine mp3s: #{stderr_string}" unless status.success?
  end

  def openapi_client
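    # Reuse one client for all fragments; ENV.fetch fails early if the token is missing.
    @openapi_client ||= OpenAI::Client.new(access_token: ENV.fetch("API_TOKEN"))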
  end
end
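
If you prefer ffmpeg over mp3wrap, the combining step could look roughly like the sketch below. It relies on ffmpeg's concat demuxer (`-f concat -safe 0 -i list.txt -c copy`), which reads the input files from a list file. The helper names `ffmpeg_concat_list` and `combine_with_ffmpeg` are made up for this example and are not part of the class above; the sketch also assumes a Unix-like system with `ffmpeg` on the PATH.

```ruby
require "open3"
require "tempfile"

# Hypothetical helper: builds the list file content read by ffmpeg's concat demuxer.
# Paths are made absolute because relative entries are resolved against the
# location of the list file, not the current working directory.
def ffmpeg_concat_list(paths)
  paths.map do |path|
    # Single quotes inside a path are written as '\'' in the list file.
    escaped = File.expand_path(path).gsub("'") { "'\\''" }
    "file '#{escaped}'\n"
  end.join
end

# Hypothetical replacement for the mp3wrap call: concatenates the fragments
# without re-encoding them (-c copy) and overwrites an existing output (-y).
def combine_with_ffmpeg(fragment_paths, output_path)
  Tempfile.create(["fragments", ".txt"]) do |list|
    list.write(ffmpeg_concat_list(fragment_paths))
    list.flush
    _stdout, stderr, status = Open3.capture3(
      "ffmpeg", "-y", "-f", "concat", "-safe", "0",
      "-i", list.path, "-c", "copy", output_path.to_s
    )
    raise "Could not combine mp3s: #{stderr}" unless status.success?
  end
  output_path
end
```

Stream copying only works cleanly because all fragments come from the same model and voice and therefore share the same encoding parameters.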

Note

Use at your own risk. If you plan to use it in production, please add some tests: the code above was written as part of a hackathon.
The tts-1-hd model used above is quite expensive. Be careful when calling it in a while loop as above.
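
One way to stay on the safe side is to estimate the number of requests and the cost before entering the loop. The sketch below assumes that billing is based on the number of input characters; `estimate_tts_cost` is a hypothetical helper, and the price is deliberately a parameter that you need to fill in from OpenAI's current pricing page.

```ruby
# Hypothetical helper: rough estimate of requests and cost for a given text.
# chars_per_request defaults to 4096 - 300, the fragment size used by the class above.
def estimate_tts_cost(text, price_per_million_chars:, chars_per_request: 3796)
  characters = text.length
  requests = (characters.to_f / chars_per_request).ceil
  cost = characters * price_per_million_chars / 1_000_000.0
  { characters: characters, requests: requests, cost: cost }
end

# The price below is a placeholder value, not an official figure.
estimate = estimate_tts_cost("a" * 10_000, price_per_million_chars: 30.0)
puts "#{estimate[:requests]} requests, estimated cost: #{estimate[:cost].round(2)}"
```

You could abort or ask for confirmation when the estimate exceeds a budget you are comfortable with.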

Michael Leimstädtner
License
Source code in this card is licensed under the MIT License.
Posted by Michael Leimstädtner to makandra dev (2023-12-11 12:41)