
OpenAI TTS: How to generate audio samples with more than 4096 characters

Michael Leimstädtner (Software engineer at makandra GmbH)
December 11, 2023

OpenAI currently limits its audio generation API endpoint to input texts of at most 4096 characters.
You can work around that limit by splitting the text into smaller fragments, synthesizing each fragment separately, and stitching the resulting mp3 files together with a CLI tool like mp3wrap or ffmpeg.
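
In a nutshell, the approach looks like this (a minimal sketch; long_text is a placeholder, the 3800-character fragment size just leaves some headroom below the limit, and sentence boundaries are ignored here, unlike in the full implementation below):

# Cut the text into fragments that stay safely below the 4096-character limit,
# synthesize each fragment on its own, then concatenate the resulting mp3 files.
fragments = long_text.chars.each_slice(3800).map(&:join)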

Example Ruby Implementation

Usage

input_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Mi eget mauris pharetra et ultrices neque."
output_mp3_path = Rails.root.join("tts/ipsum.mp3")
Synthesizer.new(input_text, output_mp3_path).to_mp3

Code


require "ruby-openai"
require "active_support/all"

##
# This class synthesizes speech for smaller text chunks (roughly 3800 characters each)
# and combines the resulting mp3 files afterwards.
# To avoid artifacts like cut-off words between chunks, each text fragment is extended
# to the end of the sentence it would otherwise cut off.
#
class Synthesizer
  OPENAI_CHARACTER_LIMIT = 4096
  OPENAI_MODEL = "tts-1-hd"
  OPENAI_VOICE = "alloy"
  TEXT_FRAGMENT_BUFFER = 300 # headroom so the rest of a cut-off sentence still fits below the limit
  SENTENCE_REGEX = /(?<sentence>.*?\.)/m # the rest of the current sentence, up to and including the next period

  def initialize(input_text, output_mp3_path)
    @input_text = input_text
    @output_mp3_path = output_mp3_path
  end

  def to_mp3
    mp3_fragment_paths = synthesize_fragments
    combine_files(mp3_fragment_paths)
    FileUtils.rm(mp3_fragment_paths)
    @output_mp3_path
  end

  private

  def synthesize_fragments
    text = @input_text.dup
    mp3_fragment_paths = []
    while text.present?
      text_fragment = text[0, OPENAI_CHARACTER_LIMIT - TEXT_FRAGMENT_BUFFER]
      text.delete_prefix!(text_fragment)
      if SENTENCE_REGEX.match?(text)
        sentence_rest = SENTENCE_REGEX.match(text)[:sentence]
        raise ArgumentError, "Adding the remaining sentence would exceed OpenAI's TTS limit" if sentence_rest.size > TEXT_FRAGMENT_BUFFER
        text.delete_prefix!(sentence_rest)
        text_fragment << sentence_rest
      end
      puts "Generating audio fragment with #{text_fragment.size} characters (remaining: #{text.size})"
      tts_response = openai_client.audio.speech(
        parameters: {
          model: OPENAI_MODEL,
          input: text_fragment,
          voice: OPENAI_VOICE,
        }
      )
      mp3_fragment_path = @output_mp3_path.sub(/\.mp3$/, "#{Time.now.to_i}_#{mp3_fragment_paths.count}.mp3")
      File.binwrite(mp3_fragment_path, tts_response)
      mp3_fragment_paths << mp3_fragment_path
    end
    mp3_fragment_paths
  end

  def combine_files(mp3_fragment_paths)
    cli_command = "mp3wrap #{@output_mp3_path} #{mp3_fragment_paths.join(' ')}"
    _stdout_string, stderr_string, status = Open3.capture3(cli_command)
    raise "Could not combine mp3s: #{stderr_string}" unless status.exitstatus == 0
  end

  def openai_client
    OpenAI::Client.new(access_token: ENV["API_TOKEN"])
  end
end
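
If you prefer ffmpeg over mp3wrap, combine_files could use ffmpeg's concat protocol instead. This is only a sketch; it assumes all fragments share the same codec settings (which they do when they come from the same TTS model), so the audio streams can be copied without re-encoding:

  def combine_files(mp3_fragment_paths)
    cli_command = %(ffmpeg -y -i "concat:#{mp3_fragment_paths.join('|')}" -acodec copy #{@output_mp3_path})
    _stdout_string, stderr_string, status = Open3.capture3(cli_command)
    raise "Could not combine mp3s: #{stderr_string}" unless status.exitstatus == 0
  end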

Note

Use at your own risk. If you plan to use this in production, please add some tests first; the code above was written as part of a hackathon.
The tts-1-hd model used above is quite expensive. Be careful when calling it in a loop like the one above.
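
If you want a rough idea of the cost upfront, you can estimate it from the character count before synthesizing. This is only a sketch; the price constant is a placeholder, so look up OpenAI's current per-character pricing for tts-1-hd:

PRICE_PER_1000_CHARACTERS = 0.03 # placeholder, check OpenAI's current tts-1-hd pricing
estimated_cost = input_text.size / 1000.0 * PRICE_PER_1000_CHARACTERS
puts "Synthesizing #{input_text.size} characters will cost roughly $#{estimated_cost.round(2)}"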

Posted by Michael Leimstädtner to makandra dev (2023-12-11 13:41)