OpenAI currently limits its audio (text-to-speech) API endpoint to input texts with a maximum of 4096 characters.
You can work around that limit by splitting the text into smaller fragments and stitching the resulting mp3 files together with a CLI tool like mp3wrap or ffmpeg.
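The splitting step can be sketched on its own before looking at the full class below. This is a minimal sketch under our own assumptions (the helper name `split_into_fragments` is hypothetical, not part of any API): it greedily packs whole sentences into fragments that stay under the character limit, so no sentence is cut off mid-word.

```ruby
OPENAI_CHARACTER_LIMIT = 4096

# Hypothetical helper: pack whole sentences into fragments that each stay
# under the given character limit (assumes no single sentence exceeds it).
def split_into_fragments(text, limit = OPENAI_CHARACTER_LIMIT)
  fragments = [""]
  # Treat everything up to a period as one sentence; keep any trailing rest.
  text.scan(/[^.]*\.|[^.]+\z/).each do |sentence|
    fragments << "" if !fragments.last.empty? && fragments.last.size + sentence.size > limit
    fragments.last << sentence
  end
  fragments
end
```

For example, `split_into_fragments("One. Two. Three.", 10)` keeps the first two sentences in one fragment and starts a new fragment for the third.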
Example Ruby Implementation
Usage
input_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Mi eget mauris pharetra et ultrices neque."
output_mp3_path = Rails.root.join("tts/ipsum.mp3")
Synthesizer.new(input_text, output_mp3_path).to_mp3
Code
require "ruby-openai"
require "active_support/all"
require "open3"
require "fileutils"
##
# This class synthesizes speech for smaller text chunks (roughly 3800 characters each)
# and combines the resulting mp3 files afterwards.
# To avoid artifacts like cut-off words between chunks, we try to extend each
# text fragment to the end of its current sentence.
#
class Synthesizer
OPENAI_CHARACTER_LIMIT = 4096
OPENAI_MODEL = "tts-1-hd"
OPENAI_VOICE = "alloy"
TEXT_FRAGMENT_BUFFER = 300
SENTENCE_REGEX = /(?<sentence>.*?\.)/ # lazy: match only up to the next sentence end
def initialize(input_text, output_mp3_path)
@input_text = input_text
@output_mp3_path = output_mp3_path
end
def to_mp3
mp3_fragment_paths = synthesize_fragments
combine_files(mp3_fragment_paths)
FileUtils.rm(mp3_fragment_paths)
@output_mp3_path
end
private
def synthesize_fragments
text = @input_text.dup
mp3_fragment_paths = []
while text.present?
text_fragment = text[0, OPENAI_CHARACTER_LIMIT - TEXT_FRAGMENT_BUFFER]
text.delete_prefix!(text_fragment)
if SENTENCE_REGEX.match?(text)
sentence_rest = SENTENCE_REGEX.match(text)[:sentence]
raise ArgumentError, "Adding the remaining sentence would exceed OpenAI's TTS limit" if sentence_rest.size > TEXT_FRAGMENT_BUFFER
text.delete_prefix!(sentence_rest)
text_fragment << sentence_rest
end
puts "Generating audio fragment with #{text_fragment.size} characters (remaining: #{text.size})"
tts_response = openapi_client.audio.speech(
parameters: {
model: OPENAI_MODEL,
input: text_fragment,
voice: OPENAI_VOICE,
}
)
mp3_fragment_path = @output_mp3_path.sub(/\.mp3$/, "_#{Time.now.to_i}_#{mp3_fragment_paths.count}.mp3")
File.binwrite(mp3_fragment_path, tts_response)
mp3_fragment_paths << mp3_fragment_path
end
mp3_fragment_paths
end
def combine_files(mp3_fragment_paths)
cli_command = "mp3wrap #{@output_mp3_path} #{mp3_fragment_paths.join(' ')}"
_stdout_string, stderr_string, status = Open3.capture3(cli_command)
raise "Could not combine mp3s: #{stderr_string}" unless status.exitstatus == 0
end
def openapi_client
OpenAI::Client.new(access_token: ENV["API_TOKEN"])
end
end
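If you prefer ffmpeg over mp3wrap, combine_files could use ffmpeg's concat demuxer instead, which joins the fragments without re-encoding ("-c copy"). The sketch below is our own variant, not part of the class above: it assumes the ffmpeg binary is installed and on the PATH, and the method names are hypothetical.

```ruby
require "open3"
require "tempfile"

# Builds the "file '<path>'" list that ffmpeg's concat demuxer expects.
def concat_list_for(mp3_fragment_paths)
  mp3_fragment_paths.map { |path| "file '#{path}'" }.join("\n")
end

# Hypothetical alternative to combine_files: join the fragments with ffmpeg
# instead of mp3wrap. Assumes ffmpeg is installed and on the PATH.
def combine_with_ffmpeg(mp3_fragment_paths, output_mp3_path)
  Tempfile.create(["fragments", ".txt"]) do |list|
    list.write(concat_list_for(mp3_fragment_paths))
    list.flush

    _stdout, stderr, status = Open3.capture3(
      "ffmpeg", "-y", "-f", "concat", "-safe", "0",
      "-i", list.path, "-c", "copy", output_mp3_path.to_s
    )
    raise "Could not combine mp3s: #{stderr}" unless status.success?
  end
end
```

Passing the command as separate arguments to Open3.capture3 also avoids shell interpolation of the file paths, which the string-based mp3wrap call above does not.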
Note
Use at your own risk. If you plan to use it in production, please add some tests - the code above was written as part of a hackathon.
The tts-1-hd model used above is quite expensive. Be careful when talking to it in a while loop like the one above.
Posted by Michael Leimstädtner to makandra dev (2023-12-11 12:41)