It's 2024 and we have tools like ffmpeg, imagemagick and GPT readily available. With them, it's easy to convert text, images, audio and video clips into each other.
For everyday use without any parameter tweaking, I keep a collection of tiny scripts in my ~/bin
folder that can then be used like bash functions. And: it's faster to use the CLI than interacting with a website, and cheaper to use the API than buying GPT Plus. :-)
Usage
text-to-image "parmiggiano cheese wedding cake, digital art"
text-to-audio "Yesterday I ate some tasty parmiggiano cheese at a wedding. It was the cake!" cake.mp3
audio-to-text /path/to/cake.mp3
image-to-text /path/to/cake.jpg
video-to-text /path/to/cake.mp4
video-to-video /path/to/rickroll.mov rickroll.mp4
video-to-audio /path/to/cake.mp4 cake.mp3
audio-to-audio /path/to/cake.mp3 cake.aac
image-to-image /path/to/cake.png cake.jpg
The possible conversions, as a Mermaid state diagram:
stateDiagram-v2
text --> image: Dall-E 3
text --> audio: GPT TTS
image --> text: GPT Vision
audio --> audio: ffmpeg
audio --> text: GPT STT
video --> text: GPT STT
video --> audio: ffmpeg
video --> video: ffmpeg
image --> image: imagemagick
Prerequisites
- ~/bin should be part of your $PATH
- The ENV variable $OPENAI_API_KEY must be populated with a valid and charged API key
- The Ruby version used by the scripts must run gem install ruby-openai once
- For video-to-X you need an ffmpeg binary in your $PATH
- For image-to-X you need a convert binary (imagemagick) in your $PATH
- The files below must be executable (chmod +x)
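To sanity-check these prerequisites in one go, a small shell function might look like this (a sketch; check_prereqs is a made-up name and it only tests for the tools this post uses):

```shell
# Sketch: verify the prerequisites listed above.
# Prints one line per missing item, or "ok" when everything is in place.
check_prereqs() {
  ok=0
  case ":$PATH:" in
    *":$HOME/bin:"*) ;;
    *) echo "missing: ~/bin in \$PATH"; ok=1 ;;
  esac
  [ -n "$OPENAI_API_KEY" ] || { echo "missing: OPENAI_API_KEY"; ok=1; }
  command -v ffmpeg >/dev/null 2>&1 || { echo "missing: ffmpeg"; ok=1; }
  command -v convert >/dev/null 2>&1 || { echo "missing: convert (imagemagick)"; ok=1; }
  [ "$ok" -eq 0 ] && echo "ok"
}
```

Call check_prereqs once after setting things up; it exits non-zero if anything is missing.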
Scripts
Note
All GPT commands below cost money. Not much though, most of the time less than one cent!
~/bin/text-to-image
#!/usr/bin/env ruby
require 'openai'
prompt = ARGV[0]
if prompt.to_s.strip == ''
puts 'Usage: text-to-image "parmiggiano cheese wedding cake, digital art"'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.images.generate(parameters: { prompt: prompt, model: 'dall-e-3', size: '1024x1024' }).dig("data", 0, "url")
~/bin/text-to-audio
#!/usr/bin/env ruby
require 'openai'
prompt = ARGV[0]
output_path = ARGV[1] || 'output.mp3'
if prompt.to_s.strip == ''
puts 'Usage: text-to-audio "Yesterday I ate some tasty parmiggiano cheese at a wedding. It was the cake!" cake.mp3'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
response = client.audio.speech(parameters: { input: prompt, model: 'tts-1', voice: 'alloy' })
File.binwrite(output_path, response)
puts "You can find the TTS result at #{output_path}"
~/bin/audio-to-text
#!/usr/bin/env ruby
require 'openai'
audio_path = ARGV[0]
if audio_path.to_s.strip == ''
puts 'Usage: audio-to-text /path/to/techno.mp3'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.audio.transcribe(parameters: { model: 'whisper-1', file: File.open(audio_path, 'rb') }).dig("text")
~/bin/image-to-text
#!/usr/bin/env ruby
require 'openai'
require 'base64'
image_path = ARGV[0]
if image_path.to_s.strip == ''
puts 'Usage: image-to-text /path/to/cake.jpg'
exit
end
base64_image = Base64.strict_encode64(File.binread(image_path))
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.chat(parameters: {
model: 'gpt-4-vision-preview',
messages: [{
role: 'user',
content: [
{ "type": "text", "text": "What’s in this image?"},
{ "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,#{base64_image}" }},
]
}]
}).dig("choices", 0, "message", "content")
~/bin/video-to-text
GPT is also able to transcribe videos directly, so this is just an alias for audio-to-text.
But: it's probably cheaper and faster to run video-to-audio first and then pipe the result through audio-to-text.
ln -s ~/bin/audio-to-text ~/bin/video-to-text
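The cheaper two-step route can be wrapped in a small helper, sketched here as a shell function (video_to_text_cheap is a made-up name; it assumes the scripts from this post are executable and on your $PATH):

```shell
# Hypothetical helper: extract the audio track locally with ffmpeg (free),
# then transcribe only the much smaller audio file via audio-to-text.
video_to_text_cheap() {
  video="$1"
  tmp="$(mktemp).mp3"
  video-to-audio "$video" "$tmp" >/dev/null &&
    audio-to-text "$tmp"
  rm -f "$tmp"
}
```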
~/bin/video-to-video
#!/usr/bin/env ruby
require 'shellwords' # escape paths so filenames with spaces don't break the command
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: video-to-video /path/to/rickroll.mov rickroll.mp4'
exit
end
`ffmpeg -i #{input_path.shellescape} #{output_filename.shellescape}`
puts "File transcoded to #{output_filename}"
~/bin/video-to-audio
#!/usr/bin/env ruby
require 'shellwords' # escape paths so filenames with spaces don't break the command
video_path = ARGV[0]
output_filename = ARGV[1] || 'output.mp3'
if video_path.to_s.strip == ''
puts 'Usage: video-to-audio /path/to/rickroll.mp4 rickroll.mp3'
exit
end
`ffmpeg -i #{video_path.shellescape} -vn -acodec libmp3lame -q:a 4 #{output_filename.shellescape}`
puts "File transcoded to #{output_filename}"
~/bin/audio-to-audio
#!/usr/bin/env ruby
require 'shellwords' # escape paths so filenames with spaces don't break the command
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: audio-to-audio /path/to/rickroll.mp3 rickroll.aac'
exit
end
`ffmpeg -i #{input_path.shellescape} -vn -q:a 4 #{output_filename.shellescape}`
puts "File transcoded to #{output_filename}"
~/bin/image-to-image
#!/usr/bin/env ruby
require 'shellwords' # escape paths so filenames with spaces don't break the command
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: image-to-image /path/to/cake.png cake.jpg'
exit
end
`convert #{input_path.shellescape} #{output_filename.shellescape}`
puts "File transcoded to #{output_filename}"
Posted by Michael Leimstädtner to makandra dev (2024-03-14 16:06)