It's 2024 and we have tools like ffmpeg, ImageMagick and GPT readily available. With them, it's easy to convert text, images, audio and video clips into each other.
For everyday use without any parameter tweaking, I'm using a collection of tiny scripts in my ~/bin
folder that can then be invoked like any shell command. And: it's faster to use the CLI than interacting with a website, and cheaper to use the API than buying GPT Plus. :-)
text-to-image "parmiggiano cheese wedding cake, digital art"
text-to-audio "Yesterday I ate some tasty parmiggiano cheese at a wedding. It was the cake!" cake.mp3
audio-to-text /path/to/cake.mp3
image-to-text /path/to/cake.jpg
video-to-text /path/to/cake.mp4
video-to-video /path/to/rickroll.mov rickroll.mp4
video-to-audio /path/to/cake.mp4 cake.mp3
audio-to-audio /path/to/cake.mp3 cake.aac
image-to-image /path/to/cake.png cake.jpg
stateDiagram-v2
text --> image: Dall-E 3
text --> audio: GPT TTS
image --> text: GPT Vision
audio --> audio: ffmpeg
audio --> text: GPT STT
video --> text: GPT STT
video --> audio: ffmpeg
video --> video: ffmpeg
image --> image: imagemagick
Prerequisites:
- ~/bin should be part of your $PATH
- $OPENAI_API_KEY must be populated with a valid and charged API key
- gem install ruby-openai (once)
- For video-to-X you need a ffmpeg binary in your $PATH
- For image-to-X you need a convert binary (ImageMagick) in your $PATH
- All scripts need to be executable (chmod +x)
Note
All GPT commands below cost money. Not much though, most of the time less than one cent!
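To sanity-check the setup, a small helper can walk $PATH the same way the shell does when it resolves these scripts and the ffmpeg/convert binaries. This is only a sketch; the helper name `executable_in_path?` is made up for illustration:

```ruby
# Minimal sketch: check whether a binary (e.g. ffmpeg or convert) is
# reachable through $PATH, mirroring the shell's lookup.
def executable_in_path?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end
```

`executable_in_path?('ffmpeg')` should return true once ffmpeg is installed and ~/bin (plus the usual system directories) is on your $PATH.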
~/bin/text-to-image
#!/usr/bin/env ruby
require 'openai'
prompt = ARGV[0]
if prompt.to_s.strip == ''
puts 'Usage: text-to-image "parmiggiano cheese wedding cake, digital art"'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.images.generate(parameters: { prompt: prompt, model: 'dall-e-3', size: '1024x1024' }).dig("data", 0, "url")
~/bin/text-to-audio
#!/usr/bin/env ruby
require 'openai'
prompt = ARGV[0]
output_path = ARGV[1] || 'output.mp3'
if prompt.to_s.strip == ''
puts 'Usage: text-to-audio "Yesterday I ate some tasty parmiggiano cheese at a wedding. It was the cake!" cake.mp3'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
response = client.audio.speech(parameters: { input: prompt, model: 'tts-1', voice: 'alloy' })
File.binwrite(output_path, response)
puts "You can find the TTS result at #{output_path}"
~/bin/audio-to-text
#!/usr/bin/env ruby
require 'openai'
audio_path = ARGV[0]
if audio_path.to_s.strip == ''
puts 'Usage: audio-to-text /path/to/techno.mp3'
exit
end
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.audio.transcribe(parameters: { model: 'whisper-1', file: File.open(audio_path, 'rb') }).dig("text")
~/bin/image-to-text
#!/usr/bin/env ruby
require 'openai'
require 'base64'
image_path = ARGV[0]
if image_path.to_s.strip == ''
puts 'Usage: image-to-text /path/to/cake.jpg'
exit
end
# strict_encode64 avoids the line breaks that would corrupt the data URL below
base64_image = Base64.strict_encode64(File.binread(image_path))
client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY'))
puts client.chat(parameters: {
model: 'gpt-4-vision-preview',
messages: [{
role: 'user',
content: [
{ "type": "text", "text": "What’s in this image?"},
{ "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,#{base64_image}" }},
]
}]
}).dig("choices", 0, "message", "content")
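One pitfall when building the data URL: Ruby's `Base64.encode64` inserts a line break every 60 characters, which corrupts a `data:` URL, while `strict_encode64` produces one unbroken string. A quick sketch:

```ruby
require 'base64'

bytes = 'x' * 100                       # stand-in for binary image data
wrapped = Base64.encode64(bytes)        # contains "\n" every 60 characters
strict  = Base64.strict_encode64(bytes) # one unbroken line, safe for data URLs

data_url = "data:image/jpeg;base64,#{strict}"
```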
~/bin/video-to-text
GPT is also able to transcribe videos, so this is just an alias for audio-to-text. But: it's probably cheaper and faster to use video-to-audio first and then pass the resulting audio file to audio-to-text.
ln -s ~/bin/audio-to-text ~/bin/video-to-text
~/bin/video-to-video
#!/usr/bin/env ruby
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: video-to-video /path/to/rickroll.mov rickroll.mp4'
exit
end
# Pass arguments as a list so the shell never splits paths with spaces
system('ffmpeg', '-i', input_path, output_filename) or abort 'ffmpeg failed'
puts "File transcoded to #{output_filename}"
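A note on the design: interpolating raw file paths into a shell command breaks as soon as a path contains spaces or quotes. Two common safeguards in Ruby, shown as a sketch:

```ruby
require 'shellwords'

input = 'my video.mov'

# Option 1: bypass the shell entirely by giving system() an argument list:
#   system('ffmpeg', '-i', input, 'out.mp4')

# Option 2: escape each argument before building a single command string
command = ['ffmpeg', '-i', input, 'out.mp4'].shelljoin
```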
~/bin/video-to-audio
#!/usr/bin/env ruby
video_path = ARGV[0]
output_filename = ARGV[1] || 'output.mp3'
if video_path.to_s.strip == ''
puts 'Usage: video-to-audio /path/to/rickroll.mp4 rickroll.mp3'
exit
end
# Pass arguments as a list so the shell never splits paths with spaces
system('ffmpeg', '-i', video_path, '-vn', '-acodec', 'libmp3lame', '-q:a', '4', output_filename) or abort 'ffmpeg failed'
puts "File transcoded to #{output_filename}"
~/bin/audio-to-audio
#!/usr/bin/env ruby
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: audio-to-audio /path/to/rickroll.mp3 rickroll.aac'
exit
end
# Pass arguments as a list so the shell never splits paths with spaces
system('ffmpeg', '-i', input_path, '-vn', '-q:a', '4', output_filename) or abort 'ffmpeg failed'
puts "File transcoded to #{output_filename}"
~/bin/image-to-image
#!/usr/bin/env ruby
input_path = ARGV[0]
output_filename = ARGV[1]
if input_path.to_s.strip == '' || output_filename.to_s.strip == ''
puts 'Usage: image-to-image /path/to/cake.png cake.jpg'
exit
end
# Pass arguments as a list so the shell never splits paths with spaces
system('convert', input_path, output_filename) or abort 'convert failed'
puts "File converted to #{output_filename}"
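ImageMagick's convert accepts options between the input and output file (e.g. -resize 50%), so the script could forward any extra CLI arguments. A hypothetical extension; the helper name `convert_command` is made up for illustration:

```ruby
# Hypothetical extension: build the convert invocation as an argument list,
# forwarding extra options (convert expects them between input and output).
def convert_command(input, output, extra = [])
  ['convert', input, *extra, output]
end

# image-to-image cake.png cake.jpg -resize 50%  would then run:
cmd = convert_command('cake.png', 'cake.jpg', ['-resize', '50%'])
```

In the script itself, `system(*convert_command(input_path, output_filename, ARGV[2..].to_a))` would execute it.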