Ruby: Fixing strings with invalid encoding and converting to UTF-8

When dealing with external data sources, you may have to deal with improperly encoded strings.
While you should prefer agreeing on a single encoding with the data-providing party, you cannot always force that on external sources.
It gets worse when you receive data with an encoding declaration that does not reliably match the accompanying string bytes.
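
To illustrate such a mismatch, here is a minimal sketch. The variable names are made up for this example; it only uses `String#force_encoding`, `String#valid_encoding?` and `String#encode`.

```ruby
# Bytes of "Grüße" as written in Windows-1252, but labelled as UTF-8.
mislabelled = "Gr\xFC\xDFe".dup.force_encoding('UTF-8')

# The declaration says UTF-8, but the bytes are not valid UTF-8.
declared_valid = mislabelled.valid_encoding? # false

# Re-labelling the same bytes with the encoding they were written in
# makes them convertible. force_encoding only changes the label,
# encode actually transcodes the bytes.
relabelled = mislabelled.dup.force_encoding('Windows-1252')
fixed = relabelled.encode('UTF-8') # "Grüße"
```

The hard part in practice is that you usually do not know which encoding the bytes were really written in.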

Here is a Ruby class that helps convert such strings to a proper encoding.
Note that it tries several approaches to changing the encoding, in order, and uses the first result that is valid. This is not a silver bullet and may or may not work in your case. It did work for me.

Be careful when adding extra encodings. Byte sequences which you would consider invalid might suddenly make sense for #encode.
This very likely works best when your strings are in a few languages that share a similar character set, e.g. English and German, and will probably not work well when you need to support different character sets.
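
As a sketch of why extra encodings can be risky: ISO-8859-1 is my own example here (it is not part of the class below). It assigns a character to every possible byte, so a conversion from it never fails, even for bytes that are almost certainly garbage.

```ruby
garbage = "Gr\x80\x81e".b # binary string with two suspicious bytes

# Windows-1252 has no character for \x81, so this conversion is rejected.
windows_1252_result =
  begin
    garbage.encode('UTF-8', 'Windows-1252')
  rescue Encoding::UndefinedConversionError
    nil
  end

# ISO-8859-1 maps every byte to a code point, so this always "works".
# The two bytes silently become the invisible control characters U+0080 and U+0081.
latin1_result = garbage.encode('UTF-8', 'ISO-8859-1')
latin1_codepoints = latin1_result.codepoints # [71, 114, 128, 129, 101]
```

If such an encoding came early in the list of attempts, later and possibly better attempts would never be reached.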

If you do not care too much about occasionally missing characters, maybe just encode strings with placeholders (see "last resort" below) and ignore the rest.
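
A minimal sketch of that placeholder-only approach. The helper name `utf8_with_placeholders` is hypothetical and only used for this example; the `invalid:`, `undef:` and `replace:` options belong to `String#encode`.

```ruby
# Hypothetical helper (not part of EncodingFixer): does not raise on bad bytes,
# but loses every character it cannot convert.
def utf8_with_placeholders(input, placeholder = "\uFFFD")
  input.encode('UTF-8', invalid: :replace, undef: :replace, replace: placeholder)
end

binary = "Gr\x80\x81e".b

with_default = utf8_with_placeholders(binary)             # "Gr��e"
with_question_marks = utf8_with_placeholders(binary, '?') # "Gr??e"
```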

class EncodingFixer

  def initialize(input)
    @input = input
  end

  attr_reader :input

  def utf8
    output = input if input.encoding.name == 'UTF-8' && input.valid_encoding?

    # Try converting from the string's given encoding.
    output ||= try_conversion { input.encode('UTF-8') }

    # Strings are sometimes composed of UTF-8 bytes while not using UTF-8 encoding.
    # Work on a copy so the caller's string is not mutated.
    output ||= try_conversion { input.dup.force_encoding('UTF-8') }

    # Try interpreting input as Windows-1252, because we've seen many such strings with incorrect encoding.
    output ||= try_conversion { input.encode('UTF-8', 'Windows-1252') }

    # Add any extra conversions that might make sense in your case.
    output ||= try_conversion { input.encode('UTF-8', 'ASCII-8BIT') }
    output ||= try_conversion { input.encode('UTF-8', 'US-ASCII') }

    # As a last resort, replace any unknown characters with a placeholder: �
    output ||= try_conversion { input.encode('UTF-8', invalid: :replace, undef: :replace) }

    output
  end

  private

  def try_conversion(&block)
    string = yield
    string if string.valid_encoding?
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil # conversion failed, let the caller try the next approach
  end

end

Here is a spec that goes along with it.

describe EncodingFixer do

  describe '#utf8' do
    matcher :convert_to_utf8 do |output|
      match do |subject|
        result = subject.utf8

        expect(result.encoding.name).to eq('UTF-8')
        expect(result).to eq(output)
      end
    end

    context 'when given a UTF-8 string' do
      subject { described_class.new('Grüße') }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with non-UTF-8 encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('Windows-1252')) }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with incorrect encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('ASCII-8BIT')) } # actually Windows-1252
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with invalid characters' do
      subject { described_class.new("Gr\x80\x81e".force_encoding('ASCII-8BIT')) } # never valid
      it { is_expected.to convert_to_utf8('Gr��e') }
    end
  end

end

Arne Hartherz 3 months ago