Ruby: Fixing strings with invalid encoding and converting to UTF-8


When dealing with external data sources, you may have to deal with improperly encoded strings.
While you should prefer agreeing on a single encoding with the data-providing party, you cannot always force that on external sources.
It gets worse when you receive data with an encoding declaration that does not reliably match the accompanying bytes.
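
To make the problem concrete, here is a minimal sketch of the same bytes under a wrong and a right declaration. The variable names are only illustrative, and the sample bytes are the Windows-1252 form of "Grüße" that the spec further down also uses.

```ruby
# Bytes of "Grüße" in Windows-1252; String#b returns a binary (ASCII-8BIT) copy.
raw = "Gr\xFC\xDFe".b

# Wrongly declared as UTF-8: the label changes, the bytes do not.
mislabeled = raw.dup.force_encoding('UTF-8')
mislabeled.valid_encoding? # => false

# Correctly declared: the same bytes are valid and can be transcoded.
declared = raw.dup.force_encoding('Windows-1252')
declared.valid_encoding? # => true

converted = declared.encode('UTF-8')
converted # => "Grüße"
```

A declaration that does not match the bytes is therefore only detectable when the bytes happen to be invalid in the declared encoding.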

Here is a Ruby class that helps convert such strings to a proper encoding.
Note that it tries several conversions in order and returns the first result that is valid UTF-8. This is not a silver bullet and may or may not work in your case. It did work for me.

Be careful when adding extra encodings. Byte sequences which you would consider invalid might suddenly make sense for #encode.
This very likely works best when your strings are in a few languages that share a similar character set, e.g. English and German, and will probably not work well when you need to support different character sets.
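
As an illustration of that risk, the sketch below assumes you were tempted to add ISO-8859-1 as an extra fallback (it is not part of the class). Windows-1252 rejects the byte 0x81, but ISO-8859-1 assigns a code point to every byte, so clearly broken input converts "successfully".

```ruby
garbage = "Gr\x80\x81e".b

# Windows-1252 leaves byte 0x81 undefined, so the conversion fails loudly.
begin
  garbage.encode('UTF-8', 'Windows-1252')
  windows_1252_failed = false
rescue Encoding::UndefinedConversionError
  windows_1252_failed = true
end
windows_1252_failed # => true

# ISO-8859-1 maps every byte to a code point, so the same bytes are accepted
# and silently become invisible C1 control characters (U+0080, U+0081).
latin1 = garbage.encode('UTF-8', 'ISO-8859-1')
latin1.valid_encoding? # => true
latin1.codepoints      # => [71, 114, 128, 129, 101]
```

Once such a permissive encoding is in the chain, no later fallback is ever reached.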

If you do not care too much about occasionally missing characters, maybe just encode strings with placeholders (see "last resort" below) and ignore the rest.
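
The placeholder-only approach can be sketched in a few lines. `lossy_utf8` is a hypothetical helper name, not something from this card; it wraps the same `encode` call that the class uses as its last resort.

```ruby
# Hypothetical helper: convert to UTF-8, replacing anything that cannot be
# converted with the Unicode replacement character (U+FFFD).
def lossy_utf8(string)
  string.encode('UTF-8', invalid: :replace, undef: :replace)
end

# Binary input: every non-ASCII byte becomes a placeholder.
lossy_utf8("Gr\x80\x81e".b) # => "Gr��e"

# Correctly declared input is converted without any loss.
lossy_utf8("Gr\xFC\xDFe".b.force_encoding('Windows-1252')) # => "Grüße"
```

The trade-off is that mislabeled but recoverable input, such as Windows-1252 bytes declared as binary, also ends up with placeholders.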

class EncodingFixer

  def initialize(input)
    @input = input
  end

  attr_reader :input

  def utf8
    output = input if input.encoding.name == 'UTF-8' && input.valid_encoding?

    # Try converting from the string's given encoding.
    output ||= try_conversion { input.encode('UTF-8') }

    # Strings are sometimes composed of UTF-8 bytes while not being declared as UTF-8.
    # Relabel a copy, because force_encoding would otherwise modify the caller's string.
    output ||= try_conversion { input.dup.force_encoding('UTF-8') }

    # Try interpreting input as Windows-1252, because we've seen many such strings with incorrect encoding.
    output ||= try_conversion { input.encode('UTF-8', 'Windows-1252') }

    # Add any extra conversions that might make sense in your case.
    output ||= try_conversion { input.encode('UTF-8', 'ASCII-8BIT') }
    output ||= try_conversion { input.encode('UTF-8', 'US-ASCII') }

    # As a last resort, replace any unknown characters with a placeholder: �
    output ||= try_conversion { input.encode('UTF-8', invalid: :replace, undef: :replace) }

    output
  end

  private

  def try_conversion
    string = yield
    string if string.valid_encoding?
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil
  end

end
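
One detail of the "UTF-8 bytes with the wrong label" step may be worth seeing in isolation: `force_encoding` relabels the receiver in place and never touches the bytes, whereas `encode` returns a new, transcoded string. A minimal sketch with made-up variable names:

```ruby
# UTF-8 bytes of "Grüße" that arrived labelled as binary.
received = 'Grüße'.b
received.encoding == Encoding::BINARY # => true

# Relabelling a copy leaves the original string untouched.
relabelled = received.dup.force_encoding('UTF-8')
relabelled == 'Grüße'                 # => true
received.encoding == Encoding::BINARY # => true

# Relabelling the string itself changes it for everyone holding a reference.
received.force_encoding('UTF-8')
received.encoding == Encoding::UTF_8  # => true
```

Working on a copy also avoids an error when the input string is frozen.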

Here is a spec that goes along with it.

describe EncodingFixer do

  describe '#utf8' do
    matcher :convert_to_utf8 do |output|
      match do |subject|
        result = subject.utf8

        expect(result.encoding.name).to eq('UTF-8')
        expect(result).to eq(output)
      end
    end

    context 'when given a UTF-8 string' do
      subject { described_class.new('Grüße') }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with non-UTF-8 encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('Windows-1252')) }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with incorrect encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('ASCII-8BIT')) } # actually Windows-1252
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with invalid characters' do
      subject { described_class.new("Gr\x80\x81e".force_encoding('ASCII-8BIT')) } # never valid
      it { is_expected.to convert_to_utf8('Gr��e') }
    end
  end

end
License
Source code in this card is licensed under the MIT License.
Posted by Arne Hartherz to makandra dev (2021-05-04 10:15)