Posted over 1 year ago. Visible to the public.

Ruby: Fixing strings with invalid encoding and converting to UTF-8

When consuming data from external sources, you may have to deal with improperly encoded strings.
While you should prefer agreeing on a single encoding with the data-providing party, you cannot always force that on external sources.
It gets worse when you receive data with an encoding declaration that does not reliably match the accompanying string bytes.
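
To make the problem concrete, here is a minimal sketch of such a mismatch. The byte string and variable names are made up for illustration: the bytes are Windows-1252, but the string is labeled as UTF-8.

```ruby
# Bytes for "Grüße" in Windows-1252, wrongly labeled as UTF-8 (illustrative data).
mislabeled = "Gr\xFC\xDFe".dup.force_encoding('UTF-8')

# The label says UTF-8, but the bytes are not a valid UTF-8 sequence.
label_matches_bytes = mislabeled.valid_encoding?

# Reinterpreting the same bytes as Windows-1252 and transcoding them
# yields readable UTF-8 text.
repaired = mislabeled.dup.force_encoding('Windows-1252').encode('UTF-8')

puts label_matches_bytes
puts repaired
```

In real data you usually do not know the correct source encoding up front, which is why guessing is needed at all.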

Here is a Ruby class that helps convert such strings to a proper encoding.
Note that it tries several approaches of changing the encoding. This is not a silver bullet and may or may not work in your case. It did work for me.

Be careful when adding extra encodings. Byte sequences which you would consider invalid might suddenly make sense for #encode.
This very likely works best when your strings are in a few languages that share a similar character set, e.g. English and German, and will probably not work well when you need to support different character sets.
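
The risk of adding encodings can be illustrated with ISO-8859-1, which assigns a character to every byte value. This is a sketch with made-up input; ISO-8859-1 is only an example of a permissive encoding and is not one of the encodings the class below tries.

```ruby
# Two high bytes that are unlikely to be intended text (illustrative data).
garbage = "Gr\x80\x81e".b

# Windows-1252 leaves 0x81 undefined, so this conversion fails.
windows_1252_result =
  begin
    garbage.encode('UTF-8', 'Windows-1252')
  rescue Encoding::UndefinedConversionError
    nil
  end

# ISO-8859-1 maps every byte to a character, so the same bytes "succeed"
# and silently turn into the control characters U+0080 and U+0081.
iso_result = garbage.encode('UTF-8', 'ISO-8859-1')

p windows_1252_result
p iso_result.valid_encoding?
p iso_result.codepoints
```

A permissive encoding placed early in the chain would therefore never let later attempts or the placeholder fallback run.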

If you do not care too much about occasionally missing characters, maybe just encode strings with placeholders (see "last resort" below) and ignore the rest.
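
As a rough sketch of that placeholder-only approach, the snippet below uses made-up input. `String#scrub` is not part of the class in this card; it is shown as a related standard-library method for strings that are already labeled UTF-8.

```ruby
# Illustrative input: binary-labeled bytes with no sensible interpretation.
input = "Gr\x80\x81e".b

# Transcode and substitute a placeholder for every byte that cannot be mapped.
# For Unicode targets the default placeholder is U+FFFD.
with_default = input.encode('UTF-8', invalid: :replace, undef: :replace)

# A custom placeholder can be passed with the :replace option.
with_question_marks =
  input.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

# For a string already labeled UTF-8, String#scrub replaces invalid bytes.
scrubbed = "abc\xFFdef".dup.force_encoding('UTF-8').scrub('?')

puts with_default
puts with_question_marks
puts scrubbed
```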

class EncodingFixer

  def initialize(input)
    @input = input
  end

  attr_reader :input

  def utf8
    output = input if input.encoding.name == 'UTF-8' && input.valid_encoding?

    # Try converting from the string's given encoding.
    output ||= try_conversion { input.encode('UTF-8') }

    # Strings are sometimes composed of UTF-8 bytes while not being labeled as UTF-8.
    # Work on a copy so the caller's string is not relabeled as a side effect.
    output ||= try_conversion { input.dup.force_encoding('UTF-8') }

    # Try interpreting input as Windows-1252, because we've seen many such strings with incorrect encoding.
    output ||= try_conversion { input.encode('UTF-8', 'Windows-1252') }

    # Add any extra conversions that might make sense in your case.
    output ||= try_conversion { input.encode('UTF-8', 'ASCII-8BIT') }
    output ||= try_conversion { input.encode('UTF-8', 'US-ASCII') }

    # As a last resort, replace any unknown characters with a placeholder: �
    output ||= try_conversion { input.encode('UTF-8', invalid: :replace, undef: :replace) }

    output
  end

  private

  # Returns the block's result if it is a validly encoded string, nil otherwise.
  def try_conversion
    string = yield
    string if string.valid_encoding?
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil
  end

end
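
The class mixes two different operations, `encode` and `force_encoding`. A small sketch of the difference, using made-up variable names and sample data: `force_encoding` only changes the label on the existing bytes, while `encode` rewrites the bytes for the target encoding.

```ruby
# "Grüße" as UTF-8 bytes, but labeled as binary (illustrative data).
utf8_bytes_wrong_label = "Gr\u00FC\u00DFe".b

# force_encoding changes only the label; the bytes stay the same.
relabeled = utf8_bytes_wrong_label.dup.force_encoding('UTF-8')

# encode rewrites the bytes so the same characters are expressed
# in the target encoding.
windows_1252 = relabeled.encode('Windows-1252')

p relabeled.bytes == utf8_bytes_wrong_label.bytes
p relabeled.bytesize
p windows_1252.bytesize
```

This is why relabeling can rescue a string whose bytes are fine but whose declaration is wrong, while transcoding is needed when the bytes themselves are in another encoding.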

Here is a spec that goes along with it.

describe EncodingFixer do

  describe '#utf8' do

    matcher :convert_to_utf8 do |output|
      match do |subject|
        result = subject.utf8
        expect(result.encoding.name).to eq('UTF-8')
        expect(result).to eq(output)
      end
    end

    context 'when given a UTF-8 string' do
      subject { EncodingFixer.new('Grüße') }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with non-UTF-8 encoding' do
      subject { EncodingFixer.new("Gr\xFC\xDFe".force_encoding('Windows-1252')) }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with incorrect encoding' do
      subject { EncodingFixer.new("Gr\xFC\xDFe".force_encoding('ASCII-8BIT')) } # actually Windows-1252
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with invalid characters' do
      subject { EncodingFixer.new("Gr\x80\x81e".force_encoding('ASCII-8BIT')) } # never valid
      it { is_expected.to convert_to_utf8('Gr��e') }
    end

  end

end

Owner of this card:

Arne Hartherz
Last edit:
over 1 year ago
by Arne Hartherz
Posted by Arne Hartherz to makandra dev