When working with external data sources, you may have to deal with improperly encoded strings.
While you should prefer agreeing on a single encoding with the party providing the data, you cannot always enforce that on external sources.
It gets worse when you receive data with an encoding declaration that does not reliably match the accompanying bytes.
Here is a Ruby class that helps convert such strings to a proper encoding.
Note that it tries several ways of changing the encoding, one after another, and returns the first result that is valid. This is not a silver bullet and may or may not work in your case. It did work for me.
Be careful when adding extra encodings. Byte sequences which you would consider invalid might suddenly make sense for #encode.
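To illustrate that warning, here is a small sketch using ISO-8859-1 as an example of an "extra" encoding (my choice for illustration, it is not part of the class below). Ruby's ISO-8859-1 assigns a character to every byte, so a conversion from it does not fail, even for bytes that were really meant as UTF-8:

```ruby
# The three bytes below are the UTF-8 encoding of the euro sign, tagged as binary.
bytes = "\xE2\x82\xAC".b

# Every byte has a meaning in ISO-8859-1, so this conversion does not raise,
# although the result is not what the sender meant.
misread = bytes.encode('UTF-8', 'ISO-8859-1')
p misread.valid_encoding?                                  # true
p misread.codepoints.map { |c| format('U+%04X', c) }       # ["U+00E2", "U+0082", "U+00AC"]

# Interpreting the same bytes as UTF-8 yields the single intended character.
intended = bytes.dup.force_encoding('UTF-8')
p intended.codepoints.map { |c| format('U+%04X', c) }      # ["U+20AC"]
```

If such an encoding were tried early in the chain, every later attempt would be skipped, because the first one always "succeeds".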
This very likely works best when your strings are in a few languages that share a similar character set, e.g. English and German, and will probably not work well when you need to support different character sets.
If you do not care too much about occasionally missing characters, you could simply encode strings with placeholders (see "last resort" below) and skip the other conversions.
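The placeholder-only approach could look roughly like the sketch below. The method name `lossy_utf8` is made up for this example, and the sample input is assumed to arrive tagged as binary (ASCII-8BIT):

```ruby
# Hypothetical helper: convert to UTF-8 and replace everything that cannot
# be converted with the replacement character U+FFFD instead of raising.
def lossy_utf8(string)
  string.encode('UTF-8', invalid: :replace, undef: :replace)
end

# Two high bytes that have no defined conversion from binary to UTF-8.
binary = "Gr\x80\x81e".b

result = lossy_utf8(binary)
p result.encoding.name   # "UTF-8"
p result.codepoints.map { |c| format('U+%04X', c) }
# ["U+0047", "U+0072", "U+FFFD", "U+FFFD", "U+0065"]
```

This never raises for unconvertible bytes, but it also never recovers characters such as "ü" that one of the other conversions might have restored.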
class EncodingFixer
  def initialize(input)
    @input = input
  end

  attr_reader :input

  def utf8
    output = input if input.encoding.name == 'UTF-8' && input.valid_encoding?

    # Try converting from the string's given encoding.
    output ||= try_conversion { input.encode('UTF-8') }

    # Strings are sometimes composed of UTF-8 bytes while not being tagged as UTF-8.
    # Relabel a copy so the caller's string is not modified (and frozen strings work).
    output ||= try_conversion { input.dup.force_encoding('UTF-8') }

    # Try interpreting input as Windows-1252, because we've seen many such strings with incorrect encoding.
    output ||= try_conversion { input.encode('UTF-8', 'Windows-1252') }

    # Add any extra conversions that might make sense in your case.
    output ||= try_conversion { input.encode('UTF-8', 'ASCII-8BIT') }
    output ||= try_conversion { input.encode('UTF-8', 'US-ASCII') }

    # As a last resort, replace any unknown characters with a placeholder: �
    output ||= try_conversion { input.encode('UTF-8', invalid: :replace, undef: :replace) }

    output
  end

  private

  # Returns the block's result if it is a validly encoded string, nil otherwise.
  def try_conversion
    string = yield
    string if string.valid_encoding?
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil
  end
end
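The conversion helper both rescues exceptions and checks `valid_encoding?`, because the two kinds of call fail differently. A small sketch with sample bytes (Windows-1252 for "Grüße", tagged as binary) shows the difference:

```ruby
# Windows-1252 bytes for "Grüße", tagged as binary (ASCII-8BIT).
bytes = "Gr\xFC\xDFe".b

# encode performs a real conversion and raises on bytes it cannot map.
begin
  bytes.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

# force_encoding only relabels the bytes, so it succeeds silently ...
relabeled = bytes.dup.force_encoding('UTF-8')
# ... and the mismatch only shows up when you ask for it explicitly.
p relabeled.valid_encoding?   # false

# Naming the real source encoding produces the intended text.
converted = bytes.encode('UTF-8', 'Windows-1252')
p converted == "Gr\u00FC\u00DFe"   # true
```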
Here is a spec that goes along with it.
describe EncodingFixer do
  describe '#utf8' do
    matcher :convert_to_utf8 do |output|
      match do |subject|
        result = subject.utf8
        expect(result.encoding.name).to eq('UTF-8')
        expect(result).to eq(output)
      end
    end

    context 'when given a UTF-8 string' do
      subject { described_class.new('Grüße') }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with non-UTF-8 encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('Windows-1252')) }
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with incorrect encoding' do
      subject { described_class.new("Gr\xFC\xDFe".force_encoding('ASCII-8BIT')) } # actually Windows-1252
      it { is_expected.to convert_to_utf8('Grüße') }
    end

    context 'when given a string with invalid characters' do
      subject { described_class.new("Gr\x80\x81e".force_encoding('ASCII-8BIT')) } # never valid
      it { is_expected.to convert_to_utf8('Gr��e') }
    end
  end
end