Ruby: Converting UTF-8 codepoints to characters


Ruby offers several ways to convert between a character and its integer value (7-bit ASCII value or Unicode codepoint):

  • String#ord or String#unpack to get character values
  • Integer#chr or Array#pack to convert character values into Strings

Character values to Strings

Integer#chr

To get the character for a 7-bit ASCII value or UTF-8 codepoint (0-127) you can use Integer#chr:

116.chr
# => "t"

To get a character for values larger than 127, you need to pass the encoding. E.g. to get codepoint 252 in UTF-8:

252.chr(Encoding::UTF_8)
# => "ü"
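It may help to see what Integer#chr does when the encoding is left out. The following is a minimal sketch, assuming a plain Ruby process where Encoding.default_internal is nil (the default); the variable names are only illustrative.

```ruby
# Assumption: Encoding.default_internal is nil (the default for a plain `ruby` process).
ascii  = 116.chr                   # values 0-127 yield a US-ASCII string
binary = 252.chr                   # values 128-255 yield one raw byte, not "ü"
utf8   = 252.chr(Encoding::UTF_8)  # with an explicit encoding we get the character

ascii.encoding == Encoding::US_ASCII     # => true
binary.encoding == Encoding::ASCII_8BIT  # => true
binary.bytes                             # => [252]
utf8.bytes                               # => [195, 188]

# Values above 255 cannot be represented without an encoding and raise RangeError.
range_error_raised =
  begin
    9786.chr
    false
  rescue RangeError
    true
  end
```

So the encoding argument is not only needed to get a readable result: for 128-255 it decides between a raw byte and a UTF-8 character, and for larger values the call fails without it.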

Array#pack

Array#pack may feel less intuitive, but it does not require passing an encoding. Use the U* directive, which packs each integer as a UTF-8 character.

[116].pack('U*')
# => "t"
[252].pack('U*')
# => "ü"

Note that you must wrap your values in an Array. In turn, this makes it easy to construct Strings from multiple values:

[116, 252, 114, 32, 9786].pack('U*')
# => "tür ☺"

Note that the asterisk (*) is required to pack more than one value: a plain U directive consumes only the first element of the Array.
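The effect of the asterisk is easiest to see side by side. A small sketch (variable names are illustrative) comparing a bare U, an explicit count, and U*:

```ruby
values = [116, 252, 114]

first_only = values.pack('U')   # one directive consumes one value; the rest are ignored
two_chars  = values.pack('U2')  # a count consumes exactly that many values
all_chars  = values.pack('U*')  # the asterisk consumes all remaining values

first_only  # => "t"
two_chars   # => "tü"
all_chars   # => "tür"
```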


Strings to character values

String#ord

To convert back from a String to its codepoint, use String#ord:

"t".ord
# => 116
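String#ord only looks at the first character of the receiver. The sketch below shows this and, as an aside not covered in the card itself, uses String#codepoints and String#each_char from Ruby core to get the values for a whole string.

```ruby
word = "tür ☺"

first_value = word.ord                  # only the first character is considered
all_values  = word.codepoints           # one integer per character
per_char    = word.each_char.map(&:ord) # same result, calling ord on each character

first_value  # => 116
all_values   # => [116, 252, 114, 32, 9786]

# Calling ord on an empty string raises ArgumentError.
empty_raises =
  begin
    "".ord
    false
  rescue ArgumentError
    true
  end
```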

String#unpack

Strings offer unpack as the inverse of Array#pack. Codepoints are returned as an Array, so you can convert entire strings:

"t".unpack('U*')
# => [116]
"tür ☺".unpack('U*')
# => [116, 252, 114, 32, 9786]
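Since unpack and pack are inverses, a round trip should give back the original string. A short sketch, assuming the source file is UTF-8 encoded (the Ruby default):

```ruby
text = "tür ☺"

codepoints = text.unpack('U*')       # String -> Array of integers
rebuilt    = codepoints.pack('U*')   # Array of integers -> String

rebuilt == text                      # => true
rebuilt.encoding == Encoding::UTF_8  # => true

# As with pack, leaving out the asterisk only handles the first character.
first_only = text.unpack('U')        # => [116]
```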

pack/unpack can also convert to and from many other formats and encodings (like quoted-printable). Please see the pack documentation for more information.
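As one illustration of another directive, here is a sketch of a quoted-printable round trip with M. Note that unpack returns a binary string here, so the example tags it as UTF-8 afterwards; that step assumes the original input was UTF-8.

```ruby
# Encode: non-ASCII bytes become =XX, and pack appends a soft line break ("=\n").
encoded = ["tür"].pack('M')
encoded  # => "t=C3=BCr=\n"

# Decode: the result is a binary (ASCII-8BIT) string holding the original bytes.
decoded = encoded.unpack('M').first
decoded.force_encoding(Encoding::UTF_8)  # assumption: the input was UTF-8
decoded  # => "tür"
```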

Last edit: Arne Hartherz
License: Source code in this card is licensed under the MIT License.
Posted by Henning Koch to makandra dev (2016-06-21 10:04)