How to check if a file is a human readable text file


With the ptools gem, Ruby's File class gains a handy method binary? which guesses whether a file is a binary file. It tells the truth most of the time. But sometimes it doesn't, and that's what causes pain. The method is defined as follows:

# Returns whether or not +file+ is a binary file.  Note that this is
# not guaranteed to be 100% accurate.  It performs a "best guess" based
# on a simple test of the first +File.blksize+ characters.
#
# Example:
#
#   File.binary?('somefile.exe') # => true
#   File.binary?('somefile.txt') # => false
#--
# Based on code originally provided by Ryan Davis (which, in turn, is
# based on Perl's -B switch).
#
def self.binary?(file)
  s = (File.read(file, File.stat(file).blksize) || "").split(//)
  ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30
end

As you can see, the documentation itself says this is not guaranteed to be 100% accurate.

Why?

Because it wants to be fast.

What binary? does is take the first block of the file (File.stat(file).blksize characters) and check how many of them are printable ASCII characters, i.e. characters in the range from " " to "~". If more than 30% of these characters are not printable ASCII (maybe special UTF-8 characters, maybe just binary data), then the file must be a binary file!
Sounds fair.
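
To make the 30% rule tangible, here is a minimal sketch of the same calculation on plain strings. The names PRINTABLE_ASCII, non_printable_ratio and looks_binary? are made up for this illustration. It counts bytes instead of splitting the string, which should be equivalent for the binary string that File.read returns when given a length.

```ruby
# Hypothetical re-implementation of the heuristic, for illustration only.
PRINTABLE_ASCII = (0x20..0x7E).freeze # the bytes for " " through "~"

# Share of bytes in +sample+ that are not printable ASCII.
def non_printable_ratio(sample)
  bytes = sample.bytes
  return 0.0 if bytes.empty?

  non_printable = bytes.count { |byte| !PRINTABLE_ASCII.cover?(byte) }
  non_printable / bytes.size.to_f
end

# Same threshold as in the listing above.
def looks_binary?(sample)
  non_printable_ratio(sample) > 0.30
end

puts looks_binary?("Hello, world!")         # every byte is printable ASCII
puts looks_binary?("\x00\x01\x02\xFF".b)    # raw binary data
puts looks_binary?("Grüße")                 # "ü" and "ß" take two non-ASCII bytes each
```

Note that line breaks and tabs also fall outside " ".."~", so they count against a text file as well.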

The cake is a lie

Let's have a try. We have the two linked files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files. Your PDF reader will be able to open them, and they look the same. But if you ask Ruby, the results are different:

File.binary?('path/to/dummy_ok.pdf')   # => true
File.binary?('path/to/dummy_fail.pdf') # => false

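That's because I simply prepended some "A"s to the dummy_fail.pdf file. They are printable ASCII characters, so they push the share of non-printable characters in the inspected part of the file below 30%.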
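
If you want to reproduce the trick without the linked PDFs, the following sketch builds two temporary files. heuristic_binary? and write_tempfile are hypothetical helpers written for this example. The first one is only a stand-in for binary?, so the snippet runs without any gem, and the payload is arbitrary made-up data, not a real PDF.

```ruby
require "tempfile"

# Hypothetical stand-in for binary?: inspects the first block of the file
# and applies the same 30% threshold.
def heuristic_binary?(path)
  sample = File.binread(path, File.stat(path).blksize) || ""
  return false if sample.empty?

  non_printable = sample.each_byte.count { |byte| byte < 0x20 || byte > 0x7E }
  non_printable / sample.bytesize.to_f > 0.30
end

# Hypothetical helper: writes +content+ to a temp file and returns the file.
def write_tempfile(content)
  file = Tempfile.new("binary_check")
  file.binmode
  file.write(content)
  file.flush
  file
end

payload = [0, 1, 2, 255].pack("C*") * 256         # 1024 non-printable bytes
padding = ("A" * (payload.bytesize * 3)).b        # three times as many "A"s

original = write_tempfile(payload)
padded   = write_tempfile(padding + payload)

puts heuristic_binary?(original.path) # true
puts heuristic_binary?(padded.path)   # false, although the payload is unchanged
```

The padding is three times the size of the payload, so at most 25% of the inspected bytes are non-printable, no matter how large the block size is.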

A simple solution

OK, now we know how to fool the binary? method. But how can we check whether a file is entirely readable in our preferred encoding (we'll use the default, UTF-8)?

The following method tries to convert the content of the file to UTF-8. An Encoding::UndefinedConversionError is raised for characters that cannot be converted to UTF-8.

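def is_file_binary?(path)
  # Trust the heuristic when it already considers the file to be text.
  return false unless File.binary?(path)

  content = File.read(path)
  # Raises Encoding::UndefinedConversionError if a byte cannot be converted.
  content.encode('UTF-8', 'binary')
  true
rescue Encoding::UndefinedConversionError
  false
end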
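
Note that the method above returns early for files that binary? already takes for text, such as dummy_fail.pdf. As an alternative, here is a rough sketch that skips the heuristic and checks whether the whole file is valid UTF-8. utf8_text_file? is a hypothetical name, and rejecting NUL bytes is an extra assumption of this sketch, since NUL is valid UTF-8 but rarely shows up in human readable text.

```ruby
require "tempfile"

# Hypothetical helper: true if the whole file is valid UTF-8 without NUL bytes.
def utf8_text_file?(path)
  content = File.binread(path)            # raw bytes, no transcoding
  return false if content.include?("\x00") # assumption: NUL means "not text"

  content.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage: write some bytes to a temp file and ask the helper.
def check_bytes(bytes)
  Tempfile.create("utf8_check") do |file|
    file.binmode
    file.write(bytes)
    file.flush
    utf8_text_file?(file.path)
  end
end

puts check_bytes("Grüße aus Augsburg\n")        # valid UTF-8 text
puts check_bytes(("A" * 100).b + "\xFF\xFE".b)  # "A" padding does not help here
```

The price is speed: this reads the entire file instead of only the first block, so it may not be a good fit for very large files.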
Last edit
Julian
License
Source code in this card is licensed under the MIT License.
Posted by Jakob Scholz to makandra dev (2020-08-06 05:11)