How to check if a file is a human readable text file


The ptools gem adds a handy method binary? to Ruby's File class, which checks whether a file is a binary file. This method tells the truth most of the time. But sometimes it doesn't, and that's what causes pain. The method is defined as follows:

# Returns whether or not +file+ is a binary file.  Note that this is
# not guaranteed to be 100% accurate.  It performs a "best guess" based
# on a simple test of the first +File.blksize+ characters.
#
# Example:
#
#   File.binary?('somefile.exe') # => true
#   File.binary?('somefile.txt') # => false
#--
# Based on code originally provided by Ryan Davis (which, in turn, is
# based on Perl's -B switch).
#
def self.binary?(file)
  s = (File.read(file, File.stat(file).blksize) || "").split(//)
  ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30
end

As you can see, the documentation itself says this is not guaranteed to be 100% accurate.

Why?

Because it wants to be fast.

What binary? does is take the first File.stat(file).blksize characters (the file system's preferred I/O block size) and check how many of them are printable ASCII characters. If more than 30% of these characters are not printable ASCII (maybe special UTF-8 characters, maybe just binary data), then the file must be a binary file!
Sounds fair.
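
To make the 30% rule concrete, here is a minimal sketch of the same ratio computed on a string instead of a file. The helper name non_printable_ratio is made up for this illustration and is not part of any library; it counts bytes in the range 0x20 to 0x7E, which is the same range as " ".."~" in the method above.

```ruby
# Hypothetical helper (not part of any library): share of bytes that are
# NOT printable ASCII, i.e. outside 0x20 (" ") .. 0x7E ("~").
def non_printable_ratio(sample)
  bytes = sample.bytes
  return 0.0 if bytes.empty?

  non_printable = bytes.count { |byte| !byte.between?(0x20, 0x7E) }
  non_printable / bytes.size.to_f
end

text_sample   = "Hello, world!"              # printable ASCII only
binary_sample = "\x00\xFF\x10\x80abcd".b     # 4 of 8 bytes are non-printable

non_printable_ratio(text_sample)    # => 0.0, below the 0.30 threshold
non_printable_ratio(binary_sample)  # => 0.5, above the 0.30 threshold
```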

The cake is a lie

Let's have a try. We have the two linked files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files: your PDF reader will be able to open them, and they look the same. But if you ask Ruby, the results are different:

File.binary?('path/to/dummy_ok.pdf')   # => true
File.binary?('path/to/dummy_fail.pdf') # => false

That's because I simply prepended some "A"s to the dummy_fail.pdf file.
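
The effect can be reproduced without the PDF files. The sketch below assumes a block size of 4096 bytes (the real value depends on the file system) and uses a made-up helper, non_printable_ratio, that mirrors the ratio binary? computes. Only the first block is inspected, so a full block of "A"s in front of the binary data pushes the ratio to zero.

```ruby
# Hypothetical helper: share of bytes outside printable ASCII (0x20..0x7E).
def non_printable_ratio(sample)
  bytes = sample.bytes
  return 0.0 if bytes.empty?

  bytes.count { |byte| !byte.between?(0x20, 0x7E) } / bytes.size.to_f
end

block_size  = 4096                               # assumed; varies by file system
binary_data = "\x00\xFF".b * 2048                # 4096 non-printable bytes
padded_data = ("A" * block_size).b + binary_data # same data behind a block of "A"s

# Only the first block is sampled, as in File.read(file, blksize).
ok_sample   = binary_data.byteslice(0, block_size)
fail_sample = padded_data.byteslice(0, block_size)

non_printable_ratio(ok_sample) > 0.30    # => true, classified as binary
non_printable_ratio(fail_sample) > 0.30  # => false, the "A"s hide the binary data
```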

A simple solution

OK, now we know how to fool the binary? method. But how can we check if a file is entirely readable with our preferred encoding (we'll use the default, UTF-8)?

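The following method reads the whole file and tries to transcode its bytes from binary (ASCII-8BIT) to UTF-8. An Encoding::UndefinedConversionError is raised for every byte that has no UTF-8 equivalent in this conversion, which is any byte outside the 7-bit ASCII range. In that case the file is treated as binary. Unlike File.binary?, the check looks at the entire content, so a harmless first block cannot hide binary data further down.

def is_file_binary?(path)
  # Read the raw bytes; binread tags the string as ASCII-8BIT (binary).
  content = File.binread(path)
  # Raises as soon as a byte cannot be converted to UTF-8.
  content.encode('UTF-8', 'binary')
  false
rescue Encoding::UndefinedConversionError
  true
end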
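
Transcoding from binary to UTF-8 rejects every byte above 0x7F, so text that contains non-ASCII characters such as umlauts fails that conversion too. If such files should count as readable, one possible variant is to check whether the raw bytes form valid UTF-8 with String#valid_encoding?. This is a sketch of a different technique, not the method above, and the name utf8_text_file? is made up for this example.

```ruby
require 'tempfile'

# Hypothetical helper: true if the entire file content is valid UTF-8.
def utf8_text_file?(path)
  content = File.binread(path)               # raw bytes, no transcoding
  content.force_encoding(Encoding::UTF_8).valid_encoding?
end

Tempfile.create('utf8_check') do |file|
  file.binmode
  file.write("Grüße\n")                      # valid UTF-8 with non-ASCII characters
  file.flush
  utf8_text_file?(file.path)                 # => true
end

Tempfile.create('utf8_check') do |file|
  file.binmode
  file.write(("A" * 100).b + "\xFF\xFE".b)   # leading "A"s followed by invalid bytes
  file.flush
  utf8_text_file?(file.path)                 # => false
end
```

Note that valid UTF-8 may still contain control characters such as NUL bytes, so this check only confirms that the file can be decoded, not that every character is pleasant to read.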

Last edit: Julian
License: Source code in this card is licensed under the MIT License.
Posted to makandra dev (2020-08-06 05:11)