The ptools gem extends Ruby's File class with a handy method, binary?, which checks whether a file is a binary file. It tells the truth most of the time, but sometimes it doesn't, and that's what causes pain. The method is defined as follows:
# Returns whether or not +file+ is a binary file. Note that this is
# not guaranteed to be 100% accurate. It performs a "best guess" based
# on a simple test of the first +File.blksize+ characters.
#
# Example:
#
# File.binary?('somefile.exe') # => true
# File.binary?('somefile.txt') # => false
#--
# Based on code originally provided by Ryan Davis (which, in turn, is
# based on Perl's -B switch).
#
def self.binary?(file)
  s = (File.read(file, File.stat(file).blksize) || "").split(//)
  ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30
end
As you can see, the documentation itself says this is not guaranteed to be 100% accurate. Why? Because it wants to be fast.
What binary? does is take the first x* bytes of the file and check how many of them are printable ASCII characters. If more than 30% of them are not printable ASCII (maybe special UTF-8 characters, maybe just binary data), the file is assumed to be a binary file!
Sounds fair.
* x is the file system block size, e.g. 4096 bytes.
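To make the heuristic concrete, here is a minimal sketch that applies the same 30% rule to an in-memory string instead of a file. The helper name non_printable_ratio and the sample strings are made up for illustration; the printable range (space to tilde) and the threshold are taken from the method above.

```ruby
# Hypothetical helper: share of bytes outside the printable ASCII range
# " " (0x20) .. "~" (0x7E), mirroring the test inside binary?.
def non_printable_ratio(data)
  bytes = data.bytes
  return 0.0 if bytes.empty?

  non_printable = bytes.count { |byte| !byte.between?(0x20, 0x7E) }
  non_printable / bytes.size.to_f
end

text   = "Hello, world"          # only printable ASCII
binary = "\x00\x01\x02\x03ab".b  # four of six bytes are non-printable

puts non_printable_ratio(text)           # 0.0
puts non_printable_ratio(binary) > 0.30  # true
```

Note that newlines and tabs also fall outside the " ".."~" range, so they count as non-printable in this test.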
The cake is a lie
Let's give it a try. We have the two linked files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files; your PDF reader will be able to open them, and they look the same. But if you ask Ruby, the results are different:
File.binary?('path/to/dummy_ok.pdf')   # => true
File.binary?('path/to/dummy_fail.pdf') # => false
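That's because I simply prepended some "A"s to the dummy_fail.pdf file. The first block that binary? inspects then contains enough printable ASCII to keep the share of non-printable characters below 30%.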
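The effect can be reproduced without the PDFs. The sketch below is only a stand-in: first_block_binary? is a hypothetical reimplementation of the heuristic with an assumed block size of 4096 bytes (the real method asks the file system), with_temp_file is a made-up helper, and the payload is invented binary data rather than an actual PDF.

```ruby
require 'tempfile'

# Hypothetical stand-in for the heuristic: looks only at the first block.
# The default of 4096 bytes is an assumption, not the real blksize lookup.
def first_block_binary?(path, block_size = 4096)
  bytes = (File.binread(path, block_size) || "").bytes
  return false if bytes.empty?

  non_printable = bytes.count { |byte| !byte.between?(0x20, 0x7E) }
  non_printable / bytes.size.to_f > 0.30
end

# Hypothetical helper: writes +content+ to a temp file, yields its path,
# and returns whatever the block returns.
def with_temp_file(content)
  Tempfile.create('binary-check') do |file|
    file.binmode
    file.write(content)
    file.flush
    yield file.path
  end
end

payload = "\x00\xFF\x10\x80".b * 2048  # made-up data, every byte non-printable
padded  = ("A" * 4096) + payload       # same data behind one block of "A"s

with_temp_file(payload) { |path| puts first_block_binary?(path) }  # true
with_temp_file(padded)  { |path| puts first_block_binary?(path) }  # false
```

The second file still carries the full binary payload; the heuristic simply never looks past the first block.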
A simple solution
OK, now we know how to fool the binary? method. But how can we check whether a file is entirely readable with our preferred encoding (we'll use the default, UTF-8)?
The following method tries to encode the whole content of the file to UTF-8. An Encoding::UndefinedConversionError will be raised for bytes that cannot be converted to UTF-8, and in that case the file is treated as binary.
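def is_file_binary?(path)
  # If the quick check already says "binary", there is nothing left to do.
  return true if File.binary?(path)

  # Otherwise look at the entire file, not just the first block.
  content = File.read(path)
  content&.encode('UTF-8', 'binary')
  false
rescue Encoding::UndefinedConversionError => _e
  # At least one byte could not be converted, so treat the file as binary.
  true
end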
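One caveat: declaring the source encoding as 'binary' makes encode reject every non-ASCII byte, so valid UTF-8 text containing, say, umlauts would also end up in the rescue branch. If that is not wanted, one possible alternative (a different technique, not the approach used above) is to ask Ruby whether the raw bytes form valid UTF-8. The helper name utf8_text? is hypothetical, and like the method above it reads the whole file into memory.

```ruby
require 'tempfile'

# Hypothetical helper: true if every byte sequence in the file is valid UTF-8.
def utf8_text?(path)
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage sketch with a throwaway file containing non-ASCII UTF-8 text.
Tempfile.create('utf8-check') do |file|
  file.binmode
  file.write("Grüße\n")
  file.flush
  puts utf8_text?(file.path)
end
```

This only checks that the bytes decode as UTF-8; whether that is strict enough depends on what you want to count as "binary".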