How to check if a file is a human readable text file
Ruby's File class gains a handy method from the ptools gem: binary?, which checks whether a file is a binary file. This method tells the truth most of the time, but sometimes it doesn't, and that is what causes pain. The method is defined as follows:
# Returns whether or not +file+ is a binary file.  Note that this is
# not guaranteed to be 100% accurate.  It performs a "best guess" based
# on a simple test of the first +File.blksize+ characters.
#
# Example:
#
#   File.binary?('somefile.exe') # => true
#   File.binary?('somefile.txt') # => false
#--
# Based on code originally provided by Ryan Davis (which, in turn, is
# based on Perl's -B switch).
#
def self.binary?(file)
  s = (File.read(file, File.stat(file).blksize) || "").split(//)
  ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30
end
As you can see, the documentation itself says this is not guaranteed to be 100% accurate. Why? Because it wants to be fast. What binary? does is take the first x* characters of the file and check how many of them are printable ASCII characters (space through ~). If more than 30% of these characters are not printable ASCII (they may be special UTF-8 characters, or just binary data), the file is declared a binary file!
* x is the file system block size, e.g. 4096 bytes.
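The ratio test described above can be tried on plain strings without touching the file system. This is only a minimal sketch: `non_printable_ratio` and `looks_binary?` are hypothetical helper names introduced for illustration, and the sketch counts bytes rather than characters to stay independent of the string's encoding.

```ruby
# Hypothetical helpers that mirror the heuristic on an in-memory string.

# Fraction of bytes that fall outside printable ASCII (" " through "~").
def non_printable_ratio(sample)
  bytes = sample.bytes
  return 0.0 if bytes.empty?

  printable = bytes.count { |b| b.between?(0x20, 0x7E) }
  (bytes.size - printable) / bytes.size.to_f
end

# Same 30% threshold as the method shown above.
def looks_binary?(sample)
  non_printable_ratio(sample) > 0.30
end

puts looks_binary?("Hello, world!")          # => false
puts looks_binary?("\x00\x01\x02\xFFabc".b)  # => true (4 of 7 bytes are not printable)
```

Note that only the sampled bytes matter: whatever comes after the sample never influences the answer.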
Let's have a try. We have two files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files: your PDF reader will be able to open them, and they look the same. But if you ask Ruby, the results are different:
File.binary?('path/to/dummy_ok.pdf')   # => true
File.binary?('path/to/dummy_fail.pdf') # => false
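That's because I simply prepended some "A"s to dummy_fail.pdf, so the part of the file that binary? inspects consists mostly of printable ASCII characters.

OK, now we know how to fool the binary? method. But how can we check whether a file is entirely readable in our preferred encoding (we'll use the default, UTF-8)?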
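Before answering that, the padding trick can be reproduced with made-up data. The sketch below assumes a block size of 4096 bytes and uses random high bytes as a stand-in for real PDF content; `sample_looks_binary?` is a hypothetical re-implementation of the ratio test, not the library method. It pads a full block for simplicity, although any padding that pushes the non-printable share of the first block to 30% or below would do.

```ruby
BLOCK_SIZE = 4096 # assumed block size; the real value comes from File.stat

# Hypothetical re-implementation of the ratio test on the first block only.
def sample_looks_binary?(data)
  sample = data.byteslice(0, BLOCK_SIZE).bytes
  return false if sample.empty?

  non_printable = sample.count { |b| !b.between?(0x20, 0x7E) }
  non_printable / sample.size.to_f > 0.30
end

# Stand-in for binary content: bytes 0x80..0xFF are never printable ASCII.
payload = Array.new(8192) { rand(0x80..0xFF) }.pack("C*")

# Padding that fills the whole first block with printable characters.
fooled = ("A" * BLOCK_SIZE).b + payload

puts sample_looks_binary?(payload) # => true
puts sample_looks_binary?(fooled)  # => false
```

Both strings carry the same binary payload, yet only the unpadded one is reported as binary.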
The following method tries to convert the content of the file to UTF-8. An Encoding::UndefinedConversionError is raised for bytes that cannot be converted to UTF-8.
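def is_file_binary?(path)
  # Fast path: trust the heuristic when it already reports binary data.
  return true if File.binary?(path)

  # Slow path: read the whole file as raw bytes and try to convert it.
  # With 'binary' as the source encoding, any byte above 0x7F raises.
  content = File.binread(path)
  content.encode('UTF-8', 'binary')
  false
rescue Encoding::UndefinedConversionError
  true
end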
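One caveat: converting from 'binary' fails on every byte above 0x7F, so text containing valid multi-byte UTF-8 characters is rejected as well. A different technique, not the one used above, is to tag the raw bytes as UTF-8 and ask Ruby whether they form a valid sequence. The sketch below assumes that "valid UTF-8 without NUL bytes" is a good enough definition of readable text; `utf8_text?` is a hypothetical name chosen for this example.

```ruby
require "tempfile"

# Hypothetical helper: true when the whole file is valid UTF-8 and
# contains no NUL bytes (an assumption about what counts as text).
def utf8_text?(path)
  content = File.binread(path) # raw bytes, no transcoding
  return false if content.include?("\x00")

  content.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage with two throwaway files.
Tempfile.create("text") do |f|
  f.binmode
  f.write("Gr\u00FC\u00DFe\n") # UTF-8 text with multi-byte characters
  f.flush
  puts utf8_text?(f.path) # => true
end

Tempfile.create("blob") do |f|
  f.binmode
  f.write("\xFF\xFE\x00\x01".b) # bytes that are not valid UTF-8
  f.flush
  puts utf8_text?(f.path) # => false
end
```

This reads the entire file, so it trades the speed of the block-size heuristic for a definite answer about the encoding.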