Posted 12 months ago. Visible to the public.

How to check if a file is a human readable text file

The ptools gem extends Ruby's File class with a handy method binary? which checks whether a file is a binary file. This method tells the truth most of the time. But sometimes it doesn't, and that's what causes pain. The method is defined as follows:

    # Returns whether or not +file+ is a binary file. Note that this is
    # not guaranteed to be 100% accurate. It performs a "best guess" based
    # on a simple test of the first +File.blksize+ characters.
    #
    # Example:
    #
    #   File.binary?('somefile.exe') # => true
    #   File.binary?('somefile.txt') # => false
    #--
    # Based on code originally provided by Ryan Davis (which, in turn, is
    # based on Perl's -B switch).
    #
    def self.binary?(file)
      s = (File.read(file, File.stat(file).blksize) || "").split(//)
      ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30
    end

As you can see, the documentation itself says this is not guaranteed to be 100% accurate.

Why?

Because it wants to be fast.

binary? reads only the first x characters of the file, where x is the block size reported by File.stat(file).blksize, and checks how many of them are printable ASCII characters (the range from " " to "~"). If more than 30% of these characters are not printable ASCII (maybe special UTF-8 characters, maybe just binary data), then the file must be a binary file!
Sounds fair.
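
To make the 30% rule concrete, here is a minimal sketch of the same arithmetic applied to an in-memory string instead of a file. The helper names non_printable_ratio and looks_binary? are hypothetical and only serve this illustration; they are not part of any library.

```ruby
# Share of characters in +sample+ that are not printable ASCII (" " through "~").
# Hypothetical helper for illustration; it mirrors the arithmetic of binary?.
def non_printable_ratio(sample)
  chars = sample.chars
  return 0.0 if chars.empty?

  printable = chars.count { |char| (' '..'~').cover?(char) }
  (chars.size - printable) / chars.size.to_f
end

# Same threshold as in the method above: more than 30% means "binary".
def looks_binary?(sample)
  non_printable_ratio(sample) > 0.30
end

text  = 'Hello, world!'                 # 13 printable characters
mixed = "abcdefg\x00\x01\x02".b         # 7 printable, 3 control bytes
blob  = "abcdef\x00\x01\x02\x03".b      # 6 printable, 4 control bytes

puts non_printable_ratio(text)  # 0.0
puts looks_binary?(mixed)       # false, 3 of 10 is not more than 30%
puts looks_binary?(blob)        # true, 4 of 10 is more than 30%
```

Note that the threshold is strict: a sample with exactly 30% non-printable characters still counts as text.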

The cake is a lie

Let's give it a try. We have the two linked files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files: your PDF reader will be able to open them, and they look the same. But if you ask Ruby, the results are different:

    File.binary?('path/to/dummy_ok.pdf')
    => true
    File.binary?('path/to/dummy_fail.pdf')
    => false

That's because I simply prepended some "A"s to the dummy_fail.pdf file, enough that the sampled beginning of the file consists mostly of printable ASCII characters.
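
The effect can be reproduced without the PDF files. The sketch below is an assumption-laden illustration, not the library code: first_block_looks_binary? is a hypothetical, byte-based re-implementation of the first-block heuristic (the original counts characters), and verdict_for is a hypothetical helper that writes bytes to a temporary file.

```ruby
require 'tempfile'

# Byte-based re-implementation of the first-block heuristic, for illustration
# only (hypothetical name; the original counts characters, not bytes).
def first_block_looks_binary?(path)
  block_size = File.stat(path).blksize || 4096 # blksize can be nil on some platforms
  sample = File.binread(path, block_size) || ''
  return false if sample.empty?

  printable = sample.each_byte.count { |byte| byte.between?(0x20, 0x7E) }
  (sample.bytesize - printable) / sample.bytesize.to_f > 0.30
end

# Writes +bytes+ to a temporary file and returns the heuristic's verdict.
def verdict_for(bytes)
  Tempfile.create('binary_check') do |file|
    file.binmode
    file.write(bytes.b)
    file.flush
    first_block_looks_binary?(file.path)
  end
end

# 4,000 bytes, none of them printable ASCII.
binary_payload = [0x00, 0xFF, 0x80, 0x07].pack('C*') * 1_000

# One full block of "A"s, sized to the block size of the temp directory.
block_size = Tempfile.create('probe') { |file| File.stat(file.path).blksize || 4096 }
padding    = ('A' * block_size).b

puts verdict_for(binary_payload)            # true
puts verdict_for(padding + binary_payload)  # false, the sampled block is all "A"s
```

The second file contains exactly the same binary payload, but the heuristic never reads far enough to see it.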

A simple solution

OK, now we know how to fool the binary? method. But how can we check whether a file is entirely readable in our preferred encoding (we'll use the default, UTF-8)?

The following method converts the content of the file from raw bytes ('binary') to UTF-8. An Encoding::UndefinedConversionError is raised for every byte that has no UTF-8 equivalent in this conversion, which is every byte outside the 7-bit ASCII range.

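    def is_file_binary?(path)
      # File.binary? only samples the first block and can be fooled,
      # so inspect the whole file instead.
      content = File.binread(path)
      # Raises for every byte that is not 7-bit ASCII.
      content.encode('UTF-8', 'binary')
      false
    rescue Encoding::UndefinedConversionError
      true
    end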
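
Because the conversion from 'binary' to UTF-8 rejects every byte above 0x7F, a text file with umlauts or other non-ASCII characters does not pass it either. If such files should count as human readable, one alternative (a different technique from the method above, sketched under assumptions) is to read the whole file as UTF-8 and ask String#valid_encoding?. The names utf8_text_file? and with_temp_file are hypothetical, and rejecting control characters other than tab, line feed and carriage return is an extra assumption about what "human readable" means.

```ruby
require 'tempfile'

# Control characters that rarely appear in text files
# (everything below 0x20 except tab, line feed and carriage return, plus DEL).
SUSPICIOUS_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

# Hypothetical helper: true if the whole file is valid UTF-8 and free of
# suspicious control characters. Reads the entire file into memory.
def utf8_text_file?(path)
  content = File.read(path, encoding: 'UTF-8')
  content.valid_encoding? && !content.match?(SUSPICIOUS_CONTROL_CHARS)
end

# Hypothetical helper: writes +bytes+ to a temporary file and yields its path.
def with_temp_file(bytes)
  Tempfile.create('text_check') do |file|
    file.binmode
    file.write(bytes.b)
    file.flush
    yield file.path
  end
end

with_temp_file("Grüße\n") { |path| puts utf8_text_file?(path) }                      # true
with_temp_file(('A' * 5_000).b + "\xFF\xFE".b) { |path| puts utf8_text_file?(path) } # false
with_temp_file("plain\x00text") { |path| puts utf8_text_file?(path) }                # false
```

Unlike the first-block heuristic, this check reads the whole file, so it is slower on large files but cannot be fooled by padding at the beginning.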

Owner of this card:

Jakob Scholz
Last edit:
12 months ago
by Julian
Attachments:
dummy_fail.pdf, dummy_ok.pdf
Posted by Jakob Scholz to makandra dev