Posted 12 months ago. Visible to the public.

How to check if a file is a human readable text file

Ruby's File class has a handy method binary? which checks whether a file is a binary file. This method might be telling the truth most of the time. But sometimes it doesn't, and that's what causes pain. The method is defined as follows:

Copy
# Returns whether or not +file+ is a binary file. Note that this is # not guaranteed to be 100% accurate. It performs a "best guess" based # on a simple test of the first +File.blksize+ characters. # # Example: # # File.binary?('somefile.exe') # => true # File.binary?('somefile.txt') # => false #-- # Based on code originally provided by Ryan Davis (which, in turn, is # based on Perl's -B switch). # def self.binary?(file) s = (File.read(file, File.stat(file).blksize) || "").split(//) ((s.size - s.grep(" ".."~").size) / s.size.to_f) > 0.30 end

As you can see, the documentation itself says this is not guaranteed to be 100% accurate.

Why?

Because it want's to be fast.

What binary? does is it takes the first x* characters and checks how many of them are printable ASCII characters. If more than 30% of these characters are not (maybe special UTF-8 characters, maybe just binary data) printable ASCII characters, then the file must be a binary file!
Sounds fair.

The cake is a lie

Let's have a try. We have the two linked files, dummy_ok.pdf and dummy_fail.pdf. Both are valid PDF files, your PDF Reader will be able to open them and they seem to be the same. But if you ask ruby, the results are different:

Copy
File.binary?('path/to/dummy_ok.pdf') => true File.binary?('path/to/dummy_fail.pdf') => false

That's because I simply prepended some "A"s to the dummy_fail.pdf file.

A simple solution

Ok, now we know how to fool the binary? method. But how can we check if a file is entirely readable with our preferred encoding (We'll use the default UTF-8)?

The following method tries to encode the content of the file to UTF-8. A 'Encoding::UndefinedConversionError' will be raised for characters that are undefined in UTF-8.

Copy
def is_file_binary?(path) return false unless File.binary?(path) content = File.read(path) content&.encode('UTF-8', 'binary') true rescue Encoding::UndefinedConversionError => _e false end

Once an application no longer requires constant development, it needs periodic maintenance for stable and secure operation. makandra offers monthly maintenance contracts that let you focus on your business while we make sure the lights stay on.

Owner of this card:

Avatar
Jakob Scholz
Last edit:
12 months ago
by Julian
Attachments:
dummy_fail.pdf, dummy_ok.pdf
About this deck:
We are makandra and do test-driven, agile Ruby on Rails software development.
License for source code
Posted by Jakob Scholz to makandra dev
This website uses short-lived cookies to improve usability.
Accept or learn more