Ruby: Retrieving and processing files via Selenium and JavaScript

This card shows an uncommon way to retrieve a file using Selenium: JavaScript running in the browser fetches the file and returns its binary data as an array to the Ruby code.

The following code example retrieves a PDF, but the approach also works for other file types.

require "selenium-webdriver"

selenium_driver = Selenium::WebDriver.for :chrome
selenium_driver.navigate.to('https://example.com')
link_to_pdf = 'https://blobs.example.com/random-pdf'

binary_data_array = selenium_driver.execute_script(<<-JS, link_to_pdf)
  const response = await fetch(arguments[0])

  if (!response.ok) {
    // Throwing an error from JS within execute_script raises
    // Selenium::WebDriver::Error::JavascriptError, which you can rescue in Ruby
    throw new Error(`Unsuccessful request: ${response.status}`)
  }

  const contentType = response.headers.get('content-type')
  if (!contentType || !contentType.includes('application/pdf')) {
    throw new Error(`Invalid MIME type: ${contentType}`)
  }

  // Read the response body as binary data
  const arrayBuffer = await response.arrayBuffer()
  // Convert the array buffer to a typed array in order to interact with the data.
  // The binary data is now represented as a typed array of 8-bit unsigned integers
  // => [37, 80, 68, 70, ...]
  const unsignedInt8Array = new Uint8Array(arrayBuffer)
  return unsignedInt8Array
JS

# convert the array we got from JS into a binary string which we can use to create a file
binary_string = binary_data_array.pack('C*')

Read the docs for more information about packing data in Ruby.
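
To illustrate the last step, here is a minimal sketch of turning such an array into a file on disk. The byte array is a hard-coded stand-in for what execute_script would return, and the file name `random.pdf` is an assumption made for this example.

```ruby
require 'tmpdir'

# Stand-in for the array returned by execute_script: the bytes of "%PDF-1.4\n"
binary_data_array = [37, 80, 68, 70, 45, 49, 46, 52, 10]

# 'C*' packs every element as one 8-bit unsigned integer
binary_string = binary_data_array.pack('C*')
puts binary_string.encoding # binary (ASCII-8BIT), not UTF-8

read_back = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'random.pdf') # hypothetical file name
  # binwrite writes the bytes as-is, without newline or encoding conversion
  File.binwrite(path, binary_string)
  read_back = File.binread(path)
end

puts read_back == binary_string
```

Using File.binwrite instead of File.write keeps the bytes untouched on every platform, which matters for binary formats such as PDF.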

Caveats

  • Make sure you trust the source you are retrieving the file from.
  • This can be very memory-intensive for large files: the whole file is held in browser memory, transferred as an array of numbers, and converted again in Ruby.
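
To get a rough feel for the transfer overhead, the following sketch compares the size of raw bytes with the same bytes written as decimal numbers in a JSON array, which is the form in which WebDriver passes script results back. The 1 MB size and the random content are assumptions for illustration; actual memory use in the browser and in Ruby will differ.

```ruby
require 'json'

# Assumption for illustration: a 1 MB file consisting of random bytes
byte_count = 1_000_000
bytes = Array.new(byte_count) { rand(256) }

json_payload = bytes.to_json      # looks like "[37,80,68,...]"
binary_string = bytes.pack('C*')  # exactly one byte per element

ratio = json_payload.bytesize.to_f / binary_string.bytesize

puts "binary: #{binary_string.bytesize} bytes"
puts "JSON:   #{json_payload.bytesize} bytes"
puts format('ratio:  %.1fx', ratio)
```

For random bytes the JSON form comes out at roughly three to four times the size of the file, before counting the Ruby array of integers that is built from it.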

Other ways to solve this problem

Depending on what you are trying to achieve, other solutions may be a better fit.
These alternatives could be:

  • Set a default download directory and let the browser download the file to disk
  • Take the link to the file and make an HTTP request from Ruby code (works well if no authentication is required)
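
The second alternative could look roughly like the sketch below, which uses Ruby's standard Net::HTTP library. The helper name `fetch_pdf` is hypothetical, and the sketch assumes the URL needs no cookies or authentication and does not redirect. It mirrors the status and MIME type checks from the JavaScript version.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: downloads a PDF with a plain GET request and returns its bytes.
# Assumes no authentication is needed; redirects are not followed.
def fetch_pdf(url)
  response = Net::HTTP.get_response(URI(url))

  unless response.is_a?(Net::HTTPSuccess)
    raise "Unsuccessful request: #{response.code}"
  end

  content_type = response['content-type']
  unless content_type&.include?('application/pdf')
    raise "Invalid MIME type: #{content_type.inspect}"
  end

  response.body
end

# Usage (URL taken from the example above):
# binary_string = fetch_pdf('https://blobs.example.com/random-pdf')
# File.binwrite('random.pdf', binary_string)
```

This avoids moving the file through the browser at all, so the memory overhead of the array transfer does not apply.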

Author and last edit: Maximilian Berger
License
Source code in this card is licensed under the MIT License.
Posted by Maximilian Berger to makandra dev (2024-08-16 10:55)