Using feature flags to stabilize flaky E2E tests

A flaky test is a test that is often green, but sometimes red. It may only fail on some PCs, or only when the entire test suite is run.

There are many causes for flaky tests. This card focuses on a specific class of feature with heavy side effects, mostly on the UI. Features like the following can amplify your flakiness issues by unexpectedly changing elements, causing excessive requests or other timing issues:

Lazy loading images
Autocomplete in search fields
Live form validations
Locking forms against concurrent edits
Polling
Animations
Captchas, request throttling and other abuse protections
Post-processing of file attachments

There are very few tests that actually observe features like lazy loading or polling. But when such features are always enabled, every test that crosses an affected component suffers from the randomness and slowdown that they introduce.

While you can fix every one of these issues by carefully controlling concurrency and timing in your tests, this requires a lot of work and sustained diligence. A better approach is to use feature flags to enable each of these features only for the tests that need it.

How always-on features cause flaky tests

Below you can find some real-world examples of always-on features breaking E2E tests.

Example: Autocomplete in search fields

This is from a real project:

A search field requests autocomplete suggestions while the user is typing.
Submitting the search form shows the search results.
Search and autocomplete were implemented with ElasticSearch.
Communication with ElasticSearch was mocked via VCR.
In a test for the search, the test types a query string into the text field.
This triggers autocomplete requests after a short debouncing delay.
Due to uncontrolled timing, the autocomplete requests were randomly sent before or after the search form was submitted.
💥 VCR would sometimes fail due to missing or unused cassettes.

The affected scenario never looked at autocomplete suggestions. 🤷

Example: Lazy loading of images

This is from a real project:

To save user bandwidth, the loading of images was delayed until the user scrolled close to the image element.
The test clicks on a link at the bottom of the page.
Capybara scrolls to the link so it can click on it.
Scrolling triggers lazy loading of revealed images.
As images pop in, the layout shifts.
Sometimes images load faster than Capybara waits between scrolling and clicking.
Capybara clicks at the wrong pixel position. The link is never followed.
💥 Expectations for the next screen fail because we never left the first screen.

The affected scenario did not care about any images. 🤷

Example: Live form validations

This is from a real project:

In a long form, a <select> automatically validates against the server when changed.
One test changed the <select>, then immediately scrolled down many pages and submitted the form.
This caused two concurrent HTTP requests, one for the validation and one for the submission.
Sometimes the validation request would return after the submit request.
💥 Unpoly tried to show the validation error, but the form was already gone.

This is an actual timing issue in the code, not the test.
However, it took a day to find a bug that could not realistically affect a user. 🤷

Also the affected scenario never looked at the validation results. 🤷

Note

The issue above can no longer occur in Unpoly 2's [up-validate].

Example: Locking forms against concurrent edits

This is from a real project:

When opening a form, JavaScript makes a request to acquire a lock for that record (1).
After 20 seconds the JavaScript starts polling the server (2),
to check if another user has taken the lock away from us by force.
When the PC was low on resources (parallel_tests), acquiring the lock (1) took so long that the client
started polling for updates (2) before the initial lock (1) was acquired.
While the polling request (2) is in flight, the first request (1) has finally acquired the lock.
Since the polling request (2) didn't have the lock yet, the server responds saying
that another user (1) has taken away the lock (that other user being ourselves!).
The form shows an error message and disables all fields.
💥 Subsequent interactions with the form fail because all fields are disabled.

This is something that we fixed in the code, not the test. The code is better now.
However, it took a day to find a bug that no user has seen in 6 years. 🤷

Example: Post-processing of file attachments

This is from a real project:

A file upload has heavy post-processing by a CarrierWave uploader (resizing a large image, extracting thumbnails from a video, rendering a preview image from a PDF).
The post-processing always finished within the configured Capybara timeout on a fast desktop PC.
When the same test was run on a slower PC or within a VM, it sometimes took longer than Capybara's retry timeout.
💥 The upload sometimes failed.

The affected scenario did not care about any processed file versions. 🤷

Example: Polling

This is from a real project:

JavaScript periodically requests updates from a server to keep a <div> in sync with the latest server-side data.
Sometimes the test ended right after a polling request was sent, but before it was received by the server process.
Capybara stops the server process to shut down the test suite.
The browser request fails.
💥 The browser reports a network error.
The step "there should be no JavaScript error" fails the scenario.

The affected scenario did not care about the periodic server updates. 🤷

Example: Animations

Testing an animated UI causes countless problems. Here are only a few:

While an element fades in, it may not be clickable.
While an element slides in from the edge of screen, Capybara's clicks may hit a previous position.
While an element is swapped with a new version using a transition effect, Capybara will see two versions of the element. It may interact with the wrong one!
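
Animations are a good candidate for a single global switch. The following is a minimal sketch, assuming that Unpoly (used elsewhere in this card) drives the animations, and assuming a <meta name="feature:animations"> tag that follows the convention this card introduces below for polling. applyAnimationFlag is a hypothetical helper name. up.motion.config.enabled is Unpoly's setting for turning its animations on or off. CSS transitions defined outside of Unpoly would need separate handling:

```javascript
// Disables Unpoly's animations when the (assumed) animations flag is off.
// `up` is the Unpoly global. It is passed in as an argument, together with
// the document, so the function can also be exercised with stubs.
// Returns true if animations remain enabled.
function applyAnimationFlag(up, doc = globalThis.document) {
  const meta = doc.querySelector("meta[name='feature:animations']")
  // A missing <meta> tag means the feature stays enabled.
  const disabled = meta !== null && meta.content === 'false'

  if (disabled) {
    // Only affects animations and transitions that are run by Unpoly.
    up.motion.config.enabled = false
  }

  return !disabled
}
```

In a browser this could be called once at startup as applyAnimationFlag(up).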

Using feature flags to tame problematic functionality

For each problematic feature, add a configuration option that enables it. Then enable it by default for all environments:

# config/application.rb
config.feature_polling = true
config.feature_animations = true
config.feature_lazy_load_images = true
config.feature_live_validations = true
config.feature_form_content_locks = true
config.feature_photo_processing = true
config.feature_autocomplete = true

Only for the test environment we disable each feature by default:

# config/environments/test.rb
config.feature_polling = false
config.feature_animations = false
config.feature_lazy_load_images = false
config.feature_live_validations = false
config.feature_form_content_locks = false
config.feature_photo_processing = false
config.feature_autocomplete = false

Now change your code for polling, animations, etc. to honor these feature flags. When a component sees its feature flag disabled, it should disable itself.

Making your code aware of feature flags

For this example we will use an Unpoly compiler to build a [poll] attribute. This attribute causes an element to be reloaded from the server every 5 seconds:

<ul id="users" poll>
  <li>zofletcher</li>
  <li>monicaphelps</li>
  <li>johnharris</li>
</ul>

A simple, feature-flag-aware polling implementation could look like this:

up.compiler('[poll]', function(element) {
  if (isFeatureDisabled('polling')) return
  
  // Reload element every 5 seconds.
  const timer = setInterval(function() { up.reload(element) }, 5_000)

  // Stop reloading when the element is removed from the DOM.
  return function() { clearInterval(timer) }
})

function isFeatureDisabled(feature) {
  const meta = document.querySelector(`meta[name='feature:${feature}']`)
  // A missing <meta> tag means the feature stays enabled.
  return meta !== null && meta.content === 'false'
}

Note how the second line of the compiler lets us disable polling using a <meta name="feature:polling"> element in the <head>:

<html>
  <head>
    <meta name="feature:polling" content="false">
  </head> 
   ...
</html>

We enable polling for all environments by default, but disable polling for the test environment:

# config/application.rb
config.feature_polling = true

# config/environments/test.rb
config.feature_polling = false

We echo the environment setting in our application layout:

<head>
  <%= tag :meta, name: 'feature:polling', content: Rails.configuration.feature_polling %>
</head>

Now polling is disabled by default for all tests.
Our test suite has immediately become faster and more stable.
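
The same early return carries over to the other features listed at the top of this card. Here is a minimal sketch for autocomplete, assuming a <meta name="feature:autocomplete"> tag in the layout. attachAutocomplete, debounce and the fetchSuggestions callback are hypothetical names, and the 200 ms delay is an arbitrary choice. The document is passed in as a parameter only so the functions can also be tried outside a browser:

```javascript
// Same check as the card's isFeatureDisabled(), with an injectable document.
function isFeatureDisabled(feature, doc = globalThis.document) {
  const meta = doc.querySelector(`meta[name='feature:${feature}']`)
  // A missing <meta> tag means the feature stays enabled.
  return meta !== null && meta.content === 'false'
}

// Returns a function that delays calls to `fn` until `delay` milliseconds
// have passed without another call.
function debounce(fn, delay) {
  let timer
  return function(...args) {
    clearTimeout(timer)
    timer = setTimeout(() => fn(...args), delay)
  }
}

// Wires up autocomplete for a search field, unless the flag is disabled.
// `fetchSuggestions` is a hypothetical callback that requests suggestions
// for the given query string.
// Returns true if autocomplete was attached, false otherwise.
function attachAutocomplete(input, fetchSuggestions, doc = globalThis.document) {
  if (isFeatureDisabled('autocomplete', doc)) return false

  const requestSuggestions = debounce(() => fetchSuggestions(input.value), 200)
  input.addEventListener('input', requestSuggestions)
  return true
}
```

With the flag disabled, typing into the search field no longer triggers any background requests, so a test for the search results only sees the request it caused itself.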

Enabling features for individual tests

We now enable the polling feature for the one E2E test that tests polling.

In an RSpec feature spec this would look like this:

scenario 'The project list is updated periodically' do
  # Enable polling for this test
  allow(Rails.configuration).to receive(:feature_polling).and_return(true)
  
  # Go to the projects index and see an empty list.
  visit projects_path
  expect(page).to have_text('No projects yet')
  
  # When another user creates a project it automatically appears in our list.
  create(:project, name: 'Superproject')
  expect(page).to have_text('Superproject')
end

If you're using Cucumber, it would look like this:

Scenario: The project list is updated periodically
  Given the polling feature is enabled
  When I go to the list of projects
  Then I should see "No projects yet"
  When another user creates a project "Superproject"
  Then I should see "Superproject"

Here is the step that mocks Rails.configuration.feature_polling for that one scenario:

Given /^the (.*?) feature is enabled$/ do |feature|
  allow(Rails.configuration).to receive("feature_#{feature}").and_return(true)
end

Note

You may also want to make the polling frequency configurable. This way the test for polling does not need to wait 5 seconds to observe a change.
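
One way to do this is to echo the interval in another <meta> tag, next to the feature flag. This is a minimal sketch, assuming a hypothetical <meta name="feature:polling-interval"> tag whose content is the interval in milliseconds. pollingInterval is an illustrative helper name, and the document is passed in only so the function can be tried outside a browser:

```javascript
// Reads the polling interval in milliseconds from the (assumed)
// <meta name="feature:polling-interval"> tag.
// Falls back to 5 seconds when the tag is missing or does not contain
// a positive number.
function pollingInterval(doc = globalThis.document) {
  const meta = doc.querySelector("meta[name='feature:polling-interval']")
  const value = meta ? Number.parseInt(meta.content, 10) : NaN
  return value > 0 ? value : 5_000
}
```

The [poll] compiler could then pass pollingInterval() to setInterval() instead of the hard-coded 5_000, and the test environment could echo a much shorter interval.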

Henning Koch

makandra.de

Last edit

2024-07-11

Michael Leimstädtner

License

Source code in this card is licensed under the MIT License.

Posted by Henning Koch to makandra dev (2021-10-12 11:21)