Lightning Talk: Coverage based Test Case Prioritization in Ruby on Rails


For my computer science bachelor's thesis I programmed and evaluated a CLI Test Case Prioritization (TCP) tool for makandra. It was written as a Ruby gem and was tested and evaluated against one Ruby on Rails project. This card summarizes the research results, the evaluation and the CLI tool.

The code has been published for educational purposes on GitHub. The German bachelor's thesis is also included for download at the end.


This card is roughly structured as follows:

1. Research approach
2. Background
3. Implementation
   3.1 Usage
   3.2 Implementation details
4. Evaluation
   4.1 Strategy
   4.2 Results
5. Practical limitations and possible improvements
6. Outlook
7. Feedback & Questions

Research approach

To find out which prioritization strategy is the most suitable, I described the problem domain to derive criteria which a tool for TCP should satisfy.

Then I introduced several strategies to prioritize test cases and assessed which of them satisfy the criteria. Based on this evaluation I picked one strategy, implemented it and evaluated it with different metrics. After that I drew conclusions on its practicability and usage limitations.

Background

Problem domain

The tool should be suitable for modern web based development workflows with Ruby on Rails.

Criteria

  • Failing tests should be reported faster than with random prioritization, to lower regression testing costs.
  • The tool should incorporate the testing data from any test run, so that no extra test runs are required.
  • The prioritization should be able to use the current modifications, to prioritize test cases which might be failing due to these modifications.
  • Newly added test cases have to be granted a high priority within the calculated prioritization.

Test Case Prioritization

The test suite is ordered in such a way that tests with higher priority are placed before tests with lower priority. The priority is assessed by different evaluation metrics and generally aims at achieving a faster error detection rate.

Test case prioritization strategies

I introduced the following strategies for TCP:

  • Coverage based
  • Fault based
  • Cost based
  • History based
  • Requirement based
  • Search based
  • Machine learning
  • Combinations of methods

For brevity and a concise demonstration of the results I only include an introduction to coverage-based TCP.

Coverage based TCP

Coverage is a metric that expresses how well a given program is covered by its test suite. It can be measured at different granularities: function, statement, branch or condition coverage.

The linked image in the SimpleCov gem gives a practical example of how coverage is visualized:

Coverage example

Coverage-based TCP generally tries to detect tests affected by modifications by checking which tests have coverage for a modified line of code. The detected specs can then be ordered by different strategies. The most common strategies to achieve this are total and additional coverage.

Total coverage sorts these tests by their total coverage value, while additional coverage tries to maximize coverage across the complete project for all modified files.

      T1   T2   T3   T4   T5   T6
F1     0   10    5    0    2    0
F2     1    6    0    4    0    0
F3     0    1    4    0    0    0
F4     0    2   20   12    0    0
F5     1    1    0    0    3    9

Table 1: This table shows how many times a test Tn covers a file Fm. If files 2, 3 and 4 had been modified, only test cases 1-4 would be selected and prioritized, while every other test would be added randomly afterwards. Total coverage would select T3 > T4 > T1 > T2 > T5 > T6, while additional coverage could prioritize the tests as T2 > T3 > T1 > T4 > T5 > T6.

Selection of a strategy

In the preselection, coverage-based, search-based and model-based approaches as well as combinations of different prioritization methods remained as candidates. History- and fault-based approaches could be used as a secondary prioritization criterion.

I have chosen the coverage-based approach since it is one of the most widely researched approaches and is used as a baseline within research to compare approaches to each other. This can serve as a foundation for further research into this topic within this problem domain.

Implementation

Usage

  1. First you have to require spec_helper_coverage.rb, which hooks the measurement of the coverage into every executed spec:
require 'testsort'

coverage_measurement = Testsort::CoverageMeasurement.new
coverage_measurement.start

RSpec.configure do |config|
  config.after do |example|
    coverage_measurement.cover(example)
  end

  config.after(:suite) do
    coverage_measurement.finish
  end
end
  2. Once enough data has been recorded, you can use it to prioritize your specs by calling bundle exec prioritized. It will read out the current modifications and the saved coverage data, prioritize the specs by that and pass them to RSpec to run in that order.

Optional: You can also use the --strategy parameter to choose a strategy, but more on that later.

Recording of the coverage-data

Granularity

The implementation uses a file-based granularity for the coverage, which means the coverage is mapped per file as laid out in Table 1. This reduces the problem of having to update a complex line-based coverage mapping for new changes, so fewer updates are necessary.

Measurement

The coverage is measured after every spec run by using Ruby's internal Coverage module. The coverage results of every spec are merged together into one matrix. After a spec finishes, the results of the Coverage module have to be cleared so that the next spec only includes current data.
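As a rough illustration (not the gem's actual code), per-spec coverage could be collected with the Coverage module roughly like this; the helper name and the matrix layout are assumptions:

require 'coverage'

Coverage.start  # start collecting line coverage for files loaded afterwards

# Merge the coverage of a single finished spec into a test-to-file hit matrix
# and clear the counters so the next spec only records its own data.
def record_spec_coverage(example_id, matrix)
  result = Coverage.result(stop: false, clear: true)  # Ruby >= 2.6
  result.each do |file, lines|
    hits = lines.compact.sum  # collapse per-line counts into one per-file count
    (matrix[example_id] ||= Hash.new(0))[file] += hits if hits > 0
  end
end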

After all specs are finished, the matrix has to be combined with the results from previous runs. If a file has been modified, only the results of the new test run are used for that file.

Then this data structure is saved so that it can be read out and used for prioritization. The matrix is saved in a binary format, while the columns are saved as a hash mapping files to their index within the matrix.
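A minimal sketch of such a persistence step, assuming Marshal is used for the binary dump (the file names and helpers are hypothetical):

require 'fileutils'

# Persist the coverage matrix in a binary format together with the
# file-to-column mapping, and read both back before prioritization.
def save_coverage(matrix, columns, dir: 'coverage_data')
  FileUtils.mkdir_p(dir)
  File.binwrite(File.join(dir, 'matrix.bin'), Marshal.dump(matrix))
  File.binwrite(File.join(dir, 'columns.bin'), Marshal.dump(columns))
end

def load_coverage(dir: 'coverage_data')
  [Marshal.load(File.binread(File.join(dir, 'matrix.bin'))),
   Marshal.load(File.binread(File.join(dir, 'columns.bin')))]
end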

Prioritization

Once the command to run a prioritized test suite has been called, the gem Rugged is used to read out the changes as listed by git status, which is sufficient since only file-based coverage is used.
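A minimal sketch of that step with Rugged (the helper is hypothetical, not the gem's exact code):

require 'rugged'

# Collect the paths that `git status` would list for the working directory.
def modified_files(repo_path = '.')
  repo = Rugged::Repository.new(repo_path)
  files = []
  repo.status do |path, status|
    files << path unless status.include?(:ignored)  # skip ignored files defensively
  end
  files
end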

After that the data structure of the coverage data is initialized. It is then used to select all tests which have coverage for the modified files. Now a prioritization of the selected tests can be calculated, by which they are ordered. All tests that were not selected are added in random order to the prioritized test suite.
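The selection step could look roughly like this, assuming the matrix maps each test to its per-file hit counts as in Table 1 (the names are hypothetical):

# Keep only the tests that hit at least one of the modified files.
def select_tests(matrix, modified_files)
  matrix.select do |_test, hits_per_file|
    modified_files.any? { |file| hits_per_file.fetch(file, 0) > 0 }
  end
end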

Strategies

Three strategies have been used to calculate the prioritization: total, additional and pseudo-additional.

Total

Order the selected tests by their total count of hits.

Additional

  1. Order all selected tests by the number of modified files hit and then by their total count.
  2. Iteratively select the tests which add new coverage information to the project, such that covered files are only covered once within the current iteration.
  3. Reset the covered files with every iteration.
  4. Repeat until all preselected tests have been added.

Pseudo additional

Use the ordering from step 1 of the additional strategy as the prioritization.
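A hedged sketch of the additional and pseudo-additional orderings (not the gem's exact code); it assumes `selected` is the output of a selection step that maps each test to its per-file hit counts, like the hypothetical select_tests above:

require 'set'

def prioritize(selected, modified_files, strategy: :additional)
  # Per test: the set of modified files it covers and its total hit count on them.
  covers = selected.transform_values do |hits_per_file|
    files = modified_files.select { |f| hits_per_file.fetch(f, 0) > 0 }.to_set
    { files: files, total: files.sum { |f| hits_per_file[f] } }
  end

  # Step 1: order by number of modified files hit, then by total hit count.
  initial = covers.keys.sort_by { |t| [-covers[t][:files].size, -covers[t][:total]] }
  return initial if strategy == :pseudo_additional

  ordered = []
  remaining = initial.dup
  until remaining.empty?
    seen = Set.new                                    # step 3: reset the covered files
    remaining.dup.each do |test|
      next if (covers[test][:files] - seen).empty?    # step 2: must add new coverage
      seen.merge(covers[test][:files])
      ordered << test
      remaining.delete(test)
    end
  end
  ordered
end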

Evaluation

Used metrics

Average percentage fault detection (APFD)

This metric expresses how fast a given test suite of n tests is able to find m errors within a program. If you look at a normalized fault detection graph, APFD is roughly the area under the graph, which means for fast detection it is close to 1 and for an even detection rate close to 0.5:
Image
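For reference, the standard definition of APFD for a prioritized suite of n tests detecting m faults is:

APFD = 1 - (TF_1 + TF_2 + ... + TF_m) / (n * m) + 1 / (2 * n)

where TF_i is the position within the prioritized suite of the first test that reveals fault i.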

Time until first failing spec

I also recorded the time until the first failing spec result is reported. This is useful because this is the moment a developer can start working on the error.

Strategy

I checked out every available commit in the repository and set the project to a working state for that commit by installing all dependencies. Then I reset only the specs to the commit before the changes of the current commit were introduced, and started running and recording data for all the strategies.

Afterwards I had to clean the data because a lot of commits were not able to produce valid results.

The advantage of this approach is that it produces a lot of errors within tests that happen with actual changes from developers. Much academic research within this area depends on a fixed set of predefined errors. Another big advantage is that this can give insight into the robustness of these methods to specific kinds of changes.

Image

This picture gives an example of how prioritization strategies can vary for every commit. From left to right the strategies are: random, absolute, additional and pseudo-additional.

Results

Scatter plots and histograms of each strategy

Image
Image

Box plots

Image

Mean and median values

                    APFD mean   APFD median   1stFTT mean   1stFTT median
Random              0.5         0.52          01:42         01:10
Absolute            0.91        0.94          00:48         00:28
Additional          0.85        0.89          00:31         00:15
Pseudo-additional   0.89        0.93          00:40         00:19

1stFTT = First failing test time

Comparison

  • Every strategy produced higher APFD values than the random prioritization in the ranking: Absolute > Pseudo-additional > Additional > Random
  • The additional strategy's APFD values have the widest dispersion.
  • Every method yields time savings: even the worst method had an average saving of 00:54 minutes and the best 01:11 minutes.
  • For 25 commits per day this would be an average time saving of 16:30 minutes per week.
  • Pseudo-additional strategy has been the most robust across all metrics.
  • The additional and absolute strategies each outperform in only one area, while having a disadvantage in the other.

Limitations

  • The evaluation should be repeated with a variance analysis across several projects.
  • The time values have not been normalized to the total execution time of the testsuite.
  • A cost benefit analysis could give greater insight into the amortization of costs and about the actual time savings.

Practical limitations and possible improvements

  • The evaluation method still fakes errors and does not include edge cases from real-life scenarios.
    • More testing and evaluation with the actual workflow of developers is required.
  • Currently does not support parallel tests
  • Several files would have to be kept up to date in version control
  • Would have to be supported within CI/CD
  • Changes in JavaScript and CSS lead to an insufficient mapping to features, even though template support has been enabled for Coverage. This support still gives a benefit because changes in JavaScript often correlate with changes in HTML.
    • This may partly also be influenced by the fact that features are harder to prioritize than unit tests, because they have many dependencies across the project and often group together longer interactions. Splitting features into shorter interactions could help a little with this, but might make features less meaningful.
      • A significant number of commits had to be discarded since the results were not representative; in the JS-heavy project this was about 2/5 of the commits.
  • Testsuite has to be triggered manually
    • Could be triggered after some changes
  • Test order dependencies should still be recognizable
  • Currently the tests that were not preselected are not added randomly at the end
  • Edge cases with version control, since currently it only reads out git status, which may change drastically after commits or rebases.
    • A more complex logic would be required to read out the difference.
  • Merging of results is only allowed to happen when a complete test file has been run and should be blocked for individually run tests.
    • More complex mapping or data structure
  • Coverage is only used to find tests which have coverage for specific modifications; it could also be interesting to find tests where the coverage differs from the latest commit before the current branch.
  • Deletion of code files has to be identified correctly. Currently this data is not used for prioritization, even though the affected tests still need to produce adequate results.
  • RSpec does not report the results instantly
  • The method uses a rather coarse granularity for the mapping of tests to code files. Finer granularity had better prioritization results in previous research, but requires a much more detailed data structure for representing changes within the version control.

Outlook

  • It could currently be tested in some projects. This might require adequate usage of the tool.
    • It might need some smaller improvements to support more dependencies; currently it has only been tested with Ruby 3.2 and Rails 7.0.4.2.
    • Depending on the value, a next phase of rolling it out as a gem could be started.
  • It could be possible to publish an article in Ruby Weekly.
  • Further academic work would be possible on this topic, but it is not planned at the moment.
    • Other methods could be implemented, evaluated and compared.
    • Variations and enhancements of the current algorithms could be tested.
Posted by Felix Eschey to makandra dev (2023-09-18 18:33)