Flaky Tests

CI/CD Watch parses JUnit XML test reports to track individual test case stability over time. This lets you identify which specific tests are flaky, how often they fail, and how much they're costing you in wasted pipeline runs.

How Test Results Are Collected

CI providers supply JUnit XML files from pipeline artifacts. CI/CD Watch supports both <testsuites> and <testsuite> as the root element.

Each test case's status is derived from its child elements:

Child Element	Status
`<failure>`	Failed
`<error>`	Error
`<skipped>`	Skipped
None	Passed

Test Classification

Tests are classified using the same thresholds as pipeline stability:

Classification	Criteria
Broken	Failure rate ≥ 80%
Flaky	Flip rate ≥ 30% andfailure rate > 0%
Healthy	Everything else

Sorting Priority

Broken tests are shown first as the most urgent issues to address, followed by flaky tests, then healthy tests. This is the opposite of pipeline stability where flaky pipelines are shown first, for tests, a consistently broken test is typically a higher priority since it blocks every run it appears in.

Failure Details

CI/CD Watch captures the last failure message and stack trace for each test case. This helps you debug failures without having to dig through CI logs, you can see the error message and stack trace directly alongside the test's stability metrics.

Flaky Tests Table

The dedicated flaky tests page (under Stability → Flaky Tests) presents all flaky and broken tests in a sortable table. Each row shows the test name, pipeline, failure count, total runs, flip rate, failure rate, and classification status.

Sort by any column to find the worst offenders, for example, sort by failure rate descending to see the most consistently failing tests first. The table supports 7, 30, and 90-day time windows and uses server-side sorting for fast page loads.

Waste Impact

Each flaky or broken test contributes to wasted pipeline runs. When a test fails intermittently, developers either rerun the entire pipeline or learn to ignore failures , both outcomes are costly.

The flaky tests page shows the estimated cost per test, combining compute charges and developer wait time from the pipeline runs affected by each test's failures. This helps you prioritize which tests to fix based on their real financial impact rather than just their failure rate. See cost calculations for details on how costs are estimated.

Pipeline Stability , how flaky tests affect pipeline-level stability classifications
PR Health , the impact of test failures on pull request workflows
Cost Calculations , how waste from flaky reruns is estimated
Performance Ratings , how slow tests affect pipeline performance