For tech leads

CI/CD for tech leads: know where to spend fix-it time

Every pipeline auto-classified healthy, flaky, or broken. Flaky tests ranked by impact. Slow pipelines flagged. Decide fix-it sprint priorities with data, not vibes.

Start free · See pricing

cicd.watch/stability · Stability · live

Pipelines by stability state · last 30 days

build.yml · gh · platform-api · main · 142 runs
e2e-tests · gh · web · main · 38% flip rate · ~$840/mo
nightly-build · jenkins · legacy-erp · main · 8 days red
deploy · bb · mobile · env:prod · 56 runs
integration.yml · gh · platform-api · pull_request · ~$420/mo

Stability · Performance · Flaky tests

Three views every tech lead uses for prioritising

Stop guessing where to invest fix-it time. The same connect gives you three capabilities, each ranked so the worst offenders are obvious.

Stability classification

Every pipeline auto-classified healthy, flaky, or broken.

Stability classification runs across every connected provider. Healthy means consistently green. Flaky means intermittent failure on unchanged code. Broken means sustained failure. The classifier uses run history and retry behaviour, not a single flapping commit.

Three durable states (healthy, flaky, broken) per pipeline
Trend detection on each: improving, steady, degrading
Re-run flip rate factored in so retries don't hide the problem

Outcome: a defensible list of which pipelines need attention this sprint.

What's needed: Pipelines must have run in the last 30 days for classification.
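As a rough illustration of how a classifier along these lines could work, the sketch below labels a pipeline from its recent runs. The `Run` record, the `flip_rate` and `classify` functions, and both thresholds are assumptions made for this example only; they are not the product's actual rules or values.

```python
from dataclasses import dataclass


@dataclass
class Run:
    commit: str   # commit SHA the run executed against
    passed: bool  # outcome of this attempt


def flip_rate(runs: list[Run]) -> float:
    """Share of re-run commits whose attempts disagree (both pass and fail seen)."""
    outcomes: dict[str, set[bool]] = {}
    attempts: dict[str, int] = {}
    for run in runs:
        outcomes.setdefault(run.commit, set()).add(run.passed)
        attempts[run.commit] = attempts.get(run.commit, 0) + 1
    rerun_commits = [c for c, n in attempts.items() if n > 1]
    if not rerun_commits:
        return 0.0
    flipped = sum(1 for c in rerun_commits if len(outcomes[c]) == 2)
    return flipped / len(rerun_commits)


def classify(runs: list[Run],
             flaky_flip_threshold: float = 0.10,  # hypothetical threshold
             broken_streak: int = 5) -> str:      # hypothetical threshold
    """Label a pipeline from its runs, ordered oldest to newest."""
    if not runs:
        return "unclassified"
    recent = runs[-broken_streak:]
    # Sustained failure: the most recent runs are all red.
    if len(recent) == broken_streak and all(not r.passed for r in recent):
        return "broken"
    # Intermittent failure on unchanged code: re-runs flip the outcome.
    if flip_rate(runs) >= flaky_flip_threshold:
        return "flaky"
    return "healthy"
```

Checking the sustained-failure case before the flip rate means a pipeline that has been red for days is reported as broken even if it flipped earlier in the window.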

cicd.watch/stability

Pipeline stability · current state

Healthy · Flaky · Broken

nightly-build · jenkins · legacy-erp · 8 days red · degrading
e2e-tests · gh · web · 38% flip rate · degrading
integration.yml · gh · platform-api · 24% flip rate · steady

Pipelines split by state; trend arrow on each shows whether it's improving or degrading

Performance ratings

Per-pipeline performance scoring. Spot the slow outliers.

Performance ratings score every pipeline on p50 and p95 duration against its peers. Slow outliers surface at the top so you know which pipelines are quietly eating engineering hours.

Per-pipeline rating durable across runs, so day-to-day variance doesn't mislabel
p50 and p95 both shown so you see the typical case and the worst case
Rate-of-change badge: getting slower, getting faster, holding

Outcome: the three pipelines worth profiling first, with the data to back the prioritisation.

What's needed: A few weeks of run history. New pipelines show as 'establishing baseline' for the first 30 days.
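To make the percentile idea concrete, here is a minimal sketch that computes p50 and p95 from a list of run durations and rates a pipeline against its peers. The function names and the `slow_factor` / `fast_factor` cut-offs are hypothetical; the page does not state how far above its peers a pipeline must be to be rated slow.

```python
from statistics import median, quantiles


def p50_p95(durations_s: list[float]) -> tuple[float, float]:
    """Return (p50, p95) of run durations in seconds; needs at least two runs."""
    cuts = quantiles(durations_s, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[94]


def rate(pipeline_p95: float, peer_p95s: list[float],
         slow_factor: float = 1.5,            # hypothetical cut-off
         fast_factor: float = 0.5) -> str:    # hypothetical cut-off
    """Rate a pipeline by comparing its p95 with the median p95 of its peers."""
    baseline = median(peer_p95s)
    if pipeline_p95 > slow_factor * baseline:
        return "Slow"
    if pipeline_p95 < fast_factor * baseline:
        return "Fast"
    return "OK"
```

Comparing against the peer median rather than a fixed number of minutes keeps the rating relative, which is one way to read "well above its peers".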

cicd.watch/performance

Performance ratings · sorted by p95

Pipeline	p50	p95	Rating	Trend
nightly-build (legacy-erp)	18m 22s	32m 04s	Slow	getting slower
e2e-tests (web)	8m 14s	14m 41s	Slow	steady
integration.yml (platform-api)	6m 02s	11m 18s	OK	getting faster
build.yml (platform-api)	3m 41s	5m 12s	Fast	steady

Pipelines sorted by p95 duration; trend arrow signals getting slower or faster

Flaky test ranking

Flaky tests ranked by impact. Failure rate and flip rate combined.

Flaky tests are ranked by a combined score: failure rate plus flip rate. A test that fails often but consistently is broken, not flaky. A test with a low failure rate but a high flip rate is what disrupts merging. The ranking is sortable, exportable, and scoped per repo or across the estate.
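One way to picture the combined score is the sketch below. It assumes the score is a plain sum of the two rates and uses made-up cut-offs to separate "broken" from "flaky"; the real weighting is not given here, and `FlakeStats`, `flaky_score`, `is_broken`, and `rank_flaky` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class FlakeStats:
    name: str
    failure_rate: float  # fraction of runs that failed in the last 30 days
    flip_rate: float     # fraction of same-commit re-runs that changed outcome


def flaky_score(t: FlakeStats) -> float:
    # Assumption: an unweighted sum of the two signals.
    return t.failure_rate + t.flip_rate


def is_broken(t: FlakeStats,
              high_failure: float = 0.5,         # hypothetical cut-off
              low_flip: float = 0.05) -> bool:   # hypothetical cut-off
    # Fails often and consistently: re-runs rarely change the outcome.
    return t.failure_rate >= high_failure and t.flip_rate <= low_flip


def rank_flaky(tests: list[FlakeStats]) -> list[FlakeStats]:
    """Drop broken and never-flipping tests, then sort by score, worst first."""
    flaky = [t for t in tests if not is_broken(t) and t.flip_rate > 0]
    return sorted(flaky, key=flaky_score, reverse=True)
```

Filtering out consistently failing tests before ranking keeps the list focused on tests where a re-run changes the answer.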

Top 10 flakies surface the disruption; long-tail list available behind it
Cost-of-quality column: estimated dollar cost per flaky test per month
JUnit-style report parsing across GitHub Actions, GitLab, and the others

Outcome: a fix-it backlog already ranked by where the money is.

What's needed: JUnit / xUnit-format test results uploaded as a CI artifact. Native on most stacks.
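For readers unfamiliar with the report format, the following sketch shows how per-test outcomes can be read from a JUnit-style XML report with Python's standard library, and how two reports for the same commit could be compared to spot a flip. The sample report and the helper names `failed_tests` and `flipped` are invented for illustration; this is not a description of how CI/CD Watch parses reports.

```python
import xml.etree.ElementTree as ET

# Invented sample report in the common JUnit XML shape.
SAMPLE = """<testsuite name="web" tests="2">
  <testcase classname="checkout" name="pays with card">
    <failure message="timeout">waited 30s for payment iframe</failure>
  </testcase>
  <testcase classname="checkout" name="shows cart"/>
</testsuite>"""


def failed_tests(junit_xml: str) -> dict[str, bool]:
    """Map 'classname.name' to True when the testcase failed or errored."""
    root = ET.fromstring(junit_xml)
    results: dict[str, bool] = {}
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = failed
    return results


def flipped(first: dict[str, bool], second: dict[str, bool]) -> list[str]:
    """Tests present in both attempts on the same commit whose outcome changed."""
    return sorted(n for n in first.keys() & second.keys() if first[n] != second[n])
```

Feeding the first attempt and a re-run of the same commit through `failed_tests` and then `flipped` yields the tests that changed outcome without a code change.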

cicd.watch/stability/flaky-tests

Flaky tests · top 5 by cost-of-quality

Test	Failure	Flip rate	Cost/mo	Trend
checkout.spec.ts (web)	11%	38%	$840	up
auth/login.test.ts (web)	7%	29%	$420	steady
e2e/cart.e2e.ts (web)	6%	24%	$310	down
api/orders.test.ts (platform-api)	4%	18%	$180	steady
legacy/import.test.ts (legacy-erp)	3%	14%	$120	steady

Flaky tests ranked by combined score: failure rate plus flip rate, monetised

Same connect, more depth

How we work with tech leads

Three more cuts the same data unlocks. Each one helps tech leads pick the right fight rather than the loudest one.

Cost of quality

Cost-of-quality view: flaky reruns and retry compute in dollars

Every flaky test has a dollar cost: compute spent on reruns plus developer wait time. The fix-it ROI per test makes the case for spending the sprint on it.

checkout.spec.ts · ~$840/mo
auth/login.test.ts · ~$420/mo
e2e/cart.e2e.ts · ~$310/mo
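The arithmetic behind a per-test dollar figure can be sketched as follows: reruns caused by failures, each costing some compute plus some engineer wait time. The function name, its parameters, and the rule that every failure triggers exactly one rerun with one engineer waiting are all assumptions for illustration, and the input values in the usage note are invented rather than taken from the figures above.

```python
def monthly_flaky_cost(runs_30d: int,
                       failure_rate: float,
                       rerun_minutes: float,
                       compute_rate_per_min: float,
                       engineer_hour_rate: float) -> float:
    """Estimated 30-day cost of one flaky test, in dollars.

    Assumes each failure triggers one rerun and one engineer waits for it.
    """
    reruns = runs_30d * failure_rate
    compute_per_rerun = rerun_minutes * compute_rate_per_min
    wait_per_rerun = (rerun_minutes / 60) * engineer_hour_rate
    return reruns * (compute_per_rerun + wait_per_rerun)
```

With 400 runs, an 11% failure rate, 15-minute reruns, $0.008 per compute minute, and a $75 engineer-hour rate, the estimate comes to about $830, almost all of it wait time. That is why the engineer-hour rate matters more than the compute price in a calculation like this.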

See cost-of-quality →

PR cycle time

PR cycle time per repo highlights the CI-shaped problems

Repos with long PR cycle time usually have a CI/CD-shaped problem: slow pipelines, flaky reruns, or merge queues stalling on broken trunk. Sort repos by cycle time and the fix-it targets are obvious.

platform-api · 2d 14h p95
mobile-ios · 1d 06h p95
web · 7h 22m p95

See PR health →

DORA by repo

DORA-by-repo trend shows which repos are getting better

DORA broken out per repo with 7/30/90-day trend. Each metric rated Elite / High / Medium / Low against the 2024 DORA Report thresholds. Useful for tech leads owning two or three repos: you see your own numbers without estate noise.

platform-api · Lead time · Elite ↑
platform-api · Change fail rate · High →
web · Deployment frequency · Medium ↓

See DORA metrics →

All from one connect

Plus the rest of the toolkit

Stability classification is the headline feature for tech leads. The same connect also gives you DORA, cost tracking, PR health, performance, Slack, CLI, and an MCP server.

Stability classification →

Every pipeline auto-classified healthy, flaky, or broken. Trend detection on each.

Flaky tests →

Ranked by failure rate and flip rate. Cost-of-quality column included.

Performance ratings →

Per-pipeline scoring. p50 and p95 both shown.

Cost tracking →

Compute and wait time per pipeline, dollar-normalised across providers. Team tier.

PR health →

Per-repo CI failure rates, reviewer wait time, and PR-to-deploy latency. Team tier.

DORA metrics →

All five DORA metrics per repo with 7/30/90-day trend lines.

Slack notifications →

Stability regressions and broken-trunk alerts in your team channel. Team tier.

MCP server →

Hook Claude, Cursor, or any AI agent into live pipeline state.

Pricing

Flat per tenant

Start free for one team. Team and Business tiers are flat monthly rates per tenant. Enterprise is custom for organisations needing SSO, audit logging, security review, and on-premise connector deployment.

Free

For one team getting started with up to 3 repos.

$0/month

Start free

3 repos
1 team member
Stability classification
DORA metrics, flaky-test ranking
Email support

Team

Flat rate per tenant. Up to 20 repos and 10 team members.

$29/month

Start Team trial

20 repos
10 team members
Everything in Free
Cost tracking with full history
PR health, performance ratings
Slack notifications, CLI, MCP server

Business

Flat rate per tenant. Up to 100 repos and 50 team members.

$99/month

Start Business trial

100 repos
50 team members
Everything in Team
Audit findings and cost-optimization opportunities
Priority support
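The tier caps above can be expressed as a small lookup. This sketch picks the smallest published tier that fits a given repo and team-member count; the `TIERS` table copies the caps and prices listed on this page, while the function name and the choice to return `None` beyond the Business caps are assumptions.

```python
from typing import Optional

# (name, price per month in USD, max repos, max team members), as listed above.
TIERS = [
    ("Free", 0, 3, 1),
    ("Team", 29, 20, 10),
    ("Business", 99, 100, 50),
]


def smallest_tier(repos: int, members: int) -> Optional[str]:
    """Return the cheapest published tier whose caps cover the given usage."""
    for name, _price, max_repos, max_members in TIERS:
        if repos <= max_repos and members <= max_members:
            return name
    return None  # beyond the published caps; no flat tier fits
```

For example, 4 repos with a single team member already exceeds the Free caps, so the function returns "Team".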

Comparison

How CI/CD Watch compares for tech leads

This comparison assumes a tech lead who owns two or three repos and wants stability, flakiness, and DORA in one view. Headline pricing only; deeper feature comparisons live on the linked pages.

	CI/CD Watch	BuildPulse	LinearB	Datadog CI Visibility
Headline price	$29 / mo flat (Team)	From $99 / mo (flat tier)	$29 / contributor / mo (Essentials, annual)	From $8 / committer / mo + per-span overages
Pipeline stability classification	Yes, healthy/flaky/broken	Pipeline level	Limited	Pipeline traces
Flaky test ranking with cost	Yes, dollar-ranked	Yes, test-level depth	No	Yes, test-level
Performance ratings per pipeline	Yes, p50 + p95	No	Cycle time only	Yes, trace-based
DORA metrics included	Yes, all five	No	Yes	Add-on
Cost of quality (dollars)	Yes, compute + wait time	No	Wait time only	Compute only
Pricing model	Flat per tenant	Flat tiers	Per contributor	Per committer + spans

Competitor pricing reflects each vendor's published headline rate. See the linked comparison pages for fuller feature matrices and verified sources.

3 stability states per pipeline
2 signals per flaky-test score
p95 duration in every performance rating
Flat per-tenant pricing

FAQ

Tech lead specifics

How does stability classification decide healthy vs flaky vs broken?

Each pipeline is classified on its recent runs. Healthy is consistently green. Flaky is intermittent failure on the same code (failure with no relevant change between attempts). Broken is sustained failure that needs a fix. The classifier looks at run history, retry behaviour, and whether re-runs flip outcomes without code changes. Thresholds are documented at /docs/pipeline-stability.

How is flakiness scored?

Two signals are combined: failure rate over the last 30 days and flip rate (how often a re-run on the same commit changes the outcome). Tests that fail often but consistently are broken, not flaky. Tests with a low failure rate but a high flip rate are the most disruptive flakies. The ranking is sortable on the flaky-tests view.

How should I plan a fix-it sprint with this data?

Sort the flaky-test ranking by cost impact (compute reruns plus developer wait time). Pick the top three to five; they typically account for the majority of disruption. Pair that with the broken-pipeline list filtered to your repos. The cost-of-quality view tells you the dollar return on fixing each one.

What counts as a slow outlier in performance ratings?

Performance ratings are percentile-based and computed per pipeline. A pipeline rated slow has a p95 duration well above its peers in the same repo or above similar pipelines across the estate. The rating is durable across runs, so day-to-day variance doesn't mislabel a pipeline.

Does this replace BuildPulse?

BuildPulse goes deep on test-level flake history (test names, stack traces, owner mapping). CI/CD Watch covers pipeline-level stability plus test-level flake ranking, with DORA and cost in the same view. For teams whose primary pain is test-debt triage with deep per-test ownership and quarantine workflows, BuildPulse may sit alongside CI/CD Watch. For teams who want stability, performance, cost, and DORA in one place at a flat rate, CI/CD Watch replaces it. See /compare/buildpulse for the side-by-side.

How does PR cycle time per repo help with priorities?

Repos with long PR cycle times often have a CI/CD-shaped problem: slow pipelines, flaky reruns, or merge queues that stall on broken trunk. Sort repos by cycle time and the worst offenders are usually the right fix-it targets. The view shows median and p95 so you don't over-react to a single outlier.

What providers are supported?

GitHub Actions, GitLab CI, Bitbucket Pipelines, CircleCI, Azure DevOps, and Jenkins. Connect each one separately; everything aggregates into one stability view. None of them needs an agent: CI/CD Watch connects to your provider with OAuth or an access token.

Are the cost-of-quality numbers real dollars?

Yes. Compute cost is calculated from the provider's pricing model (per-minute, credits, slots, or your Jenkins rate). Wait time is calculated from a configurable engineer-hour rate. The flaky-test cost view multiplies failure rate by average rerun cost across the last 30 days. Set rates under Settings.

How does pricing work?

Flat per tenant: $0 Free, $29 Team, $99 Business per month. Repo and team-member caps differ per tier (3/1 on Free, 20/10 on Team, 100/50 on Business). Consumption inside those caps is unmetered. Enterprise is available for SSO, audit log, and security review.

Read, compare, or get started

Guide

Pipeline stability

How stability classification works, the thresholds for healthy/flaky/broken, and how trend detection runs.

Guide

Flaky tests

How CI/CD Watch detects, scores, and ranks flaky tests, plus the JUnit-style formats supported.

Blog

Flaky tests: what they are, why they happen, how to fix them

Flaky tests are not noise. They are signal that branching, test ownership, or environment hygiene is weaker than it looks.

Blog

The real cost of flaky tests (compute + wait time)

The real cost of flaky tests is compute reruns plus developer wait time. The maths on why every flake pays twice.

Blog

CI/CD monitoring: beyond watching pipelines go green

What CI/CD monitoring should actually surface for the team owning the pipelines, and where most dashboards stop short.

Blog

What are DORA metrics and why should you track them?

The four (now five) signals from DORA Research that measure how well a software team delivers. What each one means and how to read them.

Explore other use cases

See how CI/CD Watch helps every role in your engineering org.

Stop guessing which pipelines to fix.

Connect what you've got in two minutes per provider. Stability, flaky tests, and performance ratings ranked by where the money is.

Start free · Talk to us

Explore other use cases

For Developers

For Engineering Managers

For Platform, DevOps & SRE

For DevSecOps

For AI-assisted development
