For tech leads

CI/CD for tech leads: know where to spend fix-it time

Every pipeline auto-classified healthy, flaky, or broken. Flaky tests ranked by impact. Slow pipelines flagged. Decide fix-it sprint priorities with data, not vibes.

cicd.watch/stability · Stability · live

Pipelines by stability state · last 30 days

build.yml · gh · platform-api · main · 142 runs
e2e-tests · gh · web · main · 38% flip rate · ~$840/mo
nightly-build · jenkins · legacy-erp · main · 8 days red
deploy · bb · mobile · env:prod · 56 runs
integration.yml · gh · platform-api · pull_request · ~$420/mo

Stability · Performance · Flaky tests

Three views every tech lead uses for prioritising

Stop guessing where to invest fix-it time. Three capabilities the same connect gives you, ranked so the worst offenders are obvious.

1

Stability classification

Every pipeline auto-classified healthy, flaky, or broken.

Stability classification runs across every connected provider. Healthy means consistently green. Flaky means intermittent failure on unchanged code. Broken means sustained failure. The classifier uses run history and retry behaviour, not a single flapping commit.

  • Three durable states (healthy, flaky, broken) per pipeline
  • Trend detection on each: improving, steady, degrading
  • Re-run flip rate factored in so retries don't hide the problem

Outcome: a defensible list of which pipelines need attention this sprint.

What's needed: Pipelines must have run in the last 30 days for classification.
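
To make the three states concrete, here is a minimal sketch of how a classifier along these lines could work. It is not the product's implementation: the `Run` record, the function names, and every threshold are assumptions made for illustration, and the real thresholds are the ones documented at /docs/pipeline-stability.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    commit: str   # commit SHA the run executed against
    attempt: int  # 1 for the first run, 2+ for re-runs of the same commit
    passed: bool


# Hypothetical thresholds, chosen only for illustration.
BROKEN_STREAK = 5                # this many consecutive failures counts as sustained
FLAKY_FLIP_RATE = 0.10           # share of re-run commits whose outcome flipped
HEALTHY_MAX_FAILURE_RATE = 0.05  # above this, failures are no longer "consistently green"


def flip_rate(runs: list[Run]) -> float:
    """Fraction of re-run commits whose attempts disagreed on the outcome."""
    outcomes: dict[str, list[bool]] = defaultdict(list)
    for run in runs:
        outcomes[run.commit].append(run.passed)
    rerun = [o for o in outcomes.values() if len(o) > 1]
    if not rerun:
        return 0.0
    flipped = sum(1 for o in rerun if len(set(o)) > 1)
    return flipped / len(rerun)


def classify(runs: list[Run]) -> str:
    """Classify one pipeline from its run history (oldest run first)."""
    if not runs:
        raise ValueError("no runs in the classification window")
    recent = runs[-BROKEN_STREAK:]
    if len(recent) == BROKEN_STREAK and not any(r.passed for r in recent):
        return "broken"
    failure_rate = sum(not r.passed for r in runs) / len(runs)
    # Simplification: intermittent failures without re-runs are also treated as flaky.
    if flip_rate(runs) >= FLAKY_FLIP_RATE or failure_rate > HEALTHY_MAX_FAILURE_RATE:
        return "flaky"
    return "healthy"
```

Grouping attempts by commit is what keeps retries from hiding the problem: a commit that fails once and passes on re-run still counts as a flip, even though its final status is green.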

cicd.watch/stability

Pipeline stability · current state

Healthy: 34 · Flaky: 9 · Broken: 3

nightly-build · jenkins · legacy-erp · 8 days red · degrading
e2e-tests · gh · web · 38% flip rate · degrading
integration.yml · gh · platform-api · 24% flip rate · steady
Pipelines split by state; trend arrow on each shows whether it's improving or degrading
2

Performance ratings

Per-pipeline performance scoring. Spot the slow outliers.

Performance ratings score every pipeline on p50 and p95 duration against its peers. Slow outliers surface at the top so you know which pipelines are quietly eating engineering hours.

  • Per-pipeline rating durable across runs, so day-to-day variance doesn't mislabel
  • p50 and p95 both shown so you see the typical case and the worst case
  • Rate-of-change badge: getting slower, getting faster, holding

Outcome: the three pipelines worth profiling first, with the data to back the prioritisation.

What's needed: A few weeks of run history. New pipelines show as 'establishing baseline' for the first 30 days.
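
As a rough sketch of percentile-based rating, the snippet below computes p50 and p95 from run durations, rates each pipeline's p95 against the median p95 of its peers, and derives a rate-of-change badge from two windows. The function names and all cut-off ratios are hypothetical; the page does not publish the actual scoring rules.

```python
from statistics import median, quantiles

# Hypothetical cut-offs, for illustration only.
SLOW_RATIO = 1.5        # p95 at least 1.5x the peer median p95 is rated Slow
FAST_RATIO = 0.75       # p95 at most 0.75x the peer median p95 is rated Fast
TREND_TOLERANCE = 0.10  # a p95 change within 10% counts as holding


def p50_p95(durations: list[float]) -> tuple[float, float]:
    """Median and 95th-percentile duration in seconds (needs at least two runs)."""
    cuts = quantiles(durations, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[94]


def rate(p95_by_pipeline: dict[str, float]) -> dict[str, str]:
    """Rate each pipeline's p95 against the median p95 of its peers."""
    peer_p95 = median(p95_by_pipeline.values())
    ratings = {}
    for name, p95 in p95_by_pipeline.items():
        ratio = p95 / peer_p95
        if ratio >= SLOW_RATIO:
            ratings[name] = "Slow"
        elif ratio <= FAST_RATIO:
            ratings[name] = "Fast"
        else:
            ratings[name] = "OK"
    return ratings


def trend(earlier_p95: float, recent_p95: float) -> str:
    """Rate-of-change badge comparing p95 in two consecutive windows."""
    change = (recent_p95 - earlier_p95) / earlier_p95
    if change > TREND_TOLERANCE:
        return "getting slower"
    if change < -TREND_TOLERANCE:
        return "getting faster"
    return "holding"
```

Rating on a window of runs rather than the latest run is what makes the label durable: one unusually slow run moves p95 a little, but rarely enough to cross a cut-off.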

cicd.watch/performance

Performance ratings · sorted by p95

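| Pipeline | Repo | p50 | p95 | Rating | Trend |
| --- | --- | --- | --- | --- | --- |
| nightly-build | legacy-erp | 18m 22s | 32m 04s | Slow | getting slower |
| e2e-tests | web | 8m 14s | 14m 41s | Slow | steady |
| integration.yml | platform-api | 6m 02s | 11m 18s | OK | getting faster |
| build.yml | platform-api | 3m 41s | 5m 12s | Fast | steady |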
Pipelines sorted by p95 duration; trend arrow signals getting slower or faster
3

Flaky test ranking

Flaky tests ranked by impact. Failure rate and flip rate combined.

Flaky tests are ranked by a combined score: failure rate plus flip rate. A test that fails often and consistently is broken, not flaky. A test that fails rarely but flips on re-run is what disrupts merging. The ranking is sortable, exportable, and scoped per repo or across the estate.

  • Top 10 flakies surface the disruption; long-tail list available behind it
  • Cost-of-quality column: estimated dollar cost per flaky test per month
  • JUnit-style report parsing across GitHub Actions, GitLab CI, and the other supported providers

Outcome: a fix-it backlog already ranked by where the money is.

What's needed: JUnit / xUnit-format test results uploaded as a CI artifact. Native on most stacks.
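
The ranking idea can be sketched in a few lines. This example takes "failure rate plus flip rate" literally as an unweighted sum, which is an assumption: the page does not state the actual weighting. The `CaseStats` record and the broken-test cut-offs are likewise hypothetical. The sample figures are the ones shown in the flaky-tests view on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseStats:
    name: str
    failure_rate: float  # failed executions / total executions, 0..1
    flip_rate: float     # share of same-commit re-runs that changed outcome, 0..1


# Hypothetical cut-offs for "fails often and consistently", for illustration only.
BROKEN_MIN_FAILURE_RATE = 0.90
BROKEN_MAX_FLIP_RATE = 0.05


def is_broken(case: CaseStats) -> bool:
    """High, consistent failure means the test is broken rather than flaky."""
    return (case.failure_rate >= BROKEN_MIN_FAILURE_RATE
            and case.flip_rate <= BROKEN_MAX_FLIP_RATE)


def flaky_score(case: CaseStats) -> float:
    """Combined score: failure rate plus flip rate (unweighted, by assumption)."""
    return case.failure_rate + case.flip_rate


def rank_flaky(cases: list[CaseStats], top: int = 10) -> list[CaseStats]:
    """Drop broken tests, then return the highest-scoring flaky tests first."""
    candidates = [c for c in cases if not is_broken(c)]
    return sorted(candidates, key=flaky_score, reverse=True)[:top]


sample = [
    CaseStats("api/orders.test.ts", 0.04, 0.18),
    CaseStats("checkout.spec.ts", 0.11, 0.38),
    CaseStats("legacy/import.test.ts", 0.03, 0.14),
    CaseStats("auth/login.test.ts", 0.07, 0.29),
    CaseStats("e2e/cart.e2e.ts", 0.06, 0.24),
]
ranked_names = [c.name for c in rank_flaky(sample)]
# checkout.spec.ts, auth/login.test.ts, e2e/cart.e2e.ts,
# api/orders.test.ts, legacy/import.test.ts
```

Filtering broken tests out before scoring keeps a test that always fails from crowding the top of a list meant for intermittent failures.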

cicd.watch/stability/flaky-tests

Flaky tests · top 5 by cost-of-quality

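| Test | Repo | Failure | Flip rate | Cost/mo | Trend |
| --- | --- | --- | --- | --- | --- |
| checkout.spec.ts | web | 11% | 38% | $840 | up |
| auth/login.test.ts | web | 7% | 29% | $420 | steady |
| e2e/cart.e2e.ts | web | 6% | 24% | $310 | down |
| api/orders.test.ts | platform-api | 4% | 18% | $180 | steady |
| legacy/import.test.ts | legacy-erp | 3% | 14% | $120 | steady |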
Flaky tests ranked by combined score: failure rate plus flip rate, monetised

Same connect, more depth

How we work with tech leads

Three more cuts the same data unlocks. Each one helps tech leads pick the right fight rather than the loudest one.

Cost of quality

Cost-of-quality view: flaky reruns and retry compute in dollars

Every flaky test has a dollar cost: compute spent on reruns plus developer wait time. The fix-it ROI per test makes the case for spending the sprint on it.

checkout.spec.ts · ~$840/mo
auth/login.test.ts · ~$420/mo
e2e/cart.e2e.ts · ~$310/mo
See cost-of-quality

PR cycle time

PR cycle time per repo highlights the CI-shaped problems

Repos with long PR cycle time usually have a CI/CD-shaped problem: slow pipelines, flaky reruns, or merge queues stalling on broken trunk. Sort repos by cycle time and the fix-it targets are obvious.

platform-api · 2d 14h p95
mobile-ios · 1d 06h p95
web · 7h 22m p95
See PR health

DORA by repo

DORA-by-repo trend shows which repos are getting better

DORA broken out per repo with 7/30/90-day trend. Each metric rated Elite / High / Medium / Low against the 2024 DORA Report thresholds. Useful for tech leads owning two or three repos: you see your own numbers without estate noise.

platform-api · Lead time · Elite ↑
platform-api · Change fail rate · High →
web · Deployment frequency · Medium ↓
See DORA metrics

All from one connect

Plus the rest of the toolkit

Stability classification is the headline feature for tech leads. The same connect also gives you DORA, cost tracking, PR health, performance, Slack, CLI, and an MCP server.

Stability classification

Every pipeline auto-classified healthy, flaky, or broken. Trend detection on each.

Flaky tests

Ranked by failure rate and flip rate. Cost-of-quality column included.

Performance ratings

Per-pipeline scoring. p50 and p95 both shown.

Cost tracking

Compute and wait time per pipeline, dollar-normalised across providers. Team tier.

PR health

Per-repo CI failure rates, reviewer wait time, and PR-to-deploy latency. Team tier.

DORA metrics

All five DORA metrics per repo with 7/30/90-day trend lines.

Slack notifications

Stability regressions and broken-trunk alerts in your team channel. Team tier.

MCP server

Hook Claude, Cursor, or any AI agent into live pipeline state.

Pricing

Flat per tenant

Start free for one team. Team and Business tiers are flat monthly rates per tenant. Enterprise is custom for organisations needing SSO, audit logging, security review, and on-premise connector deployment.

Free

For one team getting started with up to 3 repos.

$0/month
Start free
  • 3 repos
  • 1 team member
  • Stability classification
  • DORA metrics, flaky-test ranking
  • Email support
Most popular

Team

Flat rate per tenant. Up to 20 repos and 10 team members.

$29/month
Start Team trial
  • 20 repos
  • 10 team members
  • Everything in Free
  • Cost tracking with full history
  • PR health, performance ratings
  • Slack notifications, CLI, MCP server

Business

Flat rate per tenant. Up to 100 repos and 50 team members.

$99/month
Start Business trial
  • 100 repos
  • 50 team members
  • Everything in Team
  • Audit findings and cost-optimization opportunities
  • Priority support

Comparison

How CI/CD Watch compares for tech leads

For a tech lead who owns two or three repos and wants stability, flakiness, and DORA in one view. Headline pricing only; deeper feature comparisons live on the linked pages.

| Feature | CI/CD Watch | BuildPulse | LinearB | Datadog CI Visibility |
| --- | --- | --- | --- | --- |
| Headline price | $29 / mo flat (Team) | From $99 / mo (flat tier) | $29 / contributor / mo (Essentials, annual) | From $8 / committer / mo + per-span overages |
| Pipeline stability classification | Yes, healthy/flaky/broken | Pipeline level | Limited | Pipeline traces |
| Flaky test ranking with cost | Yes, dollar-ranked | Yes, test-level depth | No | Yes, test-level |
| Performance ratings per pipeline | Yes, p50 + p95 | No | Cycle time only | Yes, trace-based |
| DORA metrics included | Yes, all five | No | Yes | Add-on |
| Cost of quality (dollars) | Yes, compute + wait time | No | Wait time only | Compute only |
| Pricing model | Flat per tenant | Flat tiers | Per contributor | Per committer + spans |

Competitor pricing reflects each vendor's published headline rate. See the linked comparison pages for fuller feature matrices and verified sources.

3 stability states per pipeline
2 signals per flaky-test score
p95 duration in every performance rating
Flat per-tenant pricing

FAQ

Tech lead specifics

How does stability classification decide healthy vs flaky vs broken?
Each pipeline is classified on its recent runs. Healthy is consistently green. Flaky is intermittent failure on the same code (failure with no relevant change between attempts). Broken is sustained failure that needs a fix. The classifier looks at run history, retry behaviour, and whether re-runs flip outcomes without code changes. The thresholds are documented at /docs/pipeline-stability.
How is flakiness scored?
Two signals are combined: failure rate over the last 30 days and flip rate (how often a re-run on the same commit changes outcome). Tests that fail often but consistently are broken, not flaky. Tests that fail rarely but flip often are the most disruptive flakies. The ranking is sortable on the flaky-tests view.
How should I plan a fix-it sprint with this data?
Sort the flaky-test ranking by cost impact (compute reruns plus developer wait time). Pick the top three to five; they typically account for the majority of disruption. Pair that with the broken-pipeline list filtered to your repos. The cost-of-quality view tells you the dollar return on fixing each one.
What counts as a slow outlier in performance ratings?
Performance is per-pipeline percentile-based. A pipeline rated slow has p95 duration well above its peers in the same repo or against similar pipelines across the estate. The rating is durable across runs so day-to-day variance doesn't mislabel a pipeline.
Does this replace BuildPulse?
BuildPulse goes deep on test-level flake history (test names, stack traces, owner mapping). CI/CD Watch covers pipeline-level stability plus test-level flake ranking, with DORA and cost in the same view. For teams whose primary pain is test-debt triage with deep per-test ownership and quarantine workflows, BuildPulse may sit alongside. For teams who want stability, performance, cost, and DORA in one place at a flat rate, CI/CD Watch replaces it. See /compare/buildpulse for the side-by-side.
How does PR cycle time per repo help with priorities?
Repos with long PR cycle times often have a CI/CD-shaped problem: slow pipelines, flaky reruns, or merge queues that stall on broken trunk. Sort repos by cycle time and the worst offenders are usually the right fix-it targets. The view shows median + p95 so you don't over-react to a single tail.
What providers are supported?
GitHub Actions, GitLab CI, Bitbucket Pipelines, CircleCI, Azure DevOps, and Jenkins. Connect each one separately; everything aggregates into one stability view. No agents are installed in any of them: CI/CD Watch connects to each provider with OAuth or an access token.
Are the cost-of-quality numbers real dollars?
Yes. Compute cost is calculated from the provider's pricing model (per-minute, credits, slots, or your Jenkins rate). Wait time is calculated from a configurable engineer-hour rate. The flaky-test cost view multiplies failure rate by average rerun cost across the last 30 days. Set rates under Settings.
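
As an illustration of the arithmetic, the sketch below prices one rerun as compute plus engineer wait time, then scales it by failure rate. The function names, the per-month run count, and the rates passed in are assumptions for the example, not product defaults. Multiplying by run count is also an assumption about how a failure rate turns into a monthly figure.

```python
def rerun_cost(rerun_minutes: float,
               compute_rate_per_minute: float,
               engineer_rate_per_hour: float) -> float:
    """Dollar cost of one rerun: compute spend plus developer wait time."""
    compute = rerun_minutes * compute_rate_per_minute
    wait = (rerun_minutes / 60) * engineer_rate_per_hour
    return compute + wait


def monthly_flaky_cost(runs_per_month: int,
                       failure_rate: float,
                       avg_rerun_cost: float) -> float:
    """Expected monthly cost: failing runs times the average cost of a rerun."""
    return runs_per_month * failure_rate * avg_rerun_cost


# Hypothetical inputs: an 8-minute rerun, $0.008 per compute minute,
# and a $90 engineer-hour rate.
one_rerun = rerun_cost(8, 0.008, 90)
monthly = monthly_flaky_cost(600, 0.11, one_rerun)
```

With inputs like these, wait time dominates the total, which is why the engineer-hour rate under Settings matters more to the ranking than the provider's per-minute price.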
How does pricing work?
Flat per tenant: $0 Free, $29 Team, $99 Business per month. Repo and team-member caps differ per tier (3/1 on Free, 20/10 on Team, 100/50 on Business). Consumption inside those caps is unmetered. Enterprise is available for SSO, audit log, and security review.

Stop guessing which pipelines to fix.

Connect what you've got in two minutes per provider. Stability, flaky tests, and performance ratings ranked by where the money is.