Send workflow failures to any tool via webhooks
If it can receive a webhook, it can get a CheckyWorky alert.
Why webhooks
Custom routing to any tool or service
Integrate with incident management tools (PagerDuty, Opsgenie, etc.)
Trigger automatic rollbacks, notifications, or downstream workflows
What you'll receive
{
"check_name": "Login flow",
"status": "fail",
"failing_step": "Assert dashboard loads",
"environment": "production",
"screenshot_url": "https://...",
"run_url": "https://app.checkyworky.com/...",
"timestamp": "2026-02-27T10:15:00Z"
}
Setup in 3 steps
Add endpoint URL
Enter the URL where you want to receive webhook payloads.
Set signing secret
Optionally add a secret to verify webhook authenticity.
Send a test
Send a test payload to verify everything works.
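Once the endpoint is live, the receiver only needs to parse the JSON payload shown above and route on its fields. A minimal sketch, assuming the sample field names (check_name, status, failing_step, run_url); your routing logic and destinations will differ:

```python
import json

# Minimal receiver sketch. Field names mirror the sample payload above;
# adapt the routing to your own tools.
def handle_payload(raw_body: str) -> str:
    event = json.loads(raw_body)
    if event.get("status") == "fail":
        # Surface the failing step and run URL so on-call can triage in one click.
        return (f"ALERT: {event['check_name']} failed at "
                f"{event['failing_step']} ({event['run_url']})")
    return "ok"
```

Wiring this into a serverless function or a small HTTP handler is usually enough to start forwarding failures to Slack or an incident tool.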
By the numbers
Organizations with fully implemented AI observability reported 63% less downtime and 60% faster incident resolution than those without.
New Relic, Observability Forecast (2024)
High-performing teams are significantly more likely to use automated monitoring and alerting to detect incidents quickly and reduce MTTR.
Google Cloud, DORA Accelerate State of DevOps Report (2023)
The average cost of a data breach reached $4.88 million globally.
IBM Security, Cost of a Data Breach Report (2024)
API attacks and abuse are a persistent driver of outages and security incidents, increasing the need for authentication, rate limiting, and verification on inbound integrations like webhooks.
Akamai, State of the Internet / API Security and Abuse research (2024)
Real-world examples
PagerDuty incident dedupe for a broken checkout journey
Scenario: A small SaaS runs a synthetic workflow every 3 minutes: login → add to cart → checkout. A payment provider starts returning intermittent 502s, causing multiple failures across regions. Without dedupe, every failure would page on-call.
Outcome: Webhook receiver maps failures to a stable fingerprint (check_id + step=checkout + http_status=502) and sets PagerDuty dedup_key to that fingerprint. Result: 1 incident instead of 40+ pages in an hour; MTTA drops because the first alert contains the run URL, failing step, and provider status code.
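The fingerprint in this scenario can be built from the stable parts of the failure and passed as the PagerDuty dedup_key so repeated failures collapse into one incident. A sketch, assuming the check_id, step, and http_status values from the scenario (the routing key and payload values are placeholders):

```python
import hashlib

def incident_fingerprint(check_id: str, step: str, http_status: int) -> str:
    # Same root cause -> same fingerprint -> one incident instead of 40+ pages.
    raw = f"{check_id}:{step}:{http_status}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# PagerDuty Events API v2 accepts a caller-chosen dedup_key.
pagerduty_event = {
    "routing_key": "YOUR_INTEGRATION_KEY",  # placeholder
    "event_action": "trigger",
    "dedup_key": incident_fingerprint("checkout-journey", "checkout", 502),
    "payload": {
        "summary": "Checkout failing with 502 from payment provider",
        "severity": "critical",
        "source": "checkyworky",
    },
}
```

Because the fingerprint excludes volatile details (request IDs, timestamps), every regional failure of the same step with the same status code lands on the same incident.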
Slack thread updates instead of alert spam (idempotency + fingerprint)
Scenario: A team routes webhooks to an internal “alerts-router” service that posts to Slack. The same event is retried due to a transient 503, and subsequent runs keep failing with the same auth error.
Outcome: Router stores event_id for delivery dedupe and uses fingerprint to update a single Slack message thread with “still failing” counters and latest run link. Result: alert volume reduced by ~90% during ongoing incidents; engineers can triage from one thread with the newest evidence.
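The delivery-dedupe half of this router reduces to a seen-set keyed on event_id (a hypothetical payload field; at-least-once delivery means the same event can arrive twice). A minimal in-memory sketch; production would use Redis with a TTL:

```python
class DeliveryDedupe:
    """Drop redelivered webhook events so processing stays idempotent."""

    def __init__(self):
        # In-memory set for illustration; use Redis SETEX (with a 7-30 day
        # TTL) in production so the store doesn't grow unbounded.
        self._seen = set()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            return True
        self._seen.add(event_id)
        return False
```

The incident-dedupe half (grouping new events from the same root cause) is handled separately by the fingerprint, which is exactly the split described in the key insights below.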
Auto-create Jira ticket only after confirmed failures (noise control)
Scenario: A workflow occasionally flakes due to third-party captcha. The team wants Jira tickets for real regressions, not one-off flakes.
Outcome: Webhook handler opens a Jira issue only after 3 consecutive failures or failures from 2 regions within 10 minutes. Result: fewer false-positive tickets; backlog stays clean while real regressions still generate a trackable work item within minutes.
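The confirmation rule in this outcome (3 consecutive failures, or failures from 2 regions within 10 minutes) fits in a small stateful gate. A sketch with hypothetical parameter names; thresholds are the ones from the scenario:

```python
from collections import defaultdict

class ConfirmationGate:
    """Return True (open a ticket) only after 3 consecutive failures,
    or failures from 2 distinct regions inside a 10-minute window."""

    def __init__(self, consecutive=3, regions_needed=2, window_s=600):
        self.consecutive = consecutive
        self.regions_needed = regions_needed
        self.window_s = window_s
        self.streak = defaultdict(int)          # check_id -> consecutive fails
        self.region_hits = defaultdict(list)    # check_id -> [(ts, region)]

    def record(self, check_id, region, ts, failed):
        if not failed:
            self.streak[check_id] = 0  # a pass resets the consecutive count
            return False
        self.streak[check_id] += 1
        hits = self.region_hits[check_id]
        hits.append((ts, region))
        recent_regions = {r for t, r in hits if ts - t <= self.window_s}
        return (self.streak[check_id] >= self.consecutive
                or len(recent_regions) >= self.regions_needed)
```

One-off captcha flakes never reach the threshold, while a real regression confirmed across runs or regions creates a ticket within minutes.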
Safe incident automation: feature-flag rollback gated by conditions
Scenario: After a deploy, the onboarding workflow starts failing at “Create workspace” due to an API validation bug. The team uses feature flags and wants fast mitigation.
Outcome: Webhook triggers an incident + runbook. If failures exceed a threshold and the deploy is within the last 30 minutes, automation flips a feature flag to revert the new onboarding path, then posts confirmation to the incident channel. Result: customer impact window shrinks from ~45 minutes (manual detection + rollback) to ~10–15 minutes (automated mitigation + verification run).
Key insights
1. Design webhook delivery as at-least-once: retries will happen, so idempotency (event_id/dedupe_key) is not optional if you want reliable incident automation.
2. Separate delivery dedupe (same event resent) from incident dedupe (new events caused by the same underlying failure) to prevent both duplicate processing and alert storms.
3. Version your payload schema and keep it additive. This lets small teams evolve routing logic without breaking existing receivers or automation scripts.
4. Signatures should be computed over the raw body (plus timestamp) and verified with constant-time comparison; timestamp validation meaningfully reduces replay risk for shared-secret webhooks.
5. Noise control beats "more alerts": require confirmation (multiple consecutive failures, multi-region confirmation, or error-class thresholds) before paging or opening tickets.
6. Send references, not secrets: payloads should include URLs to evidence (run links, screenshots) rather than embedding PII, tokens, or raw request headers.
7. A lightweight "alerts router" service (or serverless function) often pays off for 2–15 person teams: one place for dedupe, routing rules, rate limiting, and formatting across Slack/PagerDuty/Jira.
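Insight 4 above can be sketched concretely. This assumes a hypothetical signing scheme where the signature is HMAC-SHA256 over "<timestamp>.<raw_body>" hex-encoded; check your provider's documentation for the exact format:

```python
import hashlib
import hmac
import time

TOLERANCE_S = 300  # reject payloads older than 5 minutes to limit replay risk

def verify_webhook(secret: bytes, raw_body: bytes, timestamp: str,
                   signature: str, now=None) -> bool:
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > TOLERANCE_S:
        return False  # stale or future-dated timestamp: treat as a replay
    # Sign the raw body, never a re-serialized JSON object (whitespace and
    # key order differences would break verification).
    expected = hmac.new(secret, timestamp.encode() + b"." + raw_body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the signature via timing.
    return hmac.compare_digest(expected, signature)
```

Verify before parsing: reject unauthenticated payloads at the edge so downstream automation only ever sees trusted events.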
Pro tips
💡 Implement a tiny webhook receiver (Cloudflare Workers/AWS Lambda) that: verifies HMAC + timestamp, persists event_id for 7–30 days, and forwards to Slack/PagerDuty/Jira with a fingerprint-based dedupe key.
💡 Use two thresholds: one for "notify" (e.g., first failure to Slack) and one for "page" (e.g., 2 consecutive failures or multi-region confirmation). This keeps humans focused while still capturing early signals.
💡 Normalize errors before fingerprinting (strip request IDs, timestamps, dynamic IDs) so "same root cause" groups correctly and your dedupe key stays stable across retries and runs.
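The normalization tip above amounts to stripping volatile tokens before hashing. A sketch with a few illustrative regexes (extend them for whatever dynamic values appear in your own error messages):

```python
import re

def normalize_error(msg: str) -> str:
    """Replace volatile tokens so the same root cause always yields
    the same normalized message (and therefore the same fingerprint)."""
    # UUIDs (e.g., request IDs)
    msg = re.sub(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                 r"[0-9a-f]{4}-[0-9a-f]{12}\b", "<uuid>", msg)
    # ISO-8601 timestamps
    msg = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*", "<timestamp>", msg)
    # Long numeric IDs (order numbers, trace IDs)
    msg = re.sub(r"\b\d{4,}\b", "<id>", msg)
    return msg
```

Feed the normalized message into your fingerprint hash; without this step, every retry produces a "new" error and dedupe silently stops working.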
How CheckyWorky compares
vs Datadog Synthetics
Strong enterprise observability suite, but teams often need additional configuration to translate synthetic failures into workflow-specific incident routing. CheckyWorky’s focus is on end-to-end SaaS workflow monitoring and sending failure context (step, run URL, artifacts) cleanly via webhooks for custom automation.
vs Checkly
Developer-centric checks with flexible scripting; webhooks are possible, but many teams still build their own dedupe/routing layer. CheckyWorky emphasizes “pretend customer” workflows and practical incident payload fields (fingerprints, run context) designed for downstream deduplication and automation.
vs UptimeRobot
Great for simple uptime/HTTP checks and basic alerts, but limited for multi-step user journeys (login → action → payment) and rich failure artifacts. CheckyWorky is optimized for workflow failures and sending actionable context to any tool via webhooks.