Jan 31, 2026

Prometheus: Practical Guide & Mental Model

Prometheus is a pull-based monitoring system and time-series database designed for reliable metrics collection, alerting, and exploration. This guide gives you a compact mental model and the practical pieces you need to operate it at scale.

[Diagram: Targets (up, http_requests_total, node_cpu_seconds_total, probe_success, latency_bucket) -> Prometheus (scrape loop, relabel, TSDB blocks, rule engine) -> Outputs (rate(), sum by(), histogram_quantile(), alert rules)]
Prometheus pulls metrics, rewrites labels, stores time-series locally, evaluates rules, and serves PromQL to dashboards and alerts.

1) What Prometheus Is (and isn’t)

Prometheus is great for infrastructure metrics, application telemetry, and alerting. It is not a long-term log archive or a general-purpose OLAP system.

Key properties:

  • Single static binary (cross-platform)
  • Pulls metrics over HTTP
  • Stores time-series locally
  • Labels as first-class dimensions
  • PromQL for queries

2) Architecture in One Diagram

[ Linux / Windows / Apps ]
          |
          |  expose /metrics (HTTP)
          v
[ Exporters / Instrumented Apps ]
          |
          |  scrape (HTTP GET)
          v
[ Prometheus Server ]
          |
          |  PromQL queries
          v
[ Grafana / Alerts ]

3) “Scrape” Means Prometheus Pulls

In Prometheus, a scrape is when Prometheus initiates an HTTP request to a target and pulls metrics.

Concretely:

Prometheus  ──HTTP GET──▶  http://target:port/metrics

What that implies:

  • The target does nothing proactively
  • The target only exposes /metrics
  • Prometheus controls when, how often, and how long it waits

Scrape loop (per target):

every scrape_interval:
  start timer
  GET /metrics
  parse text format
  store time-series
  stop timer

That’s why you see:

  • scrape_interval: how often Prometheus scrapes a target
  • scrape_timeout: the max time Prometheus waits for a scrape to finish
  • scrape_duration_seconds: how long the last scrape actually took (a metric Prometheus records per target, not a config setting)
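
The scrape loop above can be sketched in a few lines of Python. This is only an illustration of the pull model, not how Prometheus is implemented: the function name scrape_once and the returned dictionary are made up for this example, and the real server also parses the body and stores the samples.

```python
import time
import urllib.request


def scrape_once(url: str, timeout_s: float) -> dict:
    """Perform one pull: GET the metrics URL and time how long it took.

    Hypothetical helper for illustration; the keys mirror the `up` and
    `scrape_duration_seconds` series Prometheus records for each target.
    """
    started = time.monotonic()
    try:
        # timeout_s plays the role of scrape_timeout.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8")
        up = 1
    except OSError:
        # Connection errors, HTTP error statuses, and timeouts all land here.
        body = ""
        up = 0
    duration = time.monotonic() - started
    return {"up": up, "scrape_duration_seconds": duration, "body": body}
```

Calling scrape_once in a loop that sleeps for scrape_interval between calls gives the per-target loop described above. A failed pull still produces a result with up set to 0, which is the health signal discussed next.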

Pull vs Push (contrast):

  • Prometheus (Pull / Scrape): Prometheus calls you, centralized control, easier debugging, safer at scale
  • Push systems: Apps push metrics out, harder governance, more network + retry complexity

Prometheus can accept pushed metrics via Pushgateway, but that’s the exception, not the norm.

Why pull matters operationally:

  • Centralized scrape schedules
  • Uniform auth and TLS
  • Easier service discovery
  • Built-in health signal (up)

If a scrape fails, Prometheus records up = 0 for that target right away. In push systems, silence can look like “everything is fine.”

One sentence to remember:

In Prometheus, a “scrape” is Prometheus pulling metrics over HTTP from a target.

4) Exporters: Turning State Into Metrics

Exporters translate system or app state into Prometheus metrics.

Node Exporter (Linux)

Common metrics:

  • CPU: node_cpu_seconds_total
  • Memory: node_memory_MemFree_bytes
  • Disk: node_disk_io_time_seconds_total, node_disk_read_bytes_total
  • Network: node_network_receive_bytes_total

Example:

node_disk_io_time_seconds_total{device="sda"} 104296

Windows Exporter

Common metrics:

  • windows_cpu_time_total
  • windows_memory_available_bytes
  • windows_logical_disk_free_bytes

Application & Batch Exporters

Libraries exist for Java, Python, Go, and Node.js. Example batch metrics:

process_cpu_seconds_total 5.73
worker_jobs_total{status="processed"} 1570222
worker_jobs_total{status="failed"} 155665
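
To make "translate state into metrics" concrete, here is a minimal sketch that renders in-memory values into the text format shown above using only the standard library. The function render_metrics is hypothetical, and it assumes label values contain no quotes, backslashes, or newlines (a real client library escapes those and also emits HELP and TYPE lines).

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text-format lines.

    Illustrative only: no escaping, no HELP/TYPE metadata.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort label names so the output is stable between calls.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# The batch metrics from the example above, as in-memory state.
batch_state = [
    ("process_cpu_seconds_total", {}, 5.73),
    ("worker_jobs_total", {"status": "processed"}, 1570222),
    ("worker_jobs_total", {"status": "failed"}, 155665),
]
print(render_metrics(batch_state), end="")
```

In practice you would use an official client library rather than hand-rolling this; the sketch only shows that an exporter is a thin translation layer in front of an HTTP endpoint.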

Blackbox Exporter (active probing)

For endpoints that don’t expose /metrics, the Blackbox Exporter performs HTTP/TCP/ICMP/TLS probes and exposes results as scrapeable metrics.

Prometheus Targets: Quick Mapping

Key point: Prometheus does NOT auto-detect processes. A running process alone gives Prometheus nothing (Linux or Windows). You must expose metrics or use exporters.

Need                             | What Prometheus needs    | How to get it                               | Notes
---------------------------------|--------------------------|---------------------------------------------|------------------------------------------------
Node.js app metrics              | /metrics endpoint        | Instrument app (prom-client)                | Prometheus scrapes the app directly
Linux host metrics               | Host metrics endpoint    | Install node-exporter                       | CPU, RAM, disk, etc.
Windows host metrics             | Host metrics endpoint    | Install windows_exporter                    | Prometheus can’t see Windows processes by itself
Availability / ping / port check | Blackbox probe           | blackbox_exporter (ICMP, HTTP, TCP)         | Use for URL, ping, port-open checks
URL uptime                       | HTTP probe               | blackbox exporter (http_2xx)                | Returns success/failure, latency
Ping / ICMP                      | ICMP probe               | blackbox exporter (icmp)                    | Requires ICMP permissions
Port open (e.g., 443, 27017)     | TCP probe                | blackbox exporter (tcp_connect)             | Validates port reachability
MongoDB metrics                  | MongoDB metrics endpoint | mongodb_exporter                            | Not automatic; needs exporter
Custom app metrics               | /metrics endpoint        | Add a custom exporter or instrument the app | Prometheus only scrapes exposed metrics

5) Metric Types (What to Use and When)

Counter

Monotonically increasing; resets on restart.

  • process_cpu_seconds_total
  • http_requests_total

Gauge

Can go up and down.

  • node_memory_MemFree_bytes
  • queue_depth

Histogram (preferred for latency)

Bucketed distribution; aggregatable across instances.

  • http_request_duration_seconds_bucket
  • http_request_duration_seconds_sum
  • http_request_duration_seconds_count
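
The three series fit together because classic histogram buckets are cumulative: each _bucket series counts every observation less than or equal to its le bound. A small sketch of that bookkeeping follows; the class name and bucket bounds are made up for illustration and this is not a client library API.

```python
import math


class Histogram:
    """Toy cumulative histogram mirroring the _bucket / _sum / _count series."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and ends up equal to _count.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, le in enumerate(self.bounds):
            if value <= le:
                # Cumulative: bump every bucket whose bound covers the value.
                self.bucket_counts[i] += 1


h = Histogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds
for seconds in (0.05, 0.3, 2.0):
    h.observe(seconds)
print(h.bucket_counts)  # [1, 2, 2, 3]
```

Because the buckets are plain counters, they can be summed across instances before computing a quantile, which is why histograms aggregate and summaries do not.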

Summary (use carefully)

Client-side quantiles; not aggregatable across instances.

6) Labels: Prometheus’ Core Data Model

Every metric is a name + labels:

metric_name{label1="value1", label2="value2"} value @ timestamp

Examples:

node_disk_io_time_seconds_total{device="sda", instance="linux-1"} 104296
http_requests_total{method="GET", status="200", service="api"} 982734

Labels give you slicing, aggregation, and multi-dimensional queries. Every distinct combination of label values is its own time series, so too many labels or label values means high cardinality, which costs memory, CPU, and storage.

7) Prometheus Configuration (prometheus.yml)

Global config

global:
  scrape_interval: 10s

Static scrape configs

scrape_configs:
  - job_name: "linux"
    static_configs:
      - targets: ["ip-linux:9100"]

  - job_name: "batch"
    static_configs:
      - targets: ["web-app:8080"]

  - job_name: "windows"
    static_configs:
      - targets: ["win-2019:9182"]

job_name is attached automatically as the job label on every series scraped from that job, which is useful for grouping.

Custom scrape example (Node.js app)

Definition: a custom scrape is any job you define in scrape_configs for a target that is specific to your environment (a service, exporter, device, or endpoint you decide to monitor), beyond the default examples.

What a custom scrape could represent (examples):

  • A Node.js API exposing /metrics
  • A Python worker exposing /metrics
  • A Go service exposing /metrics
  • A Linux node exporter (:9100)
  • A database exporter (Postgres, Redis, MySQL)
  • A load balancer or proxy exporter (Nginx, HAProxy)
  • A message queue exporter (Kafka, RabbitMQ)
  • A Kubernetes component (kube-state-metrics, cAdvisor)
  • A blackbox probe (HTTP/TCP/ICMP checks)
  • A third-party SaaS metrics endpoint

Practical example: your Node.js service exposes http://localhost:3000/metrics (via prom-client or similar). Prometheus will hit that URL every 30s, wait up to 30s for a response, and attach the label env="dev" to all ingested series from that target.

Quick sanity checks:

Sample /metrics output (what Prometheus sees):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1243
process_cpu_seconds_total 5.73
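
To see what "parse text format" involves for output like the sample above, here is a deliberately small parser sketch. It is an assumption-laden simplification: it skips comment lines, ignores optional timestamps, and does not handle escaped quotes inside label values, all of which the real parser supports.

```python
import re

# name, optional {labels}, then the value; no timestamp support in this sketch.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from text-format lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP / TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the sample output above yields two samples: http_requests_total with its method and status labels, and process_cpu_seconds_total with no labels.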

Verify scrape success in PromQL:

up{job="nodejs-app"}

Scrape config for this example:

scrape_configs:
  - job_name: "nodejs-app"
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - localhost:3000
        labels:
          env: dev

    # auth (choose one)
    # no auth (default): nothing to set
    # bearer_token: "YOUR_TOKEN"
    # bearer_token_file: /etc/prometheus/token
    # basic_auth:
    #   username: "user"
    #   password: "pass"
    # authorization:
    #   type: Bearer
    #   credentials: "YOUR_TOKEN"

    # tls / https (if needed)
    # scheme: https
    # tls_config:
    #   ca_file: /etc/prometheus/ca.pem
    #   cert_file: /etc/prometheus/client.pem
    #   key_file: /etc/prometheus/client.key
    #   insecure_skip_verify: false
    #   (if true, Prometheus skips TLS certificate verification;
    #    useful for self-signed certs in dev, but unsafe for prod)

8) Blackbox Exporter (Active Probing)

The Blackbox Exporter lets Prometheus monitor things that don’t expose /metrics themselves (URLs, ports, ICMP ping, TLS checks). Prometheus still scrapes the exporter; the exporter actively probes targets and returns results as metrics.

What it’s used for:

  • URL uptime and latency (HTTP/HTTPS)
  • TCP port availability
  • ICMP ping reachability
  • TLS handshake and certificate checks

Minimal wiring model:

Prometheus ──scrape──▶ Blackbox Exporter ──probe──▶ Target

9) Ping Monitoring Architecture (Blackbox / ICMP)

Ping-style monitoring in Prometheus is typically done via the Blackbox Exporter, which probes targets and exposes results for Prometheus to scrape.

High-level flow:

Prometheus ──scrape──▶ Blackbox Exporter
                 │
                 └──probe (ICMP/HTTP/TCP)──▶ Target

Key idea: Prometheus still pulls from the exporter; the exporter sends probes to the target and reports success, latency, and errors as metrics.

Typical setup:

  1. Run Blackbox Exporter (in the same network as Prometheus or near targets)
  2. Configure scrape_configs with metrics_path: /probe
  3. Pass module and target as query params
  4. Query probe_success, probe_duration_seconds, and probe_icmp_*

Example config snippet:

scrape_configs:
  - job_name: "ping"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 1.1.1.1
          - 8.8.8.8
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

PromQL checks:

probe_success{job="ping"}
probe_duration_seconds{job="ping"}

Use this for reachability and latency checks across networks; it complements service-level metrics rather than replacing them.

10) Port Monitoring Architecture (TCP Checks)

Port monitoring is also done via Blackbox Exporter, using a module built on the tcp prober (such as tcp_connect) to test if a port is reachable (and optionally perform a simple handshake).

High-level flow:

Prometheus ──scrape──▶ Blackbox Exporter
                 │
                 └──probe (TCP connect)──▶ Target:Port

Typical setup:

  1. Run Blackbox Exporter
  2. Configure scrape_configs with metrics_path: /probe
  3. Set module: [tcp_connect]
  4. Query probe_success and probe_duration_seconds

Example config snippet:

scrape_configs:
  - job_name: "ports"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 10.0.1.10:22
          - 10.0.1.20:5432
          - example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

PromQL checks:

probe_success{job="ports"}
probe_duration_seconds{job="ports"}

Use this for reachability and basic port availability; it complements service metrics and deeper health checks.

11) MongoDB Monitoring (Exporter)

MongoDB monitoring in Prometheus is typically done via the MongoDB Exporter, which exposes MongoDB stats at /metrics for Prometheus to scrape.

High-level flow:

MongoDB ──stats──▶ MongoDB Exporter ──/metrics──▶ Prometheus

Typical setup:

  1. Run MongoDB Exporter near your database
  2. Provide a MongoDB connection URI with a read-only user
  3. Add a scrape_configs job for the exporter
  4. Query key metrics like connections, ops, and replication lag

Example config snippet:

scrape_configs:
  - job_name: "mongodb"
    static_configs:
      - targets: ["mongodb-exporter:9216"]

Common metrics to watch (exact names vary by exporter and version):

  • mongodb_connections{state="current"}
  • mongodb_op_counters_total
  • mongodb_mongod_replset_member_state
  • mongodb_replset_lag

Use this for database health, throughput, and replication visibility; pair it with application-level metrics for end-to-end views.

12) URL Monitoring (HTTP Checks)

URL monitoring in Prometheus is usually done via the Blackbox Exporter using a module built on the http prober (such as http_2xx) to check availability, status codes, and latency.

High-level flow:

Prometheus ──scrape──▶ Blackbox Exporter
                 │
                 └──probe (HTTP GET/HEAD)──▶ URL

Typical setup:

  1. Run Blackbox Exporter
  2. Configure scrape_configs with metrics_path: /probe
  3. Set module: [http_2xx] (or your custom module)
  4. Query probe_success, probe_http_status_code, probe_duration_seconds

Example config snippet:

scrape_configs:
  - job_name: "urls"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://status.example.com/health
          - http://internal-api:8080/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

PromQL checks:

probe_success{job="urls"}
probe_http_status_code{job="urls"}
probe_duration_seconds{job="urls"}

Use this for uptime, HTTP status, and latency checks; pair it with application metrics for deeper diagnostics.

13) Service Discovery (When Static Targets Don’t Scale)

File-based discovery

file_sd_configs:
  - files:
      - /etc/prometheus/targets/*.json

Example JSON:

[
  {
    "targets": ["10.0.1.5:9100"],
    "labels": {
      "env": "prod",
      "team": "infra"
    }
  }
]

Other discovery options:

  • DNS / SRV records
  • Kubernetes (pod, service, node)
  • Cloud providers (AWS, GCP, Azure)

14) Relabeling (Critical to Cost and Scale)

Relabeling happens in two stages:

Target relabeling (before scrape)

relabel_configs:
  - source_labels: [__address__]
    regex: ".*:9100"
    action: keep

Typical actions: keep, drop, replace, labelmap.

Metric relabeling (after scrape)

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "node_disk_io_time_seconds_total"
    action: drop

This is the last chance to reduce cardinality before storage.
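
The keep and drop actions used in both stages share the same matching logic, which the following sketch imitates. It is a simplified model, not Prometheus code: apply_rule is a hypothetical helper, only keep and drop are covered, and it relies on two documented behaviours (source label values are joined with ";" and the regex must match the whole joined value).

```python
import re


def apply_rule(labels, rule):
    """Apply one keep/drop rule; return the labels, or None if dropped."""
    # Missing source labels count as empty strings.
    value = ";".join(labels.get(name, "") for name in rule["source_labels"])
    # fullmatch models the implicit anchoring of relabel regexes.
    matched = re.fullmatch(rule["regex"], value) is not None
    if rule["action"] == "keep":
        return labels if matched else None
    if rule["action"] == "drop":
        return None if matched else labels
    raise ValueError(f"unsupported action in this sketch: {rule['action']}")


keep_node_exporters = {
    "source_labels": ["__address__"], "regex": ".*:9100", "action": "keep",
}
print(apply_rule({"__address__": "ip-linux:9100"}, keep_node_exporters))
print(apply_rule({"__address__": "web-app:8080"}, keep_node_exporters))  # None
```

The same function models metric relabeling if you pass a sample's labels, including __name__, instead of a target's labels.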

15) Common Relabeling Patterns

Drop noisy metrics:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "node_network_.*"
    action: drop

Remove high-cardinality labels:

metric_relabel_configs:
  - regex: "pod_uid"
    action: labeldrop

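Rename labels (this copies the value to the new name; instance is only filled in after target relabeling, so do it at the metric stage):

metric_relabel_configs:
  - source_labels: [instance]
    target_label: host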

16) PromQL: Practical Queries

CPU usage:

rate(process_cpu_seconds_total[5m])

Failed jobs over the last hour:

sum(increase(worker_jobs_total{status="failed"}[1h]))

Disk I/O by device:

rate(node_disk_io_time_seconds_total[5m])

17) Storage Model (TSDB)

Prometheus stores data locally in time-based blocks. Retention is configurable:

--storage.tsdb.retention.time=15d

18) One-line Mental Model

Prometheus scrapes /metrics, relabels targets and metrics, stores time-series locally, and lets you query everything by labels.

19) Key Takeaways

  • Exporters expose metrics over HTTP
  • Prometheus pulls metrics on a schedule
  • Labels power aggregation and filtering
  • Relabeling controls cost and scale
  • Histograms are the preferred choice for latency
  • Service discovery is essential at scale

20) The Full Lifecycle of a Sample

If you understand the lifecycle of one sample, Prometheus becomes much easier to operate.

instrumented app
  -> exposes /metrics
  -> service discovery finds target
  -> target relabeling rewrites target labels
  -> Prometheus scrapes target
  -> metric relabeling can drop or rewrite samples
  -> TSDB writes samples to WAL
  -> samples compact into blocks
  -> PromQL reads series by label matchers
  -> rules evaluate expressions
  -> alerts go to Alertmanager

Every expensive Prometheus problem usually happens at one of these points:

  • too many discovered targets
  • too many samples per scrape
  • too many labels per metric
  • too many unique label values
  • expensive PromQL over too much history
  • alert rules that evaluate too often or over too many series

21) Cardinality: The Real Scaling Limit

Prometheus scales by time series, not by metric names alone.

This is one metric name:

http_requests_total

This may be thousands or millions of series:

http_requests_total{
  method="GET",
  status="200",
  route="/users/:id",
  instance="api-17",
  pod="checkout-6f7d9c",
  namespace="prod"
}

The number of active series is roughly:

metric names x label value combinations
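
A quick worked example of that multiplication, using made-up counts of distinct values per label. The result is an upper bound for one metric name, since not every combination necessarily occurs in practice.

```python
from math import prod


def estimate_series(label_value_counts: dict) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_value_counts.values())


# Hypothetical distinct-value counts for http_requests_total.
bounded = {"method": 5, "status": 8, "route": 40, "instance": 20}
print(estimate_series(bounded))  # 32000

# One unbounded label multiplies everything that came before it.
with_user_id = {**bounded, "user_id": 100_000}
print(estimate_series(with_user_id))  # 3200000000
```

The jump from tens of thousands to billions comes from a single label, which is why label choice matters more than the number of metric names.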

Good labels:

  • job
  • service
  • instance
  • cluster
  • namespace
  • method
  • status
  • bounded route names like /users/:id

Dangerous labels:

  • user_id
  • request_id
  • session_id
  • raw URL paths with IDs
  • email addresses
  • UUIDs
  • exception messages
  • unbounded Kubernetes pod UIDs unless truly needed

Cardinality rule of thumb:

If a label value can grow with users, requests, sessions, or events, it probably does not belong on a metric.

22) PromQL Patterns That Matter

Counter rate

Use rate() for counters:

sum by (service) (
  rate(http_requests_total[5m])
)
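
Why rate() and not the raw counter? Counters reset to zero on restart, and rate() compensates for that. The sketch below shows the core idea on a handful of samples. It is a simplification under stated assumptions: the real function also extrapolates to the edges of the range window, which this version skips, and simple_rate is a name invented here.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, value) pairs, oldest first.

    Handles counter resets; does not extrapolate like PromQL's rate().
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Reset: the counter restarted from zero, so the whole new
            # value is increase since the restart.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# The counter restarts between t=15 and t=30.
points = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(points))  # (30 + 10 + 30) / 45
```

A naive "last minus first" calculation on the same points would give a negative number, which is the failure mode rate() exists to avoid.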

Error rate

sum by (service) (
  rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (service) (
  rate(http_requests_total[5m])
)

Saturation

1 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  /
  node_filesystem_size_bytes{mountpoint="/"}
)

Latency from classic histograms

histogram_quantile(
  0.95,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
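
The quantile above is an estimate: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. Here is a sketch of that calculation on cumulative bucket counts, assuming the list includes a +Inf bucket and that the lowest bucket starts at 0. The bucket values are invented, and PromQL applies the same idea to per-bucket rates rather than raw counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls past the last finite bound: report that bound.
                return prev_le
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example))  # about 0.75
```

The interpolation is why bucket bounds matter: the answer can never be more precise than the width of the bucket the rank lands in.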

Availability from blackbox probes

avg_over_time(probe_success{job="urls"}[30m])

“Is my target healthy?”

up{job="nodejs-app"} == 0

23) Recording Rules vs Alerting Rules

Prometheus rules are evaluated periodically.

Recording rules precompute PromQL and write the result back as a new time series. Use them when a query is expensive, repeated often, or used by many dashboards.

groups:
  - name: service-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, status) (
            rate(http_requests_total[5m])
          )

Alerting rules evaluate a condition and send firing alerts to Alertmanager.

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (service:http_requests:rate5m{status=~"5.."})
          /
          sum by (service) (service:http_requests:rate5m)
          > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx rate for {{ $labels.service }}"

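The for clause requires the condition to stay true across evaluations for the full duration (10 minutes here) before the alert fires, so brief spikes do not page you.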
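
The pending-then-firing behaviour of for can be modelled as a tiny state machine. This is a sketch of the idea with invented names and a fixed evaluation interval, not the rule engine's actual code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, one per evaluation, oldest first.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# 30s evaluations with for: 60s; one blip, then a sustained problem.
print(alert_states([True, False, True, True, True], 30, 60))
# ['pending', 'inactive', 'pending', 'pending', 'firing']
```

The single true evaluation at the start never pages anyone, while the sustained condition fires once it has held for the full duration.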

24) Alertmanager Mental Model

Prometheus detects. Alertmanager routes.

Prometheus asks:

Is this alert condition true?

Alertmanager asks:

Who should hear about it, when, and with what grouping?

Alertmanager handles:

  • grouping related alerts
  • deduplicating HA Prometheus alerts
  • silencing planned maintenance
  • inhibiting symptom alerts when a root-cause alert is firing
  • routing by team, severity, service, env, or cluster

Basic flow:

Prometheus rule fires
  -> Alertmanager receives alert
  -> groups by service/cluster
  -> applies inhibition and silences
  -> sends Slack/PagerDuty/email/webhook
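
Grouping is the step that turns many alerts into few notifications. A minimal sketch of grouping by label values follows; the alert dictionaries and the group_alerts function are invented for illustration, and real Alertmanager adds timing (group_wait, group_interval) on top of this.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts missing a label group under the empty string.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighErrorRate", "service": "api", "cluster": "eu"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "cluster": "eu"}},
    {"labels": {"alertname": "HighErrorRate", "service": "web", "cluster": "us"}},
]
grouped = group_alerts(alerts, ["service", "cluster"])
print({key: len(items) for key, items in grouped.items()})
# {('api', 'eu'): 2, ('web', 'us'): 1}
```

Three alerts become two notifications here; during a real outage the reduction is usually far larger, which is the point of choosing group_by labels carefully.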

25) Remote Write and Long-Term Storage

Prometheus local storage is intentionally local. It is excellent for short-to-medium retention and fast local queries, but it is not a distributed long-term warehouse.

When you need central storage, use remote write:

remote_write:
  - url: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "platform-prod"

Common remote-write destinations:

  • Grafana Mimir
  • Cortex
  • Thanos Receive
  • Grafana Cloud Metrics
  • other Prometheus-compatible remote-write backends

Remote write changes the architecture:

Prometheus scrapes locally
  -> writes local WAL
  -> remote_write queues samples
  -> central backend stores long retention
  -> Grafana queries central backend

Keep local Prometheus alerts for local safety checks even if you remote-write to a central backend.

26) Production Checklist

Scrape health

  • Watch up.
  • Watch scrape_duration_seconds.
  • Watch scrape_samples_scraped.
  • Keep scrape_timeout at or below scrape_interval (Prometheus rejects a config where the timeout is larger).
  • Avoid scraping huge endpoints too frequently.

TSDB health

  • Watch active series count.
  • Watch WAL and disk usage.
  • Watch compaction duration.
  • Set retention intentionally.
  • Use fast local disk.

Query health

  • Prefer recording rules for repeated expensive dashboard queries.
  • Avoid unbounded regexes over large label sets.
  • Aggregate before histogram_quantile.
  • Use dashboards that query recording rules where possible.

Label governance

  • Define allowed labels for app metrics.
  • Require route templates, not raw paths.
  • Drop high-cardinality labels at scrape time.
  • Keep job, service, cluster, and env consistent.

Alert quality

  • Page on symptoms users feel, not every internal signal.
  • Use for durations.
  • Add runbook links.
  • Route by team and severity.
  • Test rules with promtool.

27) Master Architecture

For a mature setup, the clean shape is:

apps/exporters/blackbox
  -> Prometheus scrape
  -> relabel and metric relabel
  -> local TSDB
  -> local rules for safety alerts
  -> Alertmanager

Prometheus remote_write
  -> Mimir/Thanos Receive/Cortex
  -> long retention and global dashboards

Grafana
  -> PromQL dashboards
  -> alert views
  -> drilldown by service, cluster, team, and route

The clean responsibility split:

  • Prometheus owns local scrape, local storage, local rules, and fast troubleshooting.
  • Alertmanager owns notification routing.
  • Grafana owns dashboards and exploration.
  • Mimir/Thanos/Cortex own long-term or global metrics when Prometheus alone is not enough.

28) URL Monitor Architecture: UI + DB + Autosys + Blackbox + Prometheus

A URL monitoring product usually should not ask engineers to edit Prometheus YAML by hand every time they add a URL. A cleaner architecture is to make your application database the source of truth, then generate Prometheus Blackbox targets from it.

End-to-end flow:

User
  -> URL Monitor UI
  -> CRUD API
  -> Monitor DB source of truth
  -> Autosys scheduled sync job
  -> generated file_sd JSON
  -> Prometheus url-monitors scrape job
  -> Blackbox Exporter /probe
  -> real URL target
  -> probe_* metrics
  -> Prometheus rules
  -> Alertmanager
  -> Slack/PagerDuty/email/webhook

The key design choice: Prometheus should scrape generated desired state; the DB should remain the source of truth.

Monitor DB Record

The UI creates, updates, pauses, and deletes monitor records. The DB row should include ownership and alert metadata, not just the URL.

{
  "id": "mon_1001",
  "name": "checkout health",
  "url": "https://checkout.example.com/health",
  "module": "http_2xx",
  "interval_seconds": 30,
  "enabled": true,
  "team": "checkout",
  "severity": "page",
  "environment": "prod",
  "expected_status": 200,
  "timeout_seconds": 5
}

Autosys Sync Job

Autosys can run a scheduled reconciliation job:

every 5 minutes:
  read enabled monitors from DB
  validate URL/module/team/severity
  render file_sd JSON
  write to temp file
  atomically rename temp file into Prometheus file_sd directory
  reload Prometheus or let file_sd refresh pick it up
  emit sync metrics and logs

The sync job should be idempotent. If the same DB state is read twice, it should produce the same target file.
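
A minimal sketch of the render and atomic-write steps, assuming monitor rows shaped like the DB record above. The function names and the choice of which fields become labels are assumptions for this example; reading from the DB, validation, and sync metrics are left out.

```python
import json
import os
import tempfile


def render_file_sd(monitors):
    """Render enabled monitors as file_sd JSON, deterministically."""
    groups = []
    # Sorting by id keeps the output identical for identical DB state.
    for monitor in sorted(monitors, key=lambda m: m["id"]):
        if not monitor.get("enabled"):
            continue
        groups.append({
            "targets": [monitor["url"]],
            "labels": {
                "monitor_id": monitor["id"],
                "team": monitor["team"],
                "severity": monitor["severity"],
                "environment": monitor["environment"],
                "module": monitor["module"],
            },
        })
    return json.dumps(groups, indent=2, sort_keys=True) + "\n"


def write_atomically(path, content):
    """Write to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps a *.json file_sd glob from matching a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        # Assumption: Prometheus runs as another user and needs read access.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic when both are on one filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the temp file into the same directory matters: a rename is only atomic within one filesystem, so Prometheus sees either the old file or the complete new one, never a half-written one.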

Generated file_sd JSON

Prometheus file service discovery expects a list of target groups.

[
  {
    "targets": ["https://checkout.example.com/health"],
    "labels": {
      "monitor_id": "mon_1001",
      "monitor_name": "checkout_health",
      "team": "checkout",
      "severity": "page",
      "environment": "prod",
      "module": "http_2xx"
    }
  }
]

The labels become alert routing metadata later.

Prometheus Scrape Config for Blackbox

Prometheus does not scrape the URL directly. It scrapes Blackbox Exporter and passes the URL as __param_target.

scrape_configs:
  - job_name: "url-monitors"
    metrics_path: /probe
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/url-monitors.json

    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target

      - source_labels: [module]
        target_label: __param_module

      - source_labels: [__param_target]
        target_label: instance

      - target_label: __address__
        replacement: blackbox-exporter:9115

Read the last relabeling block carefully:

  • original target is the real URL
  • __param_target becomes the URL Blackbox should probe
  • __param_module becomes the Blackbox module, such as http_2xx
  • __address__ is replaced with the Blackbox Exporter address

So Prometheus calls (query parameters shown unencoded for readability):

http://blackbox-exporter:9115/probe?target=https://checkout.example.com/health&module=http_2xx

Blackbox then probes:

https://checkout.example.com/health
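
If you want to reproduce the probe request by hand, for example to debug a module from a script, the target URL needs to be URL-encoded as a query parameter. A small sketch using the standard library follows; probe_url is a hypothetical helper and the exporter address is the one assumed in the config above.

```python
from urllib.parse import urlencode


def probe_url(exporter_address: str, target: str, module: str) -> str:
    """Build the /probe URL for a Blackbox Exporter instance."""
    # urlencode percent-encodes ':' and '/' inside the target URL.
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter_address}/probe?{query}"


print(probe_url(
    "blackbox-exporter:9115",
    "https://checkout.example.com/health",
    "http_2xx",
))
# http://blackbox-exporter:9115/probe?target=https%3A%2F%2Fcheckout.example.com%2Fhealth&module=http_2xx
```

Fetching that URL returns the probe_* metrics for a single probe run, which is a quick way to check a module before wiring it into a scrape job.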

Blackbox Exporter Module

Example module:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: [200]
      preferred_ip_protocol: ip4
      fail_if_ssl: false
      fail_if_not_ssl: true

Useful PromQL

Down monitors:

probe_success{job="url-monitors"} == 0

Unexpected status:

probe_http_status_code{job="url-monitors"} != 200

Slow probes:

probe_duration_seconds{job="url-monitors"} > 2

Availability over 30 minutes:

avg_over_time(probe_success{job="url-monitors"}[30m])

Group by team:

avg by (team, environment) (
  probe_success{job="url-monitors"}
)

Prometheus Alert Rules

groups:
  - name: url-monitor-alerts
    interval: 30s
    rules:
      - alert: UrlMonitorDown
        expr: probe_success{job="url-monitors"} == 0
        for: 3m
        labels:
          severity: "{{ $labels.severity }}"
          team: "{{ $labels.team }}"
          monitor_id: "{{ $labels.monitor_id }}"
          environment: "{{ $labels.environment }}"
        annotations:
          summary: "URL monitor failed: {{ $labels.monitor_name }}"
          description: "{{ $labels.instance }} has failed blackbox probes for 3 minutes."

      - alert: UrlMonitorUnexpectedStatus
        expr: probe_http_status_code{job="url-monitors"} != 200
        for: 3m
        labels:
          severity: "{{ $labels.severity }}"
          team: "{{ $labels.team }}"
        annotations:
          summary: "Unexpected status for {{ $labels.monitor_name }}"
          description: "{{ $labels.instance }} returned HTTP {{ $value }}."

      - alert: UrlMonitorSlow
        expr: probe_duration_seconds{job="url-monitors"} > 2
        for: 5m
        labels:
          severity: warning
          team: "{{ $labels.team }}"
        annotations:
          summary: "URL monitor is slow: {{ $labels.monitor_name }}"

Alertmanager Route

route:
  group_by: ["team", "alertname", "environment"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: default
  routes:
    - matchers:
        - team="checkout"
        - severity="page"
      receiver: checkout-pagerduty

receivers:
  - name: default
    slack_configs:
      - channel: "#observability"

  - name: checkout-pagerduty
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/pagerduty-checkout-key
        description: "{{ .CommonAnnotations.summary }}"

Alert Payload Shape

Alertmanager receives a structured alert like this:

{
  "status": "firing",
  "labels": {
    "alertname": "UrlMonitorDown",
    "monitor_id": "mon_1001",
    "monitor_name": "checkout_health",
    "team": "checkout",
    "severity": "page",
    "environment": "prod",
    "instance": "https://checkout.example.com/health"
  },
  "annotations": {
    "summary": "URL monitor failed: checkout_health",
    "description": "https://checkout.example.com/health has failed blackbox probes for 3 minutes."
  }
}

Operational Notes

  • The UI should only write to the DB, not directly edit Prometheus config.
  • Autosys should generate config from DB desired state.
  • Prometheus should consume generated file_sd targets.
  • Blackbox Exporter should do the real HTTP/TCP/ICMP probe.
  • Labels from the DB should drive Alertmanager routing.
  • Keep monitor labels low-cardinality: team, environment, monitor_id, monitor_name, severity.
  • Store detailed history in the app DB if the product needs audit, ownership, or CRUD history.
