Jan 31, 2026

Observability Ecosystem Cheat Sheets: Prometheus, Grafana, Alloy, Loki, Mimir, Thanos, Tempo, and Alertmanager

The observability ecosystem is powerful because the tools are composable. It is also confusing because every tool sounds like it can do "monitoring."

This is the cheat sheet I wish I had at the beginning: what each tool does, when to use it, what data it owns, how it connects to the others, and which query language/config shape belongs where.


The Ecosystem in One Mental Model

Applications, nodes, clusters, jobs
  -> emit metrics, logs, traces, profiles
  -> collectors/exporters gather telemetry
  -> storage backends retain and index it
  -> query engines analyze it
  -> rules create alerts
  -> Alertmanager routes notifications
  -> Grafana visualizes and ties it together

The clean split:

| Need | Tool family |
| --- | --- |
| Metrics scrape and PromQL | Prometheus |
| Central long-term metrics | Mimir, Cortex, Thanos |
| Logs and LogQL | Loki |
| Traces | Tempo |
| Profiles | Pyroscope |
| Collection/agent pipelines | Alloy, OpenTelemetry Collector, exporters |
| Synthetic URL/TCP/ICMP checks | Blackbox Exporter |
| Dashboards and exploration | Grafana |
| Notification routing | Alertmanager |

Fast Decision Cheat Sheet

If You Need Metrics

Use Prometheus when:

  • you need local scrape-based metrics
  • you want PromQL
  • you need alerting rules close to the source
  • you are starting small or per-cluster

Use Mimir when:

  • you need central multi-tenant long-term metrics
  • many Prometheus/Alloy instances remote-write into one backend
  • you need global PromQL across clusters
  • you need Mimir Ruler for centralized rules

Use Thanos when:

  • you already have Prometheus everywhere
  • you want global query and object storage around existing Prometheus
  • you want to keep cluster-local Prometheus ownership

Use Cortex when:

  • you already operate Cortex
  • compatibility matters more than greenfield simplicity
  • migration to Mimir is not worth it yet

If You Need Logs

Use Loki.

Loki stores labeled log streams. Query with LogQL.

{namespace="prod", app="checkout"} |= "error"

Keep Loki labels low-cardinality. Put request IDs, user IDs, and raw exception bodies in the log line, not stream labels.
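
A minimal sketch of that split in Python, assuming each log event arrives as a flat dict. The `ALLOWED_LABELS` allow-list and the `split_event` helper are hypothetical names for illustration, not part of Loki or Alloy:

```python
import json

# Hypothetical allow-list of bounded, low-cardinality stream labels.
ALLOWED_LABELS = {"namespace", "app", "environment"}


def split_event(event: dict) -> tuple:
    """Return (stream_labels, log_line) for one log event.

    Fields on the allow-list become stream labels; everything else
    (request IDs, user IDs, exception text) is serialized into the line.
    """
    labels = {k: str(v) for k, v in event.items() if k in ALLOWED_LABELS}
    body = {k: v for k, v in event.items() if k not in ALLOWED_LABELS}
    return labels, json.dumps(body, sort_keys=True)


labels, line = split_event({
    "namespace": "prod",
    "app": "checkout",
    "request_id": "req-123",
    "msg": "payment error",
})
print(labels)
print(line)
```

The request ID stays searchable with a line filter, but it no longer creates a new stream per request.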

If You Need Traces

Use Tempo.

Tempo stores traces and is optimized around trace ID lookup and trace exploration. Traces answer:

What happened across services for this request?

Metrics answer:

How often is this happening?

Logs answer:

What did the service say happened?

If You Need Profiles

Use Pyroscope.

Profiles answer:

Where is CPU, memory, allocation, or lock time going?

Metrics show the symptom. Profiles show where code spends resources.

If You Need Collection

Use Grafana Alloy or OpenTelemetry Collector.

Alloy is useful when you want:

  • Prometheus-native metric pipelines
  • Loki-native log pipelines
  • OTLP receiving/exporting
  • one collector for multiple signals
  • Grafana ecosystem integration

Common Alloy shape:

prometheus.scrape
  -> prometheus.remote_write
  -> Mimir / Prometheus / Grafana Cloud Metrics

loki.source.file
  -> loki.process
  -> loki.write
  -> Loki

otelcol.receiver.otlp
  -> processors
  -> exporters
  -> Tempo / Mimir / Loki

Tool Cheat Sheets

Prometheus

Owns:

  • scrape config
  • local TSDB
  • PromQL
  • recording rules
  • alerting rules
  • local alert delivery to Alertmanager

Core config:

global:
  scrape_interval: 30s

rule_files:
  - /etc/prometheus/rules/*.yaml

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

Common PromQL:

up == 0

sum by (service) (
  rate(http_requests_total[5m])
)

histogram_quantile(
  0.95,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
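
To make the `histogram_quantile` query less magical, here is a simplified Python sketch of the estimate for a classic histogram: find the bucket that contains the target rank, then interpolate linearly inside it. It assumes `0 < q <= 1`, at least two buckets, and a final `+Inf` bucket. The bucket values are made up, and this is an illustration rather than Prometheus's actual implementation:

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified: assumes a well-formed classic histogram whose last bucket
    is le="+Inf", and linear interpolation inside the selected bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Interpolate linearly between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative bucket counts for request duration in seconds.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example_buckets))
```

The takeaway: the result is only as precise as the bucket boundaries, so pick `le` boundaries near the latencies you alert on.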

Do:

  • keep labels bounded
  • use recording rules for repeated expensive queries
  • keep local safety alerts in Prometheus
  • monitor scrape health and active series

Avoid:

  • raw user IDs as labels
  • raw URL paths as labels
  • huge scrape payloads
  • long expensive dashboard queries without recording rules

Grafana

Owns:

  • dashboards
  • Explore
  • data source UX
  • alert UI
  • correlations
  • incident views

Grafana does not store Prometheus metrics or Loki logs by itself. It queries backends.

Typical data sources:

  • Prometheus or Mimir for metrics
  • Loki for logs
  • Tempo for traces
  • Pyroscope for profiles
  • Alertmanager for alert state

Grafana Alloy

Owns:

  • telemetry collection
  • scraping
  • log tailing
  • OTLP receiving
  • local processing
  • forwarding

Metrics example:

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}

prometheus.scrape "apps" {
  targets = [{
    "__address__" = "checkout:8080",
    "job" = "checkout",
  }]

  forward_to = [prometheus.remote_write.mimir.receiver]
}

Logs example:

loki.write "default" {
  endpoint {
    url = "https://loki.example.com/loki/api/v1/push"
  }
}

loki.source.file "app" {
  targets = [{
    __path__ = "/var/log/app.log",
    job = "checkout",
  }]

  forward_to = [loki.write.default.receiver]
}

Loki

Owns:

  • log ingestion
  • log stream indexing
  • LogQL
  • Loki Ruler

Query examples:

{app="checkout"} |= "error"

sum by (app) (
  count_over_time({namespace="prod"} |= "panic" [5m])
)
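
As a rough mental model of the second query, this Python sketch counts matching lines per `app` inside a time window. The `ENTRIES` data and the `panic_count_by_app` helper are hypothetical, and the exact window boundaries are an assumption; Loki evaluates this inside its query engine, not like this:

```python
from collections import Counter

# Hypothetical in-memory log entries: (unix_seconds, stream_labels, line).
ENTRIES = [
    (100, {"namespace": "prod", "app": "checkout"}, "panic: nil pointer"),
    (200, {"namespace": "prod", "app": "checkout"}, "request ok"),
    (250, {"namespace": "prod", "app": "cart"}, "panic: out of range"),
    (280, {"namespace": "dev", "app": "cart"}, "panic: test"),
    (900, {"namespace": "prod", "app": "cart"}, "panic: too late"),
]


def panic_count_by_app(entries, now: int, window: int = 300) -> dict:
    """Count prod lines containing "panic" in (now - window, now], per app."""
    counts = Counter()
    for ts, labels, line in entries:
        in_window = now - window < ts <= now
        if in_window and labels.get("namespace") == "prod" and "panic" in line:
            counts[labels["app"]] += 1
    return dict(counts)


print(panic_count_by_app(ENTRIES, now=300))
```

The stream selector narrows by labels first, the line filter narrows by content second, and only then does the aggregation count.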

Do:

  • use labels for stable stream identity
  • keep high-cardinality data in log body
  • use Loki Ruler for log-native alerts

Avoid:

  • request_id as a Loki label
  • user_id as a Loki label
  • full log message as a label

Mimir

Owns:

  • central Prometheus-compatible metrics storage
  • tenant isolation
  • long retention
  • scalable query path
  • Mimir Ruler
  • Mimir Alertmanager in some deployments

Typical flow:

Prometheus / Alloy / OTel Collector
  -> remote_write
  -> Mimir distributor
  -> ingester
  -> object storage
  -> store-gateway
  -> querier/query-frontend

Use Mimir Ruler for:

  • global SLOs
  • central recording rules
  • tenant-wide alerts
  • multi-cluster service health

Thanos

Owns:

  • global query across Prometheus instances
  • object-storage-backed historical metrics
  • deduplication of HA Prometheus pairs
  • store gateway and compaction

Common flow:

Prometheus
  -> Thanos sidecar
  -> object storage
  -> Thanos store gateway
  -> Thanos query
  -> Grafana

Choose Thanos when you want to keep Prometheus local and add a global query layer.
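
The idea behind HA-pair deduplication can be sketched in Python: drop the replica label, then merge series that become identical. This is a deliberately naive illustration that assumes both replicas share timestamps and that the replica label is called `replica`; Thanos's real deduplication is more involved, and `dedup_series` is a made-up helper:

```python
def dedup_series(series: list, replica_label: str = "replica") -> dict:
    """Merge series that differ only by the replica label.

    Each input is {"labels": {...}, "samples": [(timestamp, value), ...]}.
    Simplification: for a given timestamp the first replica seen wins.
    """
    merged: dict = {}
    for s in series:
        key = tuple(sorted(
            (k, v) for k, v in s["labels"].items() if k != replica_label
        ))
        by_ts = merged.setdefault(key, {})
        for ts, value in s["samples"]:
            by_ts.setdefault(ts, value)
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


# Hypothetical HA pair: replica "a" missed t=30, replica "b" missed t=10.
ha_pair = [
    {"labels": {"job": "node", "replica": "a"}, "samples": [(10, 1.0), (20, 2.0)]},
    {"labels": {"job": "node", "replica": "b"}, "samples": [(20, 2.0), (30, 3.0)]},
]
print(dedup_series(ha_pair))
```

One replica's gap is filled by the other, which is the practical reason to run Prometheus in pairs.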

Cortex

Owns:

  • central remote-write metrics backend
  • multi-tenancy
  • Prometheus-compatible querying
  • Cortex Ruler and Alertmanager integrations

For most greenfield Grafana ecosystem deployments, compare Cortex carefully against Mimir. Cortex is still relevant where it is already deployed or compatibility is required.

Alertmanager

Owns:

  • grouping
  • deduplication
  • silences
  • inhibition
  • notification routing

Does not own:

  • PromQL evaluation
  • LogQL evaluation
  • metric storage
  • log storage

Route example:

route:
  group_by: ["team", "alertname", "environment"]
  receiver: default
  routes:
    - matchers:
        - team="checkout"
        - severity="page"
      receiver: checkout-pagerduty
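
A small Python sketch of how that route tree behaves, assuming equality matchers only, a single level of child routes, and no `continue`. `ROUTE`, `pick_receiver`, and `group_key` are hypothetical names that mirror the YAML above, not Alertmanager internals:

```python
# Hypothetical in-memory mirror of the route tree above.
ROUTE = {
    "receiver": "default",
    "group_by": ["team", "alertname", "environment"],
    "routes": [
        {
            "matchers": {"team": "checkout", "severity": "page"},
            "receiver": "checkout-pagerduty",
        },
    ],
}


def pick_receiver(route: dict, labels: dict) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child["matchers"].items()):
            return child["receiver"]
    # No child matched: fall back to the root receiver.
    return route["receiver"]


def group_key(route: dict, labels: dict) -> tuple:
    """Alerts with the same values for the group_by labels share a group."""
    return tuple(labels.get(name, "") for name in route["group_by"])


alert = {
    "alertname": "HighErrorRate",
    "team": "checkout",
    "severity": "page",
    "environment": "prod",
}
print(pick_receiver(ROUTE, alert), group_key(ROUTE, alert))
```

Every matcher on a route must match, so a checkout alert with a non-page severity falls through to `default`.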

Blackbox Exporter

Owns:

  • synthetic HTTP probes
  • TCP connect checks
  • ICMP checks
  • TLS checks

Prometheus scrape model:

Prometheus
  -> /probe?target=https://example.com&module=http_2xx
  -> Blackbox Exporter
  -> real target URL

PromQL:

probe_success{job="url-monitors"} == 0
probe_duration_seconds{job="url-monitors"} > 2
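
The probe request in the scrape model is just a URL with the real target passed as a query parameter. A minimal Python sketch that builds it, assuming the exporter is reachable at `blackbox-exporter:9115` (that hostname and the `probe_url` helper are placeholders):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def probe_url(exporter: str, target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL that Prometheus requests from the exporter."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


# Assumed exporter address; substitute your own service name and port.
url = probe_url("blackbox-exporter:9115", "https://example.com")
print(url)
```

Handy for debugging: request the same URL by hand and read the `probe_*` metrics the exporter returns for that target.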

OpenTelemetry

Owns:

  • instrumentation APIs
  • semantic conventions
  • OTLP protocol
  • traces, metrics, logs signal model
  • collector pipeline concepts

Useful when:

  • you need vendor-neutral instrumentation
  • traces are first-class
  • multiple languages/services need consistent telemetry
  • you want OTLP as a standard transport

Signal Cheat Sheet

| Signal | Best for | Query language | Backend |
| --- | --- | --- | --- |
| Metrics | rates, counts, SLOs, alerts | PromQL | Prometheus, Mimir, Thanos, Cortex |
| Logs | events, errors, audit, text search | LogQL | Loki |
| Traces | request path across services | TraceQL / trace lookup | Tempo |
| Profiles | code-level resource cost | profile queries | Pyroscope |
| Synthetic probes | outside-in uptime | PromQL over probe metrics | Blackbox Exporter + Prometheus |

Rule Cheat Sheet

Recording Rule

Creates a new metric.

- record: service:http_requests:rate5m
  expr: |
    sum by (service, status) (
      rate(http_requests_total[5m])
    )

Use for:

  • expensive repeated PromQL
  • dashboards
  • SLO base metrics
  • shared derived metrics
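
To see what `sum by (service, status)` does to the label set, here is a rough Python sketch over already-computed per-series rates. The `RATES` data and the `sum_by` helper are made up for illustration; this is not how Prometheus evaluates rules internally:

```python
from collections import defaultdict

# Hypothetical per-series rates (requests/second), as rate() might return them.
RATES = [
    ({"service": "checkout", "status": "200", "instance": "a"}, 1.5),
    ({"service": "checkout", "status": "200", "instance": "b"}, 2.5),
    ({"service": "checkout", "status": "500", "instance": "a"}, 0.25),
]


def sum_by(series, keys):
    """Sum values of series that share the same values for `keys`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(k, "") for k in keys)] += value
    return dict(totals)


recorded = sum_by(RATES, ["service", "status"])
print(recorded)
```

Three input series collapse into two recorded series because `instance` is aggregated away, which is why dashboards built on the recorded metric are cheaper to query.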

Alerting Rule

Creates an alert.

- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.02
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High error rate"

Use for:

  • notifying humans
  • triggering automation
  • routeable incident signals
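
The `for: 10m` clause is the part people most often misread, so here is a simplified Python sketch of the pending-to-firing behavior. It assumes evenly spaced evaluations and ignores options such as `keep_firing_for`; `alert_states` is a hypothetical helper, not Prometheus code:

```python
def alert_states(breaches: list, interval_s: int, for_s: int) -> list:
    """Return the alert state after each rule evaluation.

    breaches[i] is True when the expression returned a result at
    evaluation i. The alert is pending until the condition has held
    continuously for for_s seconds, then firing.
    """
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, breached in enumerate(breaches):
        now = i * interval_s
        if not breached:
            # Any evaluation without a result resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One evaluation per minute with for: 10m (600 seconds).
print(alert_states([True] * 11, interval_s=60, for_s=600))
```

A single clean evaluation in the middle resets the timer, so a flapping condition may never page at all.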

Label Cheat Sheet

Good labels:

  • service
  • job
  • instance
  • cluster
  • namespace
  • team
  • environment
  • status
  • method

Dangerous labels:

  • user_id
  • request_id
  • session_id
  • raw URL path
  • full exception text
  • log message

Rule:

Labels are for bounded dimensions you aggregate or route by.
High-cardinality data belongs in logs, traces, exemplars, or annotations.
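
A back-of-the-envelope Python sketch of why this rule matters: the worst-case series count for one metric is the product of the distinct values of each label. The label counts below are invented for illustration:

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


# Hypothetical distinct-value counts per label.
bounded = {"service": 20, "status": 5, "method": 4}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 400
print(max_series(with_user_id))  # 40000000
```

Real series counts are usually below this bound because not every combination occurs, but one unbounded label is enough to dominate the total.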

Common Architectures

Small Team

Prometheus + Grafana + Alertmanager

Add Loki when logs become important.

Kubernetes Platform

Alloy / exporters
  -> Prometheus or Mimir for metrics
  -> Loki for logs
  -> Tempo for traces
  -> Grafana for dashboards
  -> Alertmanager for notifications

Central Observability Platform

Alloy everywhere
  -> Mimir for metrics
  -> Loki for logs
  -> Tempo for traces
  -> Pyroscope for profiles
  -> Grafana for UX
  -> Alertmanager for notifications

Existing Prometheus Fleet

Prometheus per cluster
  -> Thanos sidecar
  -> object storage
  -> Thanos query
  -> Grafana

Final Cheat Sheet

  • Prometheus scrapes metrics.
  • Mimir stores central long-term metrics.
  • Thanos gives global query over Prometheus fleets.
  • Cortex is the older central multi-tenant metrics backend that Mimir was forked from.
  • Loki stores logs.
  • Tempo stores traces.
  • Pyroscope stores profiles.
  • Alloy collects and forwards telemetry.
  • OpenTelemetry standardizes instrumentation and transport.
  • Blackbox Exporter probes URLs, ports, ICMP, and TLS.
  • Alertmanager routes notifications.
  • Grafana is the interface tying it together.
