Jan 31, 2026

Observability Ecosystem Cheat Sheets: Prometheus, Grafana, Alloy, Loki, Mimir, Thanos, Tempo, and Alertmanager

The observability ecosystem is powerful because the tools are composable. It is also confusing because every tool sounds like it can do "monitoring."

This is the cheat sheet I wish I had at the beginning: what each tool does, when to use it, what data it owns, how it connects to the others, and which query language/config shape belongs where.


The Ecosystem in One Mental Model

Applications, nodes, clusters, jobs
  -> emit metrics, logs, traces, profiles
  -> collectors/exporters gather telemetry
  -> storage backends retain and index it
  -> query engines analyze it
  -> rules create alerts
  -> Alertmanager routes notifications
  -> Grafana visualizes and ties it together

The clean split:

Need	Tool family
Metrics scrape and PromQL	Prometheus
Central long-term metrics	Mimir, Cortex, Thanos
Logs and LogQL	Loki
Traces	Tempo
Profiles	Pyroscope
Collection/agent pipelines	Alloy, OpenTelemetry Collector, exporters
Synthetic URL/TCP/ICMP checks	Blackbox Exporter
Dashboards and exploration	Grafana
Notification routing	Alertmanager

Fast Decision Cheat Sheet

If You Need Metrics

Use Prometheus when:

you need local scrape-based metrics
you want PromQL
you need alerting rules close to the source
you are starting small or per-cluster

Use Mimir when:

you need central multi-tenant long-term metrics
many Prometheus/Alloy instances remote-write into one backend
you need global PromQL across clusters
you need Mimir Ruler for centralized rules

Use Thanos when:

you already have Prometheus everywhere
you want to add global query and object storage around existing Prometheus servers
you want to keep cluster-local Prometheus ownership

Use Cortex when:

you already operate Cortex
compatibility matters more than greenfield simplicity
migration to Mimir is not worth it yet

If You Need Logs

Use Loki.

Loki stores labeled log streams. Query with LogQL.

{namespace="prod", app="checkout"} |= "error"

Keep Loki labels low-cardinality. Put request IDs, user IDs, and raw exception bodies in the log line, not stream labels.

If You Need Traces

Use Tempo.

Tempo stores traces and is optimized around trace ID lookup and trace exploration. Traces answer:

What happened across services for this request?

Metrics answer:

How often is this happening?

Logs answer:

What did the service say happened?

If You Need Profiles

Use Pyroscope.

Profiles answer:

Where is CPU, memory, allocation, or lock time going?

Metrics show the symptom. Profiles show where code spends resources.

If You Need Collection

Use Grafana Alloy or OpenTelemetry Collector.

Alloy is useful when you want:

Prometheus-native metric pipelines
Loki-native log pipelines
OTLP receiving/exporting
one collector for multiple signals
Grafana ecosystem integration

Common Alloy shape:

prometheus.scrape
  -> prometheus.remote_write
  -> Mimir / Prometheus / Grafana Cloud Metrics

loki.source.file
  -> loki.process
  -> loki.write
  -> Loki

otelcol.receiver.otlp
  -> processors
  -> exporters
  -> Tempo / Mimir / Loki

Tool Cheat Sheets

Prometheus

Owns:

scrape config
local TSDB
PromQL
recording rules
alerting rules
local alert delivery to Alertmanager

Core config:

global:
  scrape_interval: 30s

rule_files:
  - /etc/prometheus/rules/*.yaml

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
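The scrape config only says where to pull from. What Prometheus expects back from the target is plain text in the exposition format. Here is a minimal Python sketch of what a target might serve; the function names and the sample counter are illustrative assumptions, and a real service would normally use an official client library rather than hand-rolled formatting.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter family as text exposition lines.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{pairs}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data, only to show the output shape.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "GET", "status": "200"}, 1027)],
    ))
```

Each distinct label combination becomes its own line, and therefore its own time series.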

Common PromQL:

up == 0

sum by (service) (
  rate(http_requests_total[5m])
)

histogram_quantile(
  0.95,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
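It helps to know that histogram_quantile estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The Python sketch below is a simplified illustration of that idea for classic histograms, not Prometheus's actual code; the function name mirrors the PromQL function only for readability, and the bucket values in the usage example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Simplified: assumes 0 < q <= 1, cumulative counts, and a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    # The first bucket is assumed to start at 0.
    lower, prev_count = (0.0, 0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    example = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
    print(histogram_quantile(0.95, example))
```

The takeaway is that the result can never be more precise than the bucket boundaries, so bucket layout matters as much as the query.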

Do:

keep labels bounded
use recording rules for repeated expensive queries
keep local safety alerts in Prometheus
monitor scrape health and active series

Avoid:

raw user IDs as labels
raw URL paths as labels
huge scrape payloads
long expensive dashboard queries without recording rules

Grafana

Owns:

dashboards
Explore
data source UX
alert UI
correlations
incident views

Grafana does not store Prometheus metrics or Loki logs by itself. It queries backends.

Typical data sources:

Prometheus or Mimir for metrics
Loki for logs
Tempo for traces
Pyroscope for profiles
Alertmanager for alert state

Grafana Alloy

Owns:

telemetry collection
scraping
log tailing
OTLP receiving
local processing
forwarding

Metrics example:

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}

prometheus.scrape "apps" {
  targets = [{
    "__address__" = "checkout:8080",
    "job" = "checkout",
  }]

  forward_to = [prometheus.remote_write.mimir.receiver]
}

Logs example:

loki.write "default" {
  endpoint {
    url = "https://loki.example.com/loki/api/v1/push"
  }
}

loki.source.file "app" {
  targets = [{
    __path__ = "/var/log/app.log",
    job = "checkout",
  }]

  forward_to = [loki.write.default.receiver]
}

Loki

Owns:

log ingestion
log stream indexing
LogQL
Loki Ruler

Query examples:

{app="checkout"} |= "error"

sum by (app) (
  count_over_time({namespace="prod"} |= "panic" [5m])
)

Do:

use labels for stable stream identity
keep high-cardinality data in log body
use Loki Ruler for log-native alerts

Avoid:

request_id as a Loki label
user_id as a Loki label
full log message as a label

Mimir

Owns:

central Prometheus-compatible metrics storage
tenant isolation
long retention
scalable query path
Mimir Ruler
Mimir Alertmanager in some deployments

Typical flow:

Write path:

Prometheus / Alloy / OTel Collector
  -> remote_write
  -> Mimir distributor
  -> ingester
  -> object storage

Read path:

Grafana
  -> query-frontend
  -> querier
  -> ingester (recent samples) + store-gateway (blocks in object storage)
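Tenant isolation shows up on every request: with multi-tenancy enabled, Mimir identifies the tenant from the X-Scope-OrgID header. The Python sketch below only builds the URL and headers for an instant query; it sends nothing. The /prometheus prefix is Mimir's default Prometheus API prefix and may differ in your deployment, and the helper name, host, and tenant ID are assumptions.

```python
from urllib.parse import urlencode


def build_instant_query(base_url: str, tenant: str, promql: str):
    """Return (url, headers) for a Prometheus-compatible instant query."""
    params = urlencode({"query": promql})
    # Assumes the default Prometheus HTTP prefix; adjust if yours is customized.
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{params}"
    # The tenant ID travels as a header, not as part of the query.
    headers = {"X-Scope-OrgID": tenant}
    return url, headers


if __name__ == "__main__":
    # Hypothetical host and tenant, for illustration only.
    print(build_instant_query("https://mimir.example.com", "team-checkout", "sum(up)"))
```

The same PromQL run with a different tenant header sees a completely different set of series.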

Use Mimir Ruler for:

global SLOs
central recording rules
tenant-wide alerts
multi-cluster service health

Thanos

Owns:

global query across Prometheus instances
object-storage-backed historical metrics
deduplication of HA Prometheus pairs
store gateway and compaction

Common flow:

Prometheus
  -> Thanos sidecar
  -> object storage
  -> Thanos store gateway
  -> Thanos query
  -> Grafana

Choose Thanos when you want to keep Prometheus local and add a global query layer.
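Deduplication of HA pairs rests on one idea: two series that differ only in a replica label are treated as the same series at query time (Thanos Query takes the label name via --query.replica-label). The Python sketch below shows just that grouping step. It is deliberately naive: it assumes replicas share timestamps and lets the first value win, while Thanos's real algorithm handles unaligned scrapes and gaps. The function names and the "replica" label name are assumptions.

```python
def dedup_key(labels: dict, replica_label: str = "replica") -> tuple:
    """Series identity with the replica label removed."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def merge_replicas(series: list, replica_label: str = "replica") -> dict:
    """Collapse series that differ only by the replica label.

    series: list of (labels_dict, [(timestamp, value), ...]).
    """
    merged = {}
    for labels, samples in series:
        points = merged.setdefault(dedup_key(labels, replica_label), {})
        for ts, value in samples:
            # Naive rule: the first replica to supply a timestamp wins.
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}
```

This is also why each Prometheus in an HA pair needs a distinct external replica label: without it there is nothing to strip.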

Cortex

Owns:

central remote-write metrics backend
multi-tenancy
Prometheus-compatible querying
Cortex Ruler and Alertmanager integrations

For most greenfield Grafana ecosystem deployments, compare Cortex carefully against Mimir. Cortex is still relevant where it is already deployed or compatibility is required.

Alertmanager

Owns:

grouping
deduplication
silences
inhibition
notification routing

Does not own:

PromQL evaluation
LogQL evaluation
metric storage
log storage

Route example:

route:
  group_by: ["team", "alertname", "environment"]
  receiver: default
  routes:
    - matchers:
        - team="checkout"
        - severity="page"
      receiver: checkout-pagerduty
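To reason about where an alert lands, picture the route tree as a depth-first search: the first child route whose matchers all match takes the alert, and if none match, the parent's receiver is used. The Python sketch below models only that behavior with equality matchers. It ignores regex matchers, `continue`, and grouping, and the dictionary layout is an assumption made for illustration, not Alertmanager's internal representation.

```python
def matches(route: dict, labels: dict) -> bool:
    """True when every equality matcher on the route matches the alert labels."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())


def pick_receiver(route: dict, labels: dict, inherited: str = "") -> str:
    """Walk the tree depth-first; the first matching child wins."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if matches(child, labels):
            return pick_receiver(child, labels, receiver)
    return receiver


if __name__ == "__main__":
    # Mirrors the route example above.
    tree = {
        "receiver": "default",
        "routes": [
            {
                "matchers": {"team": "checkout", "severity": "page"},
                "receiver": "checkout-pagerduty",
            }
        ],
    }
    print(pick_receiver(tree, {"team": "checkout", "severity": "page"}))
```

An alert with team="checkout" but a different severity falls through to the default receiver, which is usually what you want for non-paging alerts.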

Blackbox Exporter

Owns:

synthetic HTTP probes
TCP connect checks
ICMP checks
TLS checks

Prometheus scrape model:

Prometheus
  -> /probe?target=https://example.com&module=http_2xx
  -> Blackbox Exporter
  -> real target URL
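The key detail in this model is that Prometheus scrapes the exporter, and the real target travels as a query parameter. A small Python sketch of the URL that ends up being requested is below. The exporter address blackbox-exporter:9115 and the helper name are assumptions; in a real Prometheus config this rewrite is typically done with relabeling that copies the target into __param_target and points __address__ at the exporter.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL that Prometheus scrapes on the exporter."""
    params = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{params}"


if __name__ == "__main__":
    # Hypothetical exporter address, for illustration only.
    print(probe_url("blackbox-exporter:9115", "https://example.com"))
```

Because the scrape hits the exporter, the `instance` label needs to be relabeled to the real target, or every probe will look like it came from the same instance.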

PromQL:

probe_success{job="url-monitors"} == 0
probe_duration_seconds{job="url-monitors"} > 2

OpenTelemetry

Owns:

instrumentation APIs
semantic conventions
OTLP protocol
traces, metrics, logs signal model
collector pipeline concepts

Useful when:

you need vendor-neutral instrumentation
traces are first-class
multiple languages/services need consistent telemetry
you want OTLP as a standard transport

Signal Cheat Sheet

Signal	Best for	Query language	Backend
Metrics	rates, counts, SLOs, alerts	PromQL	Prometheus, Mimir, Thanos, Cortex
Logs	events, errors, audit, text search	LogQL	Loki
Traces	request path across services	TraceQL / trace lookup	Tempo
Profiles	code-level resource cost	profile queries	Pyroscope
Synthetic probes	outside-in uptime	PromQL over probe metrics	Blackbox Exporter + Prometheus

Rule Cheat Sheet

Recording Rule

Creates a new metric.

groups:
  - name: http-recording   # group name is illustrative
    rules:
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, status) (
            rate(http_requests_total[5m])
          )

Use for:

expensive repeated PromQL
dashboards
SLO base metrics
shared derived metrics

Alerting Rule

Creates an alert.

groups:
  - name: http-alerts   # group name is illustrative
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error rate"

Use for:

notifying humans
triggering automation
routable incident signals
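The `for` clause is what keeps a brief spike from paging anyone: the alert sits in a pending state until the expression has been true on every evaluation for the full duration, and any false evaluation resets the clock. A minimal Python sketch of that state machine follows. It is a simplified model of the documented behavior, and the function name and input shape are assumptions.

```python
def alert_states(evaluations: list, for_seconds: int) -> list:
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), in time order.
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


if __name__ == "__main__":
    # Evaluations every 5 minutes with for: 10m (600 seconds).
    samples = [(0, True), (300, True), (600, True), (900, False), (1200, True)]
    print(alert_states(samples, 600))
```

Only firing alerts are sent to Alertmanager, so a longer `for` trades detection speed for fewer false pages.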

Label Cheat Sheet

Good labels:

service
job
instance
cluster
namespace
team
environment
status
method

Dangerous labels:

user_id
request_id
session_id
raw URL path
full exception text
log message

Rule:

Labels are for bounded dimensions you aggregate or route by.
High-cardinality data belongs in logs, traces, exemplars, or annotations.
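The reason behind this rule is multiplication: the worst-case number of series for one metric is the product of the distinct values of each label. A quick Python sketch of that arithmetic is below; the cardinality numbers are invented for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    bounded = {"service": 20, "status": 5, "method": 4}
    print(worst_case_series(bounded))                          # 400
    print(worst_case_series({**bounded, "user_id": 100_000}))  # 40000000
```

One unbounded label turns a few hundred series into tens of millions, which is why IDs belong in log lines and traces instead.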

Common Architectures

Small Team

Prometheus + Grafana + Alertmanager

Add Loki when logs become important.

Kubernetes Platform

Alloy / exporters
  -> Prometheus or Mimir for metrics
  -> Loki for logs
  -> Tempo for traces
  -> Grafana for dashboards
  -> Alertmanager for notifications

Central Observability Platform

Alloy everywhere
  -> Mimir for metrics
  -> Loki for logs
  -> Tempo for traces
  -> Pyroscope for profiles
  -> Grafana for UX
  -> Alertmanager for notifications

Existing Prometheus Fleet

Prometheus per cluster
  -> Thanos sidecar
  -> object storage
  -> Thanos query
  -> Grafana

Final Cheat Sheet

Prometheus scrapes metrics.
Mimir stores central long-term metrics.
Thanos gives global query over Prometheus fleets.
Cortex is the older central multi-tenant metrics backend that Mimir was forked from.
Loki stores logs.
Tempo stores traces.
Pyroscope stores profiles.
Alloy collects and forwards telemetry.
OpenTelemetry standardizes instrumentation and transport.
Blackbox Exporter probes URLs, ports, ICMP, and TLS.
Alertmanager routes notifications.
Grafana is the interface tying it together.
