Observability Ecosystem Cheat Sheets: Prometheus, Grafana, Alloy, Loki, Mimir, Thanos, Tempo, and Alertmanager
The observability ecosystem is powerful because the tools are composable. It is also confusing because every tool sounds like it can do "monitoring."
This is the cheat sheet I wish I had at the beginning: what each tool does, when to use it, what data it owns, how it connects to the others, and which query language/config shape belongs where.
The Ecosystem in One Mental Model
Applications, nodes, clusters, jobs
-> emit metrics, logs, traces, profiles
-> collectors/exporters gather telemetry
-> storage backends retain and index it
-> query engines analyze it
-> rules create alerts
-> Alertmanager routes notifications
-> Grafana visualizes and ties it together
The clean split:
| Need | Tool family |
|---|---|
| Metrics scrape and PromQL | Prometheus |
| Central long-term metrics | Mimir, Cortex, Thanos |
| Logs and LogQL | Loki |
| Traces | Tempo |
| Profiles | Pyroscope |
| Collection/agent pipelines | Alloy, OpenTelemetry Collector, exporters |
| Synthetic URL/TCP/ICMP checks | Blackbox Exporter |
| Dashboards and exploration | Grafana |
| Notification routing | Alertmanager |
Fast Decision Cheat Sheet
If You Need Metrics
Use Prometheus when:
- you need local scrape-based metrics
- you want PromQL
- you need alerting rules close to the source
- you are starting small or per-cluster
Use Mimir when:
- you need central multi-tenant long-term metrics
- many Prometheus/Alloy instances remote-write into one backend
- you need global PromQL across clusters
- you need Mimir Ruler for centralized rules
Use Thanos when:
- you already have Prometheus everywhere
- you want global query and object storage around existing Prometheus
- you want to keep cluster-local Prometheus ownership
Use Cortex when:
- you already operate Cortex
- compatibility matters more than greenfield simplicity
- migration to Mimir is not worth it yet
If You Need Logs
Use Loki.
Loki stores labeled log streams. Query with LogQL.
{namespace="prod", app="checkout"} |= "error"
Keep Loki labels low-cardinality. Put request IDs, user IDs, and raw exception bodies in the log line, not stream labels.
If You Need Traces
Use Tempo.
Tempo stores traces and is optimized around trace ID lookup and trace exploration. Traces answer:
What happened across services for this request?
Metrics answer:
How often is this happening?
Logs answer:
What did the service say happened?
If You Need Profiles
Use Pyroscope.
Profiles answer:
Where is CPU, memory, allocation, or lock time going?
Metrics show the symptom. Profiles show where code spends resources.
If You Need Collection
Use Grafana Alloy or OpenTelemetry Collector.
Alloy is useful when you want:
- Prometheus-native metric pipelines
- Loki-native log pipelines
- OTLP receiving/exporting
- one collector for multiple signals
- Grafana ecosystem integration
Common Alloy shape:
prometheus.scrape
-> prometheus.remote_write
-> Mimir / Prometheus / Grafana Cloud Metrics
loki.source.file
-> loki.process
-> loki.write
-> Loki
otelcol.receiver.otlp
-> processors
-> exporters
-> Tempo / Mimir / Loki
Tool Cheat Sheets
Prometheus
Owns:
- scrape config
- local TSDB
- PromQL
- recording rules
- alerting rules
- local alert delivery to Alertmanager
Core config:
global:
  scrape_interval: 30s
rule_files:
  - /etc/prometheus/rules/*.yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
Common PromQL:
up == 0
sum by (service) (
  rate(http_requests_total[5m])
)
histogram_quantile(
  0.95,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
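To make the `histogram_quantile` query above less magical, here is a minimal Python sketch of the idea behind it: find the bucket that contains the requested rank, then interpolate linearly inside it. It is a simplification written for illustration, not Prometheus's implementation. The function name and the `(upper_bound, cumulative_count)` input shape are assumptions of this sketch, and it assumes the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket (hypothetical input shape).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumes the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
```

The practical takeaway: the result is only as precise as the bucket boundaries, so pick bucket bounds close to the latency thresholds you care about.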
Do:
- keep labels bounded
- use recording rules for repeated expensive queries
- keep local safety alerts in Prometheus
- monitor scrape health and active series
Avoid:
- raw user IDs as labels
- raw URL paths as labels
- huge scrape payloads
- long expensive dashboard queries without recording rules
Grafana
Owns:
- dashboards
- Explore
- data source UX
- alert UI
- correlations
- incident views
Grafana does not store Prometheus metrics or Loki logs by itself. It queries backends.
Typical data sources:
- Prometheus or Mimir for metrics
- Loki for logs
- Tempo for traces
- Pyroscope for profiles
- Alertmanager for alert state
Grafana Alloy
Owns:
- telemetry collection
- scraping
- log tailing
- OTLP receiving
- local processing
- forwarding
Metrics example:
prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}
prometheus.scrape "apps" {
  targets = [{
    "__address__" = "checkout:8080",
    "job"         = "checkout",
  }]
  forward_to = [prometheus.remote_write.mimir.receiver]
}
Logs example:
loki.write "default" {
  endpoint {
    url = "https://loki.example.com/loki/api/v1/push"
  }
}
loki.source.file "app" {
  targets = [{
    __path__ = "/var/log/app.log",
    job      = "checkout",
  }]
  forward_to = [loki.write.default.receiver]
}
Loki
Owns:
- log ingestion
- log stream indexing
- LogQL
- Loki Ruler
Query examples:
{app="checkout"} |= "error"
sum by (app) (
count_over_time({namespace="prod"} |= "panic" [5m])
)
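If the second query reads as dense, this Python sketch walks through the same three steps in order: select streams by label, filter lines by substring, then count per `app` inside a time window. It only mimics the semantics on an in-memory list. The function name, the `(labels, timestamp, line)` tuple shape, and the half-open window are assumptions of the sketch, not Loki internals.

```python
from collections import Counter
from datetime import datetime, timedelta


def panic_counts_by_app(entries, now, window=timedelta(minutes=5)):
    """Rough equivalent of:
    sum by (app) (count_over_time({namespace="prod"} |= "panic" [5m]))
    """
    start = now - window
    counts = Counter()
    for labels, ts, line in entries:
        if labels.get("namespace") != "prod":  # stream selector
            continue
        if not (start < ts <= now):  # range window (assumed half-open)
            continue
        if "panic" in line:  # line filter, like |= "panic"
            counts[labels.get("app", "")] += 1  # sum by (app)
    return dict(counts)


now = datetime(2024, 1, 1, 12, 0, 0)
entries = [
    ({"namespace": "prod", "app": "checkout"}, now - timedelta(minutes=1), "panic: nil pointer"),
    ({"namespace": "prod", "app": "checkout"}, now - timedelta(minutes=10), "panic: old"),
    ({"namespace": "prod", "app": "cart"}, now - timedelta(minutes=2), "request ok"),
    ({"namespace": "dev", "app": "checkout"}, now - timedelta(minutes=1), "panic: dev only"),
]
result = panic_counts_by_app(entries, now)
```

Note that the label check runs before any line is inspected. That ordering is why narrow stream selectors matter more for query cost than clever line filters.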
Do:
- use labels for stable stream identity
- keep high-cardinality data in log body
- use Loki Ruler for log-native alerts
Avoid:
- request_id as a Loki label
- user_id as a Loki label
- full log message as a label
Mimir
Owns:
- central Prometheus-compatible metrics storage
- tenant isolation
- long retention
- scalable query path
- Mimir Ruler
- Mimir Alertmanager in some deployments
Typical flow:
Prometheus / Alloy / OTel Collector
-> remote_write
-> Mimir distributor
-> ingester
-> object storage
-> store-gateway
-> querier/query-frontend
Use Mimir Ruler for:
- global SLOs
- central recording rules
- tenant-wide alerts
- multi-cluster service health
Thanos
Owns:
- global query across Prometheus instances
- object-storage-backed historical metrics
- deduplication of HA Prometheus pairs
- store gateway and compaction
Common flow:
Prometheus
-> Thanos sidecar
-> object storage
-> Thanos store gateway
-> Thanos query
-> Grafana
Choose Thanos when you want to keep Prometheus local and add a global query layer.
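The "deduplication of HA Prometheus pairs" item is easier to picture with a toy example. The Python sketch below merges series that differ only by a replica label and keeps one sample per timestamp. Thanos's real deduplication is more involved than this; the label name `replica`, the function name, and the input shape are assumptions made for illustration.

```python
def deduplicate(series, replica_label="replica"):
    """Merge series that are identical except for the replica label.

    `series` is a list of (labels_dict, [(timestamp, value), ...]) pairs
    (hypothetical shape). The first replica seen wins for each timestamp.
    """
    merged = {}
    for labels, samples in series:
        # Identity of the series once the replica label is dropped.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


series = [
    ({"__name__": "up", "job": "node", "replica": "a"}, [(1, 1.0), (2, 1.0)]),
    ({"__name__": "up", "job": "node", "replica": "b"}, [(2, 1.0), (3, 0.0)]),
]
deduped = deduplicate(series)
```

The point for operators: each Prometheus in an HA pair needs a distinguishing external label, and the query layer needs to know which label that is.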
Cortex
Owns:
- central remote-write metrics backend
- multi-tenancy
- Prometheus-compatible querying
- Cortex Ruler and Alertmanager integrations
For most greenfield Grafana ecosystem deployments, compare Cortex carefully against Mimir. Cortex is still relevant where it is already deployed or compatibility is required.
Alertmanager
Owns:
- grouping
- deduplication
- silences
- inhibition
- notification routing
Does not own:
- PromQL evaluation
- LogQL evaluation
- metric storage
- log storage
Route example:
route:
  group_by: ["team", "alertname", "environment"]
  receiver: default
  routes:
    - matchers:
        - team="checkout"
        - severity="page"
      receiver: checkout-pagerduty
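To see what the route above does to an incoming alert, here is a small Python sketch of the two decisions involved: which receiver gets the alert, and which group it lands in. It handles only flat routes with equality matchers. Nested routes, regex matchers, and `continue` are left out, and the function names and dict shapes are assumptions of the sketch rather than Alertmanager's code.

```python
def pick_receiver(alert_labels, routes, default="default"):
    """Return the receiver of the first route whose matchers all match."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default  # fall back to the top-level receiver


def group_key(alert_labels, group_by):
    """Alerts that share this key are batched into one notification."""
    return tuple(alert_labels.get(name, "") for name in group_by)


routes = [
    {"match": {"team": "checkout", "severity": "page"}, "receiver": "checkout-pagerduty"},
]
group_by = ["team", "alertname", "environment"]

page = {"team": "checkout", "severity": "page", "alertname": "HighErrorRate", "environment": "prod"}
ticket = {"team": "checkout", "severity": "ticket", "alertname": "DiskFilling", "environment": "prod"}
```

Both matchers must hold for the child route to apply, so the `ticket` alert falls through to `default` even though its team matches.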
Blackbox Exporter
Owns:
- synthetic HTTP probes
- TCP connect checks
- ICMP checks
- TLS checks
Prometheus scrape model:
Prometheus
-> /probe?target=https://example.com&module=http_2xx
-> Blackbox Exporter
-> real target URL
PromQL:
probe_success{job="url-monitors"} == 0
probe_duration_seconds{job="url-monitors"} > 2
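The `/probe` request in the flow above is just an HTTP GET with the real target passed as a query parameter. A minimal Python sketch of building that URL, mainly to show that the target must be URL-encoded. The helper name and the exporter address `blackbox-exporter:9115` are assumptions for illustration.

```python
from urllib.parse import urlencode


def probe_url(exporter, target, module="http_2xx"):
    """Build the URL Prometheus would scrape on the Blackbox Exporter."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


url = probe_url("blackbox-exporter:9115", "https://example.com")
```

Fetching that URL by hand is a quick way to debug a failing probe before touching any Prometheus config.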
OpenTelemetry
Owns:
- instrumentation APIs
- semantic conventions
- OTLP protocol
- traces, metrics, logs signal model
- collector pipeline concepts
Useful when:
- you need vendor-neutral instrumentation
- traces are first-class
- multiple languages/services need consistent telemetry
- you want OTLP as a standard transport
Signal Cheat Sheet
| Signal | Best for | Query language | Backend |
|---|---|---|---|
| Metrics | rates, counts, SLOs, alerts | PromQL | Prometheus, Mimir, Thanos, Cortex |
| Logs | events, errors, audit, text search | LogQL | Loki |
| Traces | request path across services | TraceQL / trace lookup | Tempo |
| Profiles | code-level resource cost | profile queries | Pyroscope |
| Synthetic probes | outside-in uptime | PromQL over probe metrics | Blackbox Exporter + Prometheus |
Rule Cheat Sheet
Recording Rule
Creates a new metric.
- record: service:http_requests:rate5m
  expr: |
    sum by (service, status) (
      rate(http_requests_total[5m])
    )
Use for:
- expensive repeated PromQL
- dashboards
- SLO base metrics
- shared derived metrics
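What the recording rule precomputes is an aggregation: many per-instance series collapse into one series per `(service, status)` pair. This Python sketch shows that collapse on already-computed rates. The helper name and the `(labels, value)` input shape are assumptions of the sketch.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values across series, keeping only the labels named in `keep`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in keep)
        totals[key] += value
    return dict(totals)


# Per-instance request rates (requests per second), as rate() would yield.
rates = [
    ({"service": "checkout", "status": "200", "instance": "a"}, 1.5),
    ({"service": "checkout", "status": "200", "instance": "b"}, 2.5),
    ({"service": "checkout", "status": "500", "instance": "a"}, 0.25),
]
recorded = sum_by(rates, keep=("service", "status"))
```

Dashboards then read the small recorded series instead of re-aggregating every instance on each refresh.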
Alerting Rule
Creates an alert.
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.02
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High error rate"
Use for:
- notifying humans
- triggering automation
- routeable incident signals
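The `for: 10m` clause in the alerting rule is the part people most often misread: the condition has to stay true across consecutive evaluations for the whole duration before the alert fires. A simplified Python sketch of that state machine follows. The function name and the `(timestamp_seconds, condition_true)` input are assumptions, and details such as `keep_firing_for` are ignored.

```python
def alert_states(evaluations, hold_for):
    """Map rule evaluations to alert states.

    `evaluations` is a list of (timestamp_seconds, condition_true) pairs in
    time order; `hold_for` is the `for` duration in seconds.
    """
    states = []
    active_since = None
    for ts, condition_true in evaluations:
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= hold_for else "pending")
    return states


evaluations = [(0, True), (300, True), (600, True), (900, False), (1200, True)]
states = alert_states(evaluations, hold_for=600)
```

A single healthy evaluation sends the alert back to the start, which is why a flapping condition with a long `for` may never page anyone.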
Label Cheat Sheet
Good labels:
- service
- job
- instance
- cluster
- namespace
- team
- environment
- status
- method
Dangerous labels:
- user_id
- request_id
- session_id
- raw URL path
- full exception text
- log message
Rule:
Labels are for bounded dimensions you aggregate or route by.
High-cardinality data belongs in logs, traces, exemplars, or annotations.
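A quick back-of-the-envelope calculation shows why the rule matters. In the worst case, the number of series for one metric is the product of the distinct values of each label. The Python sketch below uses made-up label counts purely for illustration.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_value_counts.values())


bounded = {"service": 20, "status": 5, "method": 4}
with_user_id = {**bounded, "user_id": 100_000}

safe = max_series(bounded)
risky = max_series(with_user_id)
```

Here `safe` is 400 series and `risky` is 40,000,000 for the same metric. One unbounded label is enough to change the order of magnitude.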
Common Architectures
Small Team
Prometheus + Grafana + Alertmanager
Add Loki when logs become important.
Kubernetes Platform
Alloy / exporters
-> Prometheus or Mimir
-> Loki
-> Tempo
-> Grafana
-> Alertmanager
Central Observability Platform
Alloy everywhere
-> Mimir for metrics
-> Loki for logs
-> Tempo for traces
-> Pyroscope for profiles
-> Grafana for UX
-> Alertmanager for notifications
Existing Prometheus Fleet
Prometheus per cluster
-> Thanos sidecar
-> object storage
-> Thanos query
-> Grafana
Final Cheat Sheet
- Prometheus scrapes metrics.
- Mimir stores central long-term metrics.
- Thanos gives global query over Prometheus fleets.
- Cortex is the older central multi-tenant metrics backend, the project Mimir was forked from.
- Loki stores logs.
- Tempo stores traces.
- Pyroscope stores profiles.
- Alloy collects and forwards telemetry.
- OpenTelemetry standardizes instrumentation and transport.
- Blackbox Exporter probes URLs, ports, ICMP, and TLS.
- Alertmanager routes notifications.
- Grafana is the interface tying it together.