Grafana Alloy, Fleet Management, Loki, Mimir, and Alerting
Grafana Alloy is the collector layer in a Grafana observability architecture. It can scrape Prometheus metrics, collect logs, receive OpenTelemetry data, transform labels, fan out pipelines, and forward telemetry to backends like Mimir, Prometheus, Loki, Tempo, and Grafana Cloud.
The important mental model is this:
Fleet Management controls collector config.
Alloy collects and forwards telemetry.
Loki stores logs.
Prometheus or Mimir stores metrics.
Rulers evaluate rules.
Alertmanager routes notifications.
Grafana visualizes and manages the experience.
The Architecture in One View
Grafana Fleet Management
-> remote configuration pipelines
-> Alloy collectors on nodes, clusters, VMs, or services
Alloy metrics path:
targets / exporters / apps
-> prometheus.scrape
-> relabeling or processing
-> prometheus.remote_write
-> Prometheus, Mimir, or Grafana Cloud Metrics
Alloy logs path:
files / pods / Kubernetes events / syslog
-> loki.source.*
-> loki.process
-> loki.write
-> Loki or Grafana Cloud Logs
Alerting path:
Prometheus rules or Mimir Ruler or Loki Ruler or Grafana-managed alerts
-> Alertmanager
-> notification policies
-> Slack, PagerDuty, email, webhook, incident tools
Alloy does not replace Mimir, Loki, Prometheus, or Alertmanager. It sits at the edge and makes telemetry collection programmable.
What Fleet Management Adds
Without Fleet Management, every Alloy instance has local configuration. That works for small environments, but it gets painful when you have hundreds or thousands of collectors.
Fleet Management changes the control plane:
- Collectors register with a remotecfg block.
- Each collector has an ID and attributes like cluster, namespace, team, or owner.
- Grafana Cloud Fleet Management assigns configuration pipelines based on those attributes.
- Alloy periodically polls for its remote configuration.
- The fleet can be monitored centrally.
Minimal shape:
remotecfg {
  url = "https://fleet-management.example.grafana.net"
  basic_auth {
    username      = "collector"
    password_file = "/etc/alloy/fleet-token"
  }
  id = constants.hostname
  // Object entries are comma-separated in Alloy syntax.
  attributes = {
    cluster = "prod-us-east-1",
    team    = "platform",
  }
  poll_frequency = "5m"
}
The local file becomes the bootstrap. The real pipelines can be assigned remotely.
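Attribute-based assignment can be pictured with a small model. This is a sketch, not the Fleet Management API: `Pipeline`, `assign_pipelines`, and the pipeline names are hypothetical, and matching is reduced to exact equality for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical stand-in for a remotely managed configuration pipeline."""
    name: str
    matchers: dict = field(default_factory=dict)  # attribute name -> required value


def assign_pipelines(attributes, pipelines):
    """Return the names of pipelines whose matchers all equal the collector's attributes."""
    return [
        p.name
        for p in pipelines
        if all(attributes.get(key) == value for key, value in p.matchers.items())
    ]


pipelines = [
    Pipeline("base-self-monitoring"),  # no matchers: applies to every collector
    Pipeline("prod-node-logs", {"cluster": "prod-us-east-1"}),
    Pipeline("checkout-scrapes", {"team": "checkout"}),
]

# Attributes as declared in the remotecfg block above.
collector = {"cluster": "prod-us-east-1", "team": "platform"}
assigned = assign_pipelines(collector, pipelines)
print(assigned)
```

Changing the `team` attribute on a collector changes which pipelines it receives on the next poll, without touching the local bootstrap file.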
Alloy as the Metrics Collector
For metrics, Alloy usually behaves like a programmable Prometheus scraper plus remote-write client.
prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "platform-prod",
    }
  }
}
prometheus.scrape "apps" {
  targets = [
    {
      "__address__" = "checkout.default.svc.cluster.local:8080",
      "job"         = "checkout",
      "cluster"     = "prod-us-east-1",
    },
  ]
  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.mimir.receiver]
}
What happens:
- Alloy scrapes /metrics.
- It turns the response into Prometheus samples.
- It writes samples into the prometheus.remote_write component.
- Remote write keeps a WAL so temporary network failures do not immediately lose data.
- Samples are forwarded to Mimir, a Prometheus remote-write receiver, or Grafana Cloud Metrics.
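The WAL step is the one that is easiest to get wrong in your head, so here is a toy model of it. It is only a sketch: `BufferedRemoteWriter` is a hypothetical class, the buffer lives in memory rather than on disk, and a real WAL is bounded, so a long enough outage can still lose data.

```python
from collections import deque


class BufferedRemoteWriter:
    """Toy model: buffer samples locally and drop them only after a successful send."""

    def __init__(self, send):
        self._send = send        # callable that raises ConnectionError on failure
        self._pending = deque()  # stands in for the on-disk WAL

    def append(self, sample):
        self._pending.append(sample)

    def flush(self):
        """Send pending samples in order; stop at the first failure. Return the count sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # keep the sample and retry on the next flush
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self):
        return len(self._pending)


network = {"up": False}
received = []


def send(sample):
    if not network["up"]:
        raise ConnectionError("backend unreachable")
    received.append(sample)


writer = BufferedRemoteWriter(send)
writer.append(("http_requests_total", 1027.0))
writer.append(("http_requests_total", 1031.0))

sent_during_outage = writer.flush()    # nothing delivered, nothing dropped
backlog_during_outage = writer.backlog

network["up"] = True
sent_after_recovery = writer.flush()   # backlog drains in order
```

This is also why the remote-write queue and WAL metrics are worth watching: a growing backlog is the first sign that the backend is unreachable.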
Alloy as the Logs Collector
For logs, Alloy pipelines usually look like:
discover files or pods
-> read log lines
-> parse, label, redact, or drop
-> write to Loki
Example:
loki.write "default" {
  endpoint {
    url = "https://logs.example.com/loki/api/v1/push"
  }
}
local.file_match "node_logs" {
  path_targets = [{
    __path__  = "/var/log/syslog",
    job       = "node/syslog",
    cluster   = "prod-us-east-1",
    node_name = constants.hostname,
  }]
}
loki.source.file "node_logs" {
  targets    = local.file_match.node_logs.targets
  forward_to = [loki.process.node_logs.receiver]
}
loki.process "node_logs" {
  stage.static_labels {
    values = {
      env = "prod",
    }
  }
  forward_to = [loki.write.default.receiver]
}
Loki is not a metrics database. It stores log streams and indexes only their labels, not the log content. The label design matters: labels should identify streams, not every request, user ID, trace ID, or message fragment. High-cardinality fields should usually stay in the log body and be extracted at query time.
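To see why this matters, here is a small model of how label sets define streams. It is a sketch with made-up log lines; `group_into_streams` and `promote_to_label` are hypothetical helpers, not Loki functions.

```python
import json
from collections import defaultdict

# Three hypothetical log lines that share one label set.
entries = [
    ({"cluster": "prod-us-east-1", "job": "checkout"}, '{"user_id": "u-1", "status": 500}'),
    ({"cluster": "prod-us-east-1", "job": "checkout"}, '{"user_id": "u-2", "status": 200}'),
    ({"cluster": "prod-us-east-1", "job": "checkout"}, '{"user_id": "u-3", "status": 500}'),
]


def group_into_streams(entries):
    """A stream is identified by its full label set; equal label sets share a stream."""
    streams = defaultdict(list)
    for labels, line in entries:
        streams[frozenset(labels.items())].append(line)
    return streams


def promote_to_label(entries, field):
    """Anti-pattern, for comparison: copy a body field into the label set."""
    return [
        ({**labels, field: str(json.loads(line)[field])}, line)
        for labels, line in entries
    ]


streams = group_into_streams(entries)
streams_with_user_label = group_into_streams(promote_to_label(entries, "user_id"))

# Query-time extraction: parse the body only when a query needs the field.
failed_users = [
    json.loads(line)["user_id"]
    for lines in streams.values()
    for line in lines
    if json.loads(line)["status"] == 500
]

print(len(streams), len(streams_with_user_label), failed_users)
```

With stable labels the three lines share one stream. Promoting `user_id` to a label creates one stream per user, while the query-time filter answers the same question without any extra streams.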
Where Alertmanager Fits
Alertmanager is the notification router. It does not decide whether CPU is high, whether error rate is bad, or whether a log line is dangerous. Rule engines decide that.
Alertmanager handles:
- grouping
- deduplication
- silences
- inhibition
- routing
- receiver integrations
- notification templates
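Grouping and deduplication are the two behaviors people most often attribute to the rule engine by mistake. A minimal sketch of both, assuming alerts are plain dictionaries and using hypothetical helper names (`fingerprint`, `group_alerts`) rather than anything from Alertmanager's code:

```python
from collections import defaultdict


def fingerprint(alert):
    """Alerts with the same label set are treated as the same alert."""
    return frozenset(alert["labels"].items())


def group_alerts(alerts, group_by):
    """Drop duplicate label sets, then bucket alerts by the values of the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in seen:
            continue  # for example, the same alert sent by two rule-engine replicas
        seen.add(fp)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert["labels"]["alertname"])
    return dict(groups)


alerts = [
    {"labels": {"alertname": "CheckoutHighErrorRate", "team": "checkout", "severity": "page"}},
    {"labels": {"alertname": "CheckoutHighErrorRate", "team": "checkout", "severity": "page"}},
    {"labels": {"alertname": "CheckoutLatencyHigh", "team": "checkout", "severity": "warn"}},
    {"labels": {"alertname": "NodeDown", "team": "platform", "severity": "page"}},
]

grouped = group_alerts(alerts, group_by=["team"])
print(grouped)
```

Four incoming alerts become two notifications: one for the checkout team with two distinct alerts, one for the platform team.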
The alert engines are usually one of these:
- Prometheus rule engine
- Mimir Ruler
- Loki Ruler
- Grafana-managed alerting
All of them can send firing alerts to Alertmanager.
Prometheus Rules vs Mimir Ruler
Prometheus and Mimir use the same PromQL rule model, but the responsibility boundary is different.
Prometheus Rules Excel When
Use Prometheus rules when alerting should stay local to the Prometheus server.
They are excellent for:
- single-cluster alerting
- node, pod, kubelet, API server, and local service health
- bootstrap alerts where the central backend might be down
- alerts that should fire even if remote-write is delayed
- teams that manage their own Prometheus and rules
- simple recording rules close to the scrape source
Example:
groups:
  - name: checkout-local
    interval: 30s
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
          > 0.02
        for: 10m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout has high 5xx rate"
This is ideal when the query is local, the blast radius is local, and you want failure isolation.
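The arithmetic behind that expression is worth doing once by hand. The sketch below uses made-up counter values and a simplified `per_second_rate` helper (hypothetical) that ignores counter resets and the extrapolation PromQL applies at window edges.

```python
def per_second_rate(first, last, window_seconds):
    """Simplified rate(): counter increase over the window divided by its length."""
    return (last - first) / window_seconds


WINDOW = 300  # the [5m] range in the rule

# Hypothetical counter values at the start and end of the window.
errors_5xx = per_second_rate(first=1_200, last=1_290, window_seconds=WINDOW)      # 0.3 requests/s
all_requests = per_second_rate(first=90_000, last=93_000, window_seconds=WINDOW)  # 10 requests/s

error_ratio = errors_5xx / all_requests
condition_true = error_ratio > 0.02  # the rule's threshold

print(f"{error_ratio:.3f}", condition_true)
```

A 3% error ratio exceeds the 2% threshold, so the condition is true for this evaluation. The alert still has to stay true for the full `for: 10m` before it fires.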
Mimir Ruler Excels When
Use Mimir Ruler when alerting should run against centralized, long-retention, multi-tenant metrics.
It is excellent for:
- tenant-scoped rule management
- centrally governed alerting
- global service SLOs across clusters
- recording rules that should be stored back into Mimir
- rules over long-retention data
- platform-owned rules shared across many teams
- large environments where Prometheus is mostly a scraper and remote-write client
Example:
groups:
  - name: checkout-global
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (tenant, cluster, job, status) (rate(http_requests_total[5m]))
      - alert: CheckoutGlobalErrorBudgetBurn
        expr: |
          sum(job:http_requests:rate5m{job="checkout",status=~"5.."})
          /
          sum(job:http_requests:rate5m{job="checkout"})
          > 0.01
        for: 15m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout is burning global error budget"
This is ideal when the signal spans clusters, regions, teams, or tenants.
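The two-step shape (record first, alert on the recorded series) can be sketched as follows. The rates are invented, the `tenant` label is left out for brevity, and `record_sum_by` and `total` are hypothetical helpers that only imitate `sum by (...)` and a fully anchored regex matcher.

```python
import re
from collections import defaultdict

# Hypothetical per-series 5m rates from two clusters.
raw_rates = [
    ({"cluster": "us", "job": "checkout", "status": "200", "pod": "a"}, 40.0),
    ({"cluster": "us", "job": "checkout", "status": "200", "pod": "b"}, 39.5),
    ({"cluster": "us", "job": "checkout", "status": "500", "pod": "b"}, 0.5),
    ({"cluster": "eu", "job": "checkout", "status": "200", "pod": "c"}, 19.25),
    ({"cluster": "eu", "job": "checkout", "status": "503", "pod": "c"}, 0.75),
]

BY = ("cluster", "job", "status")


def record_sum_by(series, by):
    """Imitate `sum by (...)`: keep only the `by` labels and add up matching series."""
    recorded = defaultdict(float)
    for labels, value in series:
        recorded[tuple(labels[name] for name in by)] += value
    return dict(recorded)


def total(recorded, job, status_regex=None):
    """Sum recorded series for a job, optionally filtered by a fully anchored status regex."""
    return sum(
        value
        for (_cluster, series_job, status), value in recorded.items()
        if series_job == job
        and (status_regex is None or re.fullmatch(status_regex, status))
    )


recorded = record_sum_by(raw_rates, BY)
error_ratio = total(recorded, "checkout", "5..") / total(recorded, "checkout")
burning = error_ratio > 0.01

print(len(raw_rates), len(recorded), f"{error_ratio:.4f}", burning)
```

The recording rule drops the per-pod dimension, so the alert query reads fewer series, and the ratio is computed across both clusters rather than per cluster.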
Rule Placement Decision
| Rule type | Prefer Prometheus | Prefer Mimir Ruler |
|---|---|---|
| Node down in one cluster | Yes | Usually no |
| Kubernetes API server down | Yes | Usually no |
| Global checkout SLO | No | Yes |
| Multi-cluster error budget burn | No | Yes |
| Central recording rules for dashboards | Sometimes | Yes |
| Alert if remote-write is failing | Yes | No |
| Tenant-wide quota or ingestion health | No | Yes |
| Bootstrap platform health | Yes | Sometimes |
The best systems often use both.
Prometheus handles local safety alerts. Mimir handles central product/platform alerts.
Loki Ruler: Alerting From Logs
Loki Ruler evaluates LogQL rules. It is useful when the source of truth is a log event rather than a metric.
Good fits:
- leaked credentials in logs
- panic or fatal errors
- audit events
- security events
- rare events that are not worth turning into permanent high-cardinality metrics
- black-box behavior visible only in logs
Example:
groups:
  - name: security-log-alerts
    rules:
      - alert: CredentialsInLogs
        expr: |
          sum by (cluster, namespace, pod) (
            count_over_time({namespace="prod"} |~ "http(s?)://[^ ]+:[^ ]+@" [5m])
          ) > 0
        for: 10m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Possible credentials leaked in logs"
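Before shipping a rule like this, it helps to check what the regular expression actually matches. The sketch below replays the same pattern against made-up log lines with Python's `re` module, whose syntax agrees with Loki's for this particular pattern. `count_over_time` here is a hypothetical stand-in with a simplified window, not the LogQL function.

```python
import re

# Same pattern as the rule; `|~` is an unanchored match, so search() is the analogue.
CREDENTIALS_IN_URL = re.compile(r"http(s?)://[^ ]+:[^ ]+@")


def count_over_time(lines, pattern, now, window_seconds):
    """Count matching lines whose timestamp falls in (now - window, now]."""
    return sum(
        1
        for ts, line in lines
        if now - window_seconds < ts <= now and pattern.search(line)
    )


# (timestamp in seconds, log line); all values are invented.
lines = [
    (100, "connecting to https://old_user:old_pw@db.example.com/orders"),
    (160, "connecting to https://svc_user:hunter2@db.example.com/orders"),
    (200, "GET https://api.example.com/v1/orders 200"),
    (400, "retrying http://admin:secret@cache.example.com:6379"),
]

matches_in_window = count_over_time(lines, CREDENTIALS_IN_URL, now=420, window_seconds=300)
plain_url_matches = bool(CREDENTIALS_IN_URL.search("GET https://api.example.com/v1/orders 200"))

print(matches_in_window, plain_url_matches)
```

Two lines inside the five-minute window match, the line at timestamp 100 has already aged out, and an ordinary URL without embedded credentials does not match.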
Loki recording rules can also generate metrics and remote-write those results to a Prometheus-compatible backend such as Mimir.
That pattern is powerful:
logs in Loki
-> Loki Ruler recording rule
-> generated metric
-> remote_write to Mimir
-> Mimir Ruler or Grafana dashboards use it later
Recommended Production Layout
For a mature Grafana stack, a clean architecture is:
Fleet Management
-> manages Alloy collector config
Alloy on every node/cluster
-> scrapes metrics
-> tails logs
-> adds cluster/team/env labels
-> remote-writes metrics to Mimir
-> writes logs to Loki
Mimir
-> stores long-term metrics
-> evaluates global rules with Mimir Ruler
-> sends alerts to Alertmanager
Prometheus
-> optional local scrape and local rule engine
-> catches local/bootstrap failures
-> can remote-write to Mimir
Loki
-> stores logs
-> Loki Ruler evaluates log alerts
-> sends alerts to Alertmanager
-> recording rules can write metrics to Mimir
Alertmanager
-> groups, dedupes, silences, inhibits, and routes alerts
Grafana
-> dashboards, Explore, alert views, rule management, incident workflow
The Most Important Labels
Good labels make the whole system work.
Use stable labels:
- cluster
- namespace
- job
- service
- team
- env
- region
- tenant
Avoid high-cardinality labels:
- request IDs
- trace IDs
- user IDs
- session IDs
- full URL paths with IDs
- raw exception messages
For metrics, high cardinality can make Mimir expensive and slow. For logs, high-cardinality labels can make Loki index-heavy. Keep high-cardinality values in log bodies or exemplars, not stream labels.
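A rough upper bound makes the cost concrete: the number of series for one metric is at most the product of the distinct values of each label. The counts below are invented for illustration, and `max_series` is a hypothetical helper.

```python
from math import prod


def max_series(distinct_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical counts of distinct values per label.
stable_labels = {"cluster": 4, "namespace": 10, "status": 5}
with_user_id = {**stable_labels, "user_id": 50_000}

bound_stable = max_series(stable_labels)
bound_with_user_id = max_series(with_user_id)

print(bound_stable, bound_with_user_id)
```

One unbounded label turns a bound of a few hundred series into millions. Real counts are usually below the bound because not every combination occurs, but the growth is still multiplicative.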
Example End-to-End Flow
Imagine checkout starts returning 5xx responses.
- Alloy scrapes checkout metrics every 30 seconds.
- Alloy remote-writes http_requests_total to Mimir.
- Mimir ingests the samples for tenant checkout-prod.
- Mimir Ruler evaluates the global error-rate rule.
- The alert enters pending for 15 minutes.
- The alert becomes firing.
- Mimir Ruler sends it to Alertmanager.
- Alertmanager groups it with related checkout alerts.
- Alertmanager routes it to PagerDuty and Slack.
- Grafana dashboards show Mimir metrics and Loki logs for the same service labels.
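The pending-to-firing transition in that flow follows from the rule's `for:` duration. A minimal sketch of the state logic, assuming one evaluation per minute and a hypothetical `alert_states` helper (real rule engines track more, such as resolved timestamps and state restoration after restarts):

```python
def alert_states(condition_by_eval, interval_seconds, for_seconds):
    """Return the state after each evaluation: 'inactive', 'pending', or 'firing'."""
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * interval_seconds
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Condition is false once, then true for 16 consecutive one-minute evaluations.
states = alert_states([False] + [True] * 16, interval_seconds=60, for_seconds=900)

# A short blip resets the pending timer.
blip = alert_states([True, True, False, True], interval_seconds=60, for_seconds=900)

print(states[1], states[15], states[16], blip)
```

The alert is pending from the first true evaluation and fires only once the condition has held for the full 15 minutes. A single false evaluation in between starts the wait over.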
If the problem is a panic in logs:
- Alloy tails pod logs.
- Alloy writes logs to Loki.
- Loki Ruler evaluates a LogQL rule.
- Loki sends the firing alert to Alertmanager.
- Alertmanager routes the notification.
Common Mistakes
Mistake 1: Putting All Alerts in Mimir
Do not move every alert to Mimir just because Mimir exists.
Keep local Prometheus alerts for things that must work during backend or network trouble:
- remote-write failing
- local node pressure
- kubelet down
- local scrape health
- critical platform bootstrap checks
Mistake 2: Putting Every Label on Every Signal
Do not copy all Kubernetes metadata into all metrics and logs.
Label only what you query, route, or aggregate by. Everything else belongs in annotations, log body, exemplars, traces, or metadata systems.
Mistake 3: Treating Alertmanager as the Rule Engine
Alertmanager routes alerts. It does not run PromQL or LogQL.
Prometheus, Mimir Ruler, Loki Ruler, or Grafana Alerting evaluate the data. Alertmanager handles notification behavior.
Mistake 4: Using Loki for Metrics
Loki can produce metrics from logs, but it should not replace normal metrics instrumentation. If a value is needed continuously for dashboards, SLOs, or autoscaling, expose it as a metric and send it to Prometheus/Mimir.
Use Loki-derived metrics for signals that naturally originate in logs.
Operating Checklist
For Alloy:
- Scrape Alloy's own health metrics.
- Watch remote-write queue and WAL metrics.
- Keep stable cluster, env, team, and service labels.
- Use Fleet Management attributes intentionally.
- Avoid duplicate scrapes from multiple Alloy instances unless clustering or sharding is configured.
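The point of sharding is that each target has exactly one owner. The sketch below shows the idea with a stable hash and a modulo. It is an illustration only, not Alloy's clustering algorithm; `owner`, `shard`, and the instance names are hypothetical.

```python
import hashlib


def owner(target, instances):
    """Pick one instance per target from a stable hash of the target address."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


def shard(targets, instances):
    """Assign every target to exactly one instance."""
    assignment = {name: [] for name in instances}
    for target in targets:
        assignment[owner(target, instances)].append(target)
    return assignment


instances = ["alloy-0", "alloy-1", "alloy-2"]
targets = [f"checkout-{i}.default.svc.cluster.local:8080" for i in range(6)]

assignment = shard(targets, instances)
for name, owned in assignment.items():
    print(name, len(owned))
```

Every instance can compute the same assignment independently, so no target is scraped twice. A plain modulo reshuffles most targets when the instance count changes, which is one reason production systems tend to use consistent hashing instead.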
For Mimir:
- Set tenant limits.
- Monitor ingester memory and WAL.
- Monitor ruler evaluation duration and missed evaluations.
- Keep recording rules from producing uncontrolled cardinality.
- Route ruler alerts to the correct Alertmanager URL.
For Loki:
- Keep labels low-cardinality.
- Monitor ingestion rate, rejected samples, query latency, and ruler WAL health.
- Use Loki Ruler for log-native alerts.
- Remote-write Loki recording-rule metrics to Mimir only when those metrics are useful later.
For Alertmanager:
- Design routes by team, severity, service, and env.
- Use inhibition to suppress noisy symptom alerts during root-cause alerts.
- Test silences and notification templates.
- Run HA Alertmanager for production notification reliability.
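Inhibition is the least intuitive item on this list, so here is a sketch of the logic. Alerts are modeled as plain label dictionaries, and the rule layout (`source_match`, `target_match`, `equal`) and the `is_inhibited` helper are simplified assumptions for illustration, not Alertmanager's configuration schema.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the required value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if a firing source alert matches the rule and shares the `equal` labels."""
    if not matches(target, rule["target_match"]):
        return False
    return any(
        matches(source, rule["source_match"])
        and all(source.get(name) == target.get(name) for name in rule["equal"])
        for source in firing
    )


# Suppress warnings while a critical alert is firing in the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}

firing = [
    {"alertname": "NodeDown", "severity": "critical", "cluster": "prod-us-east-1"},
    {"alertname": "CheckoutLatencyHigh", "severity": "warning", "cluster": "prod-us-east-1"},
    {"alertname": "CheckoutLatencyHigh", "severity": "warning", "cluster": "prod-eu-west-1"},
]

to_notify = [
    (alert["alertname"], alert["cluster"])
    for alert in firing
    if not is_inhibited(alert, firing, rule)
]
print(to_notify)
```

The latency warning in the cluster with the node outage is suppressed, while the same warning in the other cluster still notifies because no critical alert shares its `cluster` label.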
Final Architecture Rule
Use Alloy for collection and forwarding.
Use Loki for logs.
Use Prometheus for local scrape and local safety rules.
Use Mimir for centralized metrics and global rules.
Use Loki Ruler for log-native rules.
Use Alertmanager for notification routing.
Use Fleet Management when local collector config becomes a fleet-scale operations problem.