May 30, 2026

Grafana Alloy, Fleet Management, Loki, Mimir, and Alerting

Grafana Alloy is the collector layer in a Grafana observability architecture. It can scrape Prometheus metrics, collect logs, receive OpenTelemetry data, transform labels, fan out pipelines, and forward telemetry to backends like Mimir, Prometheus, Loki, Tempo, and Grafana Cloud.

The important mental model is this:

Fleet Management controls collector config.
Alloy collects and forwards telemetry.
Loki stores logs.
Prometheus or Mimir stores metrics.
Rulers evaluate rules.
Alertmanager routes notifications.
Grafana visualizes and manages the experience.

The Architecture in One View

Grafana Fleet Management
  -> remote configuration pipelines
  -> Alloy collectors on nodes, clusters, VMs, or services

Alloy metrics path:
  targets / exporters / apps
  -> prometheus.scrape
  -> relabeling or processing
  -> prometheus.remote_write
  -> Prometheus, Mimir, or Grafana Cloud Metrics

Alloy logs path:
  files / pods / Kubernetes events / syslog
  -> loki.source.*
  -> loki.process
  -> loki.write
  -> Loki or Grafana Cloud Logs

Alerting path:
  Prometheus rules or Mimir Ruler or Loki Ruler or Grafana-managed alerts
  -> Alertmanager
  -> notification policies
  -> Slack, PagerDuty, email, webhook, incident tools

Alloy does not replace Mimir, Loki, Prometheus, or Alertmanager. It sits at the edge and makes telemetry collection programmable.

What Fleet Management Adds

Without Fleet Management, every Alloy instance is configured from its own local file. That works for small environments, but it gets painful when you have hundreds or thousands of collectors.

Fleet Management changes the control plane:

  • Collectors register with a remotecfg block.
  • Each collector has an ID and attributes like cluster, namespace, team, or owner.
  • Grafana Cloud Fleet Management assigns configuration pipelines based on those attributes.
  • Alloy periodically polls for its remote configuration.
  • The fleet can be monitored centrally.

Minimal shape:

remotecfg {
  url = "https://fleet-management.example.grafana.net"

  basic_auth {
    username      = "collector"
    password_file = "/etc/alloy/fleet-token"
  }

  id             = constants.hostname
  attributes     = {
    cluster = "prod-us-east-1"
    team    = "platform"
  }
  poll_frequency = "5m"
}

The local file becomes the bootstrap: it only needs enough configuration to reach Fleet Management and identify the collector. The real pipelines can then be assigned remotely.
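To make attribute-based assignment concrete, here is a minimal Python sketch of the idea. It is not the Fleet Management API: the Pipeline class, the assigned_pipelines function, and the pipeline names are hypothetical, and it only models exact-equality matching on attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    name: str
    matchers: dict[str, str]  # attribute name -> required value (equality only)


def assigned_pipelines(attributes: dict[str, str], pipelines: list[Pipeline]) -> list[str]:
    """Return the names of pipelines whose matchers all equal the collector's attributes."""
    return [
        p.name
        for p in pipelines
        if all(attributes.get(key) == value for key, value in p.matchers.items())
    ]


# Hypothetical pipelines; the attribute values mirror the remotecfg example above.
pipelines = [
    Pipeline("node-logs", {"team": "platform"}),
    Pipeline("prod-metrics", {"cluster": "prod-us-east-1"}),
    Pipeline("checkout-traces", {"team": "checkout"}),
]
collector = {"cluster": "prod-us-east-1", "team": "platform"}

print(assigned_pipelines(collector, pipelines))  # ['node-logs', 'prod-metrics']
```

The point of the sketch is the direction of control: the collector only declares who it is, and the assignment logic lives centrally.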

Alloy as the Metrics Collector

For metrics, Alloy usually behaves like a programmable Prometheus scraper plus remote-write client.

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"

    headers = {
      "X-Scope-OrgID" = "platform-prod"
    }
  }
}

prometheus.scrape "apps" {
  targets = [
    {
      "__address__" = "checkout.default.svc.cluster.local:8080",
      "job"         = "checkout",
      "cluster"     = "prod-us-east-1",
    },
  ]

  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.mimir.receiver]
}

What happens:

  1. Alloy scrapes /metrics.
  2. It turns the response into Prometheus samples.
  3. It writes samples into the prometheus.remote_write component.
  4. Remote write keeps a WAL so temporary network failures do not immediately lose data.
  5. Samples are forwarded to Mimir, a Prometheus remote-write receiver, or Grafana Cloud Metrics.
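Steps 3 to 5 can be pictured with a toy buffer. This is a rough Python sketch of the "keep it until it is delivered" behaviour, assuming an in-memory queue instead of a real on-disk WAL. BufferedWriter and fake_send are made-up names for illustration and do not reflect Alloy internals.

```python
from collections import deque
from typing import Callable

Sample = tuple[str, float, float]  # (series name, unix timestamp, value)


class BufferedWriter:
    """Toy stand-in for a WAL-backed remote-write queue (memory only, no disk)."""

    def __init__(self, send: Callable[[list[Sample]], bool], batch_size: int = 2) -> None:
        self.send = send
        self.batch_size = batch_size
        self.pending: deque[Sample] = deque()

    def append(self, sample: Sample) -> None:
        self.pending.append(sample)

    def flush(self) -> int:
        """Send batches in order; stop at the first failure and keep the rest."""
        sent = 0
        while self.pending:
            batch = list(self.pending)[: self.batch_size]
            if not self.send(batch):
                break  # keep unsent samples for the next attempt
            for _ in batch:
                self.pending.popleft()
            sent += len(batch)
        return sent


network = {"up": False}
received: list[Sample] = []


def fake_send(batch: list[Sample]) -> bool:
    """Pretend backend: rejects everything while the network is down."""
    if not network["up"]:
        return False
    received.extend(batch)
    return True


writer = BufferedWriter(fake_send)
for t in range(3):
    writer.append(("http_requests_total", float(t), float(t * 10)))

sent_during_outage = writer.flush()   # nothing delivered, nothing dropped
network["up"] = True
sent_after_recovery = writer.flush()  # buffered samples drain in order
print(sent_during_outage, sent_after_recovery, len(writer.pending))  # 0 3 0
```

A real WAL also bounds how long data is retained, so a long enough outage still loses samples. That is why the remote-write queue is worth monitoring.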

Alloy as the Logs Collector

For logs, Alloy pipelines usually look like:

discover files or pods
  -> read log lines
  -> parse, label, redact, or drop
  -> write to Loki

Example:

loki.write "default" {
  endpoint {
    url = "https://logs.example.com/loki/api/v1/push"
  }
}

local.file_match "node_logs" {
  path_targets = [{
    __path__  = "/var/log/syslog",
    job       = "node/syslog",
    cluster   = "prod-us-east-1",
    node_name = constants.hostname,
  }]
}

loki.source.file "node_logs" {
  targets    = local.file_match.node_logs.targets
  forward_to = [loki.process.node_logs.receiver]
}

loki.process "node_logs" {
  stage.static_labels {
    values = {
      env = "prod"
    }
  }

  forward_to = [loki.write.default.receiver]
}

Loki is not a metrics database. It stores log streams and indexes them by their labels, not by the content of each line. The label design matters: labels should identify streams, not every request, user ID, trace ID, or message fragment. High-cardinality fields should usually stay in the log body and be extracted at query time.

Where Alertmanager Fits

Alertmanager is the notification router. It does not decide whether CPU is high, whether error rate is bad, or whether a log line is dangerous. Rule engines decide that.

Alertmanager handles:

  • grouping
  • deduplication
  • silences
  • inhibition
  • routing
  • receiver integrations
  • notification templates
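Grouping and deduplication are the two items that most often confuse people, so here is a small Python sketch of the concept. It is a simplification under stated assumptions: alerts are plain label dictionaries, duplicates are alerts with identical label sets, and group_alerts is a hypothetical helper, not Alertmanager code.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple, list[dict[str, str]]]:
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels."""
    groups: dict[tuple, list[dict[str, str]]] = defaultdict(list)
    seen: set[tuple] = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set already received: deduplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "CheckoutHighErrorRate", "team": "checkout", "cluster": "prod-us-east-1"},
    # The same alert sent twice, for example by two redundant rule evaluators.
    {"alertname": "CheckoutHighErrorRate", "team": "checkout", "cluster": "prod-us-east-1"},
    {"alertname": "CheckoutHighLatency", "team": "checkout", "cluster": "prod-us-east-1"},
    {"alertname": "NodeDown", "team": "platform", "cluster": "prod-us-east-1"},
]

groups = group_alerts(alerts, ["team", "cluster"])
for key, members in groups.items():
    print(key, [a["alertname"] for a in members])
# ('checkout', 'prod-us-east-1') ['CheckoutHighErrorRate', 'CheckoutHighLatency']
# ('platform', 'prod-us-east-1') ['NodeDown']
```

Each group would become one notification, which is why the choice of grouping labels decides how noisy a page feels.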

The alert engines are usually one of these:

  • Prometheus rule engine
  • Mimir Ruler
  • Loki Ruler
  • Grafana-managed alerting

Each of them can send its firing alerts to an Alertmanager.

Prometheus Rules vs Mimir Ruler

Prometheus and Mimir use the same PromQL rule model, but the responsibility boundary is different.

Prometheus Rules Excel When

Use Prometheus rules when alerting should stay local to the Prometheus server.

They are excellent for:

  • single-cluster alerting
  • node, pod, kubelet, API server, and local service health
  • bootstrap alerts where the central backend might be down
  • alerts that should fire even if remote-write is delayed
  • teams that manage their own Prometheus and rules
  • simple recording rules close to the scrape source

Example:

groups:
  - name: checkout-local
    interval: 30s
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
          > 0.02
        for: 10m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout has high 5xx rate"

This is ideal when the query is local, the blast radius is local, and you want failure isolation.
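To see what the expression in that rule computes, here is the arithmetic worked through in Python with made-up counter values. The per_second_rate helper is a deliberate simplification of PromQL's rate(): it ignores counter resets and extrapolation.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Simplified rate(): counter increase divided by the window length."""
    return (last - first) / window_seconds


window = 300.0  # the [5m] range in the rule

# Hypothetical counter readings at the start and end of the window.
errors = per_second_rate(1_200.0, 1_290.0, window)   # 90 / 300 = 0.3 errors per second
total = per_second_rate(50_000.0, 53_000.0, window)  # 3000 / 300 = 10 requests per second

ratio = errors / total
print(f"{ratio:.3f}", ratio > 0.02)  # 0.030 True
```

A 3% error ratio is above the 2% threshold, so the condition is true for this evaluation. The alert still only fires once the condition has held for the full "for" duration.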

Mimir Ruler Excels When

Use Mimir Ruler when alerting should run against centralized, long-retention, multi-tenant metrics.

It is excellent for:

  • tenant-scoped rule management
  • centrally governed alerting
  • global service SLOs across clusters
  • recording rules that should be stored back into Mimir
  • rules over long-retention data
  • platform-owned rules shared across many teams
  • large environments where Prometheus is mostly a scraper and remote-write client

Example:

groups:
  - name: checkout-global
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (tenant, cluster, job, status) (rate(http_requests_total[5m]))

      - alert: CheckoutGlobalErrorBudgetBurn
        expr: |
          sum(job:http_requests:rate5m{job="checkout",status=~"5.."})
          /
          sum(job:http_requests:rate5m{job="checkout"})
          > 0.01
        for: 15m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout is burning global error budget"

This is ideal when the signal spans clusters, regions, teams, or tenants.

Rule Placement Decision

Rule type                                Prefer Prometheus   Prefer Mimir Ruler
---------------------------------------  ------------------  ------------------
Node down in one cluster                 Yes                 Usually no
Kubernetes API server down               Yes                 Usually no
Global checkout SLO                      No                  Yes
Multi-cluster error budget burn          No                  Yes
Central recording rules for dashboards   Sometimes           Yes
Alert if remote-write is failing         Yes                 No
Tenant-wide quota or ingestion health    No                  Yes
Bootstrap platform health                Yes                 Sometimes

The best systems often use both.

Prometheus handles local safety alerts. Mimir handles central product/platform alerts.

Loki Ruler: Alerting From Logs

Loki Ruler evaluates LogQL rules. It is useful when the source of truth is a log event rather than a metric.

Good fits:

  • leaked credentials in logs
  • panic or fatal errors
  • audit events
  • security events
  • rare events that are not worth turning into permanent high-cardinality metrics
  • black-box behavior visible only in logs

Example:

groups:
  - name: security-log-alerts
    rules:
      - alert: CredentialsInLogs
        expr: |
          sum by (cluster, namespace, pod) (
            count_over_time({namespace="prod"} |~ "http(s?)://[^ ]+:[^ ]+@" [5m])
          ) > 0
        for: 10m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Possible credentials leaked in logs"
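Before shipping a rule like this, it helps to try the line filter against sample lines. Here is a quick Python check using the same pattern. The log lines are invented, and this particular expression uses only syntax that Python's re module and the RE2 engine behind LogQL treat the same way.

```python
import re

# Same pattern as the LogQL line filter in the rule above.
CREDENTIAL_URL = re.compile(r"http(s?)://[^ ]+:[^ ]+@")

lines = [
    "connecting to https://admin:s3cret@db.internal:5432/app",  # real leak
    "GET https://example.com:8443/health 200",                  # port, no credentials
    "fetching http://example.com/index.html",                   # plain URL
    "GET https://example.com:8443/reset?email=a@b.com 200",     # false positive
]

matches = [bool(CREDENTIAL_URL.search(line)) for line in lines]
print(matches)  # [True, False, False, True]
```

The last line matches because a port followed by an "@" later in the URL looks like user:password@ to this pattern. That is a good reason to keep the word "possible" in the alert summary.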

Loki recording rules can also generate metrics and remote-write those results to a Prometheus-compatible backend such as Mimir.

That pattern is powerful:

logs in Loki
  -> Loki Ruler recording rule
  -> generated metric
  -> remote_write to Mimir
  -> Mimir Ruler or Grafana dashboards use it later

Recommended Production Layout

For a mature Grafana stack, a clean architecture is:

Fleet Management
  -> manages Alloy collector config

Alloy on every node/cluster
  -> scrapes metrics
  -> tails logs
  -> adds cluster/team/env labels
  -> remote-writes metrics to Mimir
  -> writes logs to Loki

Mimir
  -> stores long-term metrics
  -> evaluates global rules with Mimir Ruler
  -> sends alerts to Alertmanager

Prometheus
  -> optional local scrape and local rule engine
  -> catches local/bootstrap failures
  -> can remote-write to Mimir

Loki
  -> stores logs
  -> Loki Ruler evaluates log alerts
  -> sends alerts to Alertmanager
  -> recording rules can write metrics to Mimir

Alertmanager
  -> groups, dedupes, silences, inhibits, and routes alerts

Grafana
  -> dashboards, Explore, alert views, rule management, incident workflow

The Most Important Labels

Good labels make the whole system work.

Use stable labels:

  • cluster
  • namespace
  • job
  • service
  • team
  • env
  • region
  • tenant

Avoid high-cardinality labels:

  • request IDs
  • trace IDs
  • user IDs
  • session IDs
  • full URL paths with IDs
  • raw exception messages

For metrics, high cardinality can make Mimir expensive and slow. For logs, high-cardinality labels can make Loki index-heavy. Keep high-cardinality values in log bodies or exemplars, not stream labels.
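The cost of a single bad label is easiest to see as multiplication. Here is a rough Python estimate that assumes made-up value counts per label. It gives an upper bound, since not every combination of values occurs in practice.

```python
from math import prod


def stream_upper_bound(label_values: dict[str, int]) -> int:
    """Upper bound on distinct series or streams: the product of values per label."""
    return prod(label_values.values())


# Hypothetical number of distinct values for each label.
stable = {"cluster": 4, "namespace": 30, "service": 50, "env": 3}
with_user_id = {**stable, "user_id": 100_000}

print(stream_upper_bound(stable))        # 18000
print(stream_upper_bound(with_user_id))  # 1800000000
```

One unbounded label multiplies everything that came before it, which is why it belongs in the log body rather than in the label set.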

Example End-to-End Flow

Imagine checkout starts returning 5xx responses.

  1. Alloy scrapes checkout metrics every 30 seconds.
  2. Alloy remote-writes http_requests_total to Mimir.
  3. Mimir ingests the samples for tenant platform-prod.
  4. Mimir Ruler evaluates the global error-rate rule.
  5. The alert enters pending and stays there until the condition has held for 15 minutes.
  6. The alert becomes firing.
  7. Mimir Ruler sends it to Alertmanager.
  8. Alertmanager groups it with related checkout alerts.
  9. Alertmanager routes it to PagerDuty and Slack.
  10. Grafana dashboards show Mimir metrics and Loki logs for the same service labels.
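Steps 4 to 6 are a small state machine. Here is a Python sketch of the pending-to-firing transition, assuming one evaluation per interval and a condition that is simply true or false each time. alert_states is a hypothetical helper that models the "for" clause, not ruler code.

```python
from typing import Optional


def alert_states(condition_by_eval: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each evaluation, given a 'for' duration."""
    states: list[str] = []
    active_since: Optional[int] = None
    for i, condition in enumerate(condition_by_eval):
        now = i * interval_s
        if not condition:
            active_since = None  # any false evaluation resets the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# 1m evaluation interval and for: 15m, as in the global rule example.
steady = alert_states([True] * 17, interval_s=60, for_s=900)
print(steady.count("pending"), steady[15])  # 15 firing

# A single healthy evaluation in the middle restarts the wait.
flapping = alert_states([True] * 10 + [False] + [True] * 10, interval_s=60, for_s=900)
print("firing" in flapping)  # False
```

The second case shows the trade-off of a long "for": it filters out blips, but a flapping problem may never page anyone.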

If the problem is a panic in logs:

  1. Alloy tails pod logs.
  2. Alloy writes logs to Loki.
  3. Loki Ruler evaluates a LogQL rule.
  4. Loki Ruler sends the firing alert to Alertmanager.
  5. Alertmanager routes the notification.

Common Mistakes

Mistake 1: Putting All Alerts in Mimir

Do not move every alert to Mimir just because Mimir exists.

Keep local Prometheus alerts for things that must work during backend or network trouble:

  • remote-write failing
  • local node pressure
  • kubelet down
  • local scrape health
  • critical platform bootstrap checks

Mistake 2: Putting Every Label on Every Signal

Do not copy all Kubernetes metadata into all metrics and logs.

Label only what you query, route, or aggregate by. Everything else belongs in annotations, log body, exemplars, traces, or metadata systems.
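As an illustration of "label only what you use", here is a tiny Python allowlist filter. In a real deployment this filtering would normally be done with relabeling rules in the collector rather than application code. The ALLOWED set simply reuses the stable labels listed earlier, and the pod metadata is invented.

```python
ALLOWED = {"cluster", "namespace", "job", "service", "team", "env", "region", "tenant"}


def keep_stable_labels(labels: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted labels; everything else is dropped from the label set."""
    return {name: value for name, value in labels.items() if name in ALLOWED}


# Hypothetical metadata discovered for one pod.
discovered = {
    "cluster": "prod-us-east-1",
    "namespace": "default",
    "service": "checkout",
    "pod_template_hash": "7d9f8c6b5",
    "pod_ip": "10.1.2.3",
    "controller_revision": "42",
}

print(keep_stable_labels(discovered))
# {'cluster': 'prod-us-east-1', 'namespace': 'default', 'service': 'checkout'}
```

An allowlist is the safer default here: a newly introduced Kubernetes annotation cannot silently become a new label.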

Mistake 3: Treating Alertmanager as the Rule Engine

Alertmanager routes alerts. It does not run PromQL or LogQL.

Prometheus, Mimir Ruler, Loki Ruler, or Grafana Alerting evaluate the data. Alertmanager handles notification behavior.

Mistake 4: Using Loki for Metrics

Loki can produce metrics from logs, but it should not replace normal metrics instrumentation. If a value is needed continuously for dashboards, SLOs, or autoscaling, expose it as a metric and send it to Prometheus/Mimir.

Use Loki-derived metrics for signals that naturally originate in logs.

Operating Checklist

For Alloy:

  • Scrape Alloy's own health metrics.
  • Watch remote-write queue and WAL metrics.
  • Keep stable cluster, env, team, and service labels.
  • Use Fleet Management attributes intentionally.
  • Avoid duplicate scrapes from multiple Alloy instances unless clustering or sharding is configured.

For Mimir:

  • Set tenant limits.
  • Monitor ingester memory and WAL.
  • Monitor ruler evaluation duration and missed evaluations.
  • Keep recording rules from producing uncontrolled cardinality.
  • Route ruler alerts to the correct Alertmanager URL.

For Loki:

  • Keep labels low-cardinality.
  • Monitor ingestion rate, rejected samples, query latency, and ruler WAL health.
  • Use Loki Ruler for log-native alerts.
  • Remote-write Loki recording-rule metrics to Mimir only when those metrics are useful later.

For Alertmanager:

  • Design routes by team, severity, service, and env.
  • Use inhibition to suppress noisy symptom alerts during root-cause alerts.
  • Test silences and notification templates.
  • Run HA Alertmanager for production notification reliability.
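Inhibition is the least intuitive item in this list, so here is a conceptual Python sketch. It assumes equality-only matchers and plain label dictionaries. The matches and is_inhibited helpers are hypothetical and only mirror the source, target, and equal-labels idea of an inhibit rule.

```python
def matches(labels: dict[str, str], matchers: dict[str, str]) -> bool:
    """True if every matcher equals the corresponding label value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(
    alert: dict[str, str],
    firing: list[dict[str, str]],
    source: dict[str, str],
    target: dict[str, str],
    equal: list[str],
) -> bool:
    """A target alert is muted when a different firing source alert shares the 'equal' labels."""
    if not matches(alert, target):
        return False
    return any(
        other is not alert
        and matches(other, source)
        and all(other.get(name) == alert.get(name) for name in equal)
        for other in firing
    )


root = {"alertname": "NodeDown", "severity": "critical", "cluster": "prod-us-east-1"}
symptom = {"alertname": "CheckoutHighLatency", "severity": "warning", "cluster": "prod-us-east-1"}
elsewhere = {"alertname": "CheckoutHighLatency", "severity": "warning", "cluster": "prod-eu-west-1"}
firing = [root, symptom, elsewhere]

rule = dict(source={"severity": "critical"}, target={"severity": "warning"}, equal=["cluster"])
print([is_inhibited(a, firing, **rule) for a in firing])  # [False, True, False]
```

Only the warning in the same cluster as the critical alert is muted. The warning from the other cluster still notifies, which is what the equal labels are for.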

Final Architecture Rule

Use Alloy for collection and forwarding.

Use Loki for logs.

Use Prometheus for local scrape and local safety rules.

Use Mimir for centralized metrics and global rules.

Use Loki Ruler for log-native rules.

Use Alertmanager for notification routing.

Use Fleet Management when local collector config becomes a fleet-scale operations problem.
