Grafana + Prometheus: The Complete Self-Hosted Monitoring Stack

[Figure: Grafana + Prometheus monitoring architecture]

Look, I'm going to be real with you. If you're not monitoring your infrastructure, you're basically flying blind. And when something breaks at 3am (it will), you'll wish you had set this up months ago.

The good news? Getting production-grade monitoring running takes about the same time as your morning coffee. Here's how to stop paying DataDog $200/month and run the same thing yourself for about $15 a month.

Why Prometheus + Grafana (And Why Together)

Prometheus collects metrics. Grafana visualizes them. They're the peanut butter and jelly of self-hosted monitoring, and they've been battle-tested by companies running infrastructure way bigger than yours or mine.

What you get:

  • Real-time system metrics (CPU, RAM, disk, network)
  • Custom application metrics (request latency, error rates, business KPIs)
  • Alerting that actually works (Slack, email, PagerDuty)
  • Time-series data storage with efficient compression
  • Beautiful dashboards that make you look like you know what you're doing

The best part? Both are open-source with no per-user fees or license costs. Your infrastructure bill stays predictable as you scale.

The Full Stack Setup

Here's the docker-compose.yml that gets everything running. No magic, no vendor lock-in, just containers and configuration:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
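      # Bound to the Docker bridge IP so it's reachable from the host but not exposed publicly
      # (same pattern for the other services); put a reverse proxy in front for external access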
      - "172.17.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: always

  grafana:
    image: grafana/grafana:latest
    ports:
      - "172.17.0.1:3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_SERVER_ROOT_URL=https://your-domain.com
    restart: always

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "172.17.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs' # so filesystem metrics reflect the host (via the / mount above), not the container
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: always

volumes:
  prometheus-data:
  grafana-data:

What's happening here:

  • Prometheus scrapes metrics every 15 seconds and stores them for 30 days
  • Grafana connects to Prometheus as a data source and visualizes everything
  • Node Exporter exposes system metrics (CPU, RAM, disk) that Prometheus collects

Prometheus Configuration

Create prometheus.yml in your project directory:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Add your app metrics here
  - job_name: 'my-app'
    static_configs:
      - targets: ['172.17.0.1:8080']

Prometheus will now scrape metrics from itself, the node exporter, and any app you point it at (assuming your app exposes metrics at /metrics).
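
If your app serves metrics on a different path, or you want to tag targets by environment, a scrape job takes a couple of extra keys. A sketch to append under scrape_configs; the job name, path, and target here are placeholders for your own app:

  - job_name: 'my-app'
    metrics_path: '/internal/metrics'   # only needed if you don't use the default /metrics
    scrape_interval: 10s                # per-job override of the global 15s
    static_configs:
      - targets: ['172.17.0.1:8080']
        labels:
          env: 'production'
          service: 'my-app'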

Setting Up Grafana Dashboards

Once your stack is running, open Grafana in your browser. The compose file binds port 3000 to the Docker bridge IP, so in practice that means the GF_SERVER_ROOT_URL you configured (https://your-domain.com behind a reverse proxy) or http://172.17.0.1:3000 from the host itself. Then:

Add Prometheus data source:

  • Settings → Data Sources → Add Prometheus
  • URL: http://prometheus:9090
  • Save & Test
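
If you'd rather not click through the UI (or you want the data source to survive a rebuild), Grafana can provision it from a file instead. A minimal sketch, assuming you mount it into the container at /etc/grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true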

Import pre-built dashboard:

  • Dashboards → Import → Enter ID 1860 (Node Exporter Full)
  • Select Prometheus data source
  • Click Import

You now have a production-ready dashboard showing CPU, RAM, disk, network, and system load. Took about 2 minutes.
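
If you'd rather keep dashboards in version control than import them by hand, Grafana can also load JSON dashboards from disk. A sketch of the provider config, assuming you mount it at /etc/grafana/provisioning/dashboards/default.yml and drop dashboard JSON files into /var/lib/grafana/dashboards (which already sits on the grafana-data volume):

apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    options:
      path: /var/lib/grafana/dashboards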

The Part Everyone Messes Up: Alerting

Metrics without alerts are just pretty graphs. Here's how to get a heads-up while there's still time to do something about it, instead of finding out when the server falls over:

Add to your prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert.rules.yml'
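
One gotcha: the docker-compose.yml earlier doesn't include an Alertmanager container, so the alertmanager:9093 target above has nothing to answer it yet. A minimal sketch of the missing service (add it under services: alongside the others):

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "172.17.0.1:9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: always

And an alertmanager.yml that routes everything to Slack; the webhook URL and channel are placeholders for your own:

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'instance']
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # your Slack incoming-webhook URL
        channel: '#alerts'
        send_resolved: true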

Create alert.rules.yml:

groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for 5 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        annotations:
          summary: "Disk space running low"
          description: "Less than 15% disk space remaining"

These alerts fire when CPU hits 80%, memory hits 85%, or disk space drops below 15%. Adjust thresholds based on your setup.
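
One more rule worth adding while you're in there: an alert for when a scrape target disappears entirely, because a dead exporter silences every other alert about that host. A sketch to append to the same group:

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        annotations:
          summary: "Target down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for 2 minutes"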

Cost Comparison: SaaS vs Self-Hosted

Let's be honest about the money:

| Expense | DataDog/New Relic | Self-Hosted (Elestio) |
| --- | --- | --- |
| License/Platform Fee | $100-200/month | $0 (open-source) |
| Infrastructure | Included | $15/month (2 CPU / 4GB RAM) |
| Per-Host Fees | $15-31/host | $0 |
| Total (5 hosts) | $175-355/month | $15/month |
| Annual Savings | - | $1,920-4,080 |

Running this on Elestio costs about $15/month for a 2 CPU / 4GB RAM instance. That handles monitoring for 10-20 servers easily. DataDog charges you per host, per metric, and per user. The bill gets stupid fast.

Troubleshooting Common Issues

Prometheus not scraping targets:

  • Check Status → Targets in Prometheus UI
  • Verify network connectivity from a container on the same Docker network: curl http://node-exporter:9100/metrics
  • Check firewall rules aren't blocking scrape ports

Grafana dashboard shows "No Data":

  • Verify Prometheus data source connection (Settings → Data Sources → Test)
  • Check Prometheus is successfully scraping: Status → Targets should show "UP"
  • Verify metric names in dashboard queries match your Prometheus metrics

Disk fills up quickly:

  • Prometheus defaults to 15 days of retention; the compose file above sets 30d via --storage.tsdb.retention.time, so lower that value if disk is tight
  • Configure metric relabeling to drop high-cardinality labels
  • Consider using remote storage for long-term retention
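
The last two points are both small prometheus.yml changes. remote_write is a top-level block; the endpoint below is a placeholder for whatever long-term store you run (Thanos, Mimir, and VictoriaMetrics all accept this protocol):

remote_write:
  - url: 'https://metrics-store.example.com/api/v1/write'   # placeholder endpoint

And dropping a noisy series before it's stored is a metric_relabel_configs block on the scrape job; the metric name here is just an example:

  - job_name: 'my-app'
    static_configs:
      - targets: ['172.17.0.1:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'   # example series to drop
        action: drop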

Alerts not firing:

  • Check Alertmanager is running and reachable
  • Verify alert rules syntax: promtool check rules alert.rules.yml
  • Check Prometheus logs for evaluation errors

Deploy on Elestio (The Easy Way)

If you don't want to manage this yourself:

  1. Create Grafana instance on Elestio
  2. Create Prometheus instance on Elestio
  3. Both come pre-configured with SSL, backups, and automatic updates
  4. Connect Grafana to Prometheus (Elestio provides internal URLs)
  5. Import dashboards and you're done

The whole thing takes maybe 10 minutes, and you get professional infrastructure without the ops overhead.

What You've Built

You now have:

  • Production-grade monitoring stack collecting system and application metrics
  • Real-time dashboards visualizing CPU, RAM, disk, network, and custom metrics
  • Proactive alerts that warn you before disasters happen
  • Complete control over your data (no vendor lock-in)
  • ~$2,000-4,000/year saved compared to SaaS alternatives

This setup scales from side projects to serious production workloads. Add more exporters for databases, web servers, or custom apps. Build dashboards for business metrics. Set up federated Prometheus for multi-datacenter monitoring.
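
Federation, for example, is just another scrape job on a central Prometheus that pulls series from the downstream ones. A sketch to add under scrape_configs on the central instance; the target hostname is a placeholder:

  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets: ['prometheus-dc2:9090']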

The infrastructure monitoring problem? Solved. Now go fix the bugs your dashboards are about to reveal.


Deploy Grafana on Elestio: Get Started
Deploy Prometheus on Elestio: Get Started

Thanks for reading ❤️