Grafana + Prometheus: The Complete Self-Hosted Monitoring Stack

[Figure: Grafana + Prometheus monitoring architecture]

Look, I'm going to be real with you. If you're not monitoring your infrastructure, you're basically flying blind. And when something breaks at 3am (it will), you'll wish you had set this up months ago.

The good news? Getting production-grade monitoring running takes about the same time as your morning coffee. Here's how to stop paying DataDog $200/month and run the same thing yourself for about $15 a month.

Why Prometheus + Grafana (And Why Together)

Prometheus collects metrics. Grafana visualizes them. They're the peanut butter and jelly of self-hosted monitoring, and they've been battle-tested by companies running infrastructure way bigger than yours or mine.

What you get:

  • Real-time system metrics (CPU, RAM, disk, network)
  • Custom application metrics (request latency, error rates, business KPIs)
  • Alerting that actually works (Slack, email, PagerDuty)
  • Time-series data storage with efficient compression
  • Beautiful dashboards that make you look like you know what you're doing

The best part? Both are open-source with no per-user fees or license costs. Your infrastructure bill stays predictable as you scale.

The Full Stack Setup

Here's the docker-compose.yml that gets everything running. No magic, no vendor lock-in, just containers and configuration:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
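      # Bound to the Docker bridge IP so it's reachable from the host but not exposed publicly
      # (same pattern for the other services); put a reverse proxy in front for external access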
      - "172.17.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: always

  grafana:
    image: grafana/grafana:latest
    ports:
      - "172.17.0.1:3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_SERVER_ROOT_URL=https://your-domain.com
    restart: always

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "172.17.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs' # so filesystem metrics reflect the host (via the / mount above), not the container
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: always

volumes:
  prometheus-data:
  grafana-data:

What's happening here:

  • Prometheus scrapes metrics every 15 seconds and stores them for 30 days
  • Grafana connects to Prometheus as a data source and visualizes everything
  • Node Exporter exposes system metrics (CPU, RAM, disk) that Prometheus collects

Prometheus Configuration

Create prometheus.yml in your project directory:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Add your app metrics here
  - job_name: 'my-app'
    static_configs:
      - targets: ['172.17.0.1:8080']

Prometheus will now scrape metrics from itself, the node exporter, and any app you point it at (assuming your app exposes metrics at /metrics).
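
If your app serves metrics on a different path, or you want to tag targets by environment, a scrape job takes a couple of extra keys. A sketch to append under scrape_configs; the job name, path, and target here are placeholders for your own app:

  - job_name: 'my-app'
    metrics_path: '/internal/metrics'   # only needed if you don't use the default /metrics
    scrape_interval: 10s                # per-job override of the global 15s
    static_configs:
      - targets: ['172.17.0.1:8080']
        labels:
          env: 'production'
          service: 'my-app'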

Setting Up Grafana Dashboards

Once your stack is running, open Grafana in your browser. The compose file binds port 3000 to the Docker bridge IP, so in practice that means the GF_SERVER_ROOT_URL you configured (https://your-domain.com behind a reverse proxy) or http://172.17.0.1:3000 from the host itself. Then:

Add Prometheus data source:

  • Settings → Data Sources → Add Prometheus
  • URL: http://prometheus:9090
  • Save & Test
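
If you'd rather not click through the UI (or you want the data source to survive a rebuild), Grafana can provision it from a file instead. A minimal sketch, assuming you mount it into the container at /etc/grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true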

Import pre-built dashboard:

  • Dashboards → Import → Enter ID 1860 (Node Exporter Full)
  • Select Prometheus data source
  • Click Import

You now have a production-ready dashboard showing CPU, RAM, disk, network, and system load. Took about 2 minutes.
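
If you'd rather keep dashboards in version control than import them by hand, Grafana can also load JSON dashboards from disk. A sketch of the provider config, assuming you mount it at /etc/grafana/provisioning/dashboards/default.yml and drop dashboard JSON files into /var/lib/grafana/dashboards (which already sits on the grafana-data volume):

apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    options:
      path: /var/lib/grafana/dashboards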

The Part Everyone Messes Up: Alerting

Metrics without alerts are just pretty graphs. Here's how to get a heads-up while there's still time to do something about it, instead of finding out when the server falls over:

Add to your prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert.rules.yml'
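
One gotcha: the docker-compose.yml earlier doesn't include an Alertmanager container, so the alertmanager:9093 target above has nothing to answer it yet. A minimal sketch of the missing service (add it under services: alongside the others):

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "172.17.0.1:9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: always

And an alertmanager.yml that routes everything to Slack; the webhook URL and channel are placeholders for your own:

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'instance']
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # your Slack incoming-webhook URL
        channel: '#alerts'
        send_resolved: true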

Create alert.rules.yml:

groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for 5 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        annotations:
          summary: "Disk space running low"
          description: "Less than 15% disk space remaining"

These alerts fire when CPU hits 80%, memory hits 85%, or disk space drops below 15%. Adjust thresholds based on your setup.
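
One more rule worth adding while you're in there: an alert for when a scrape target disappears entirely, because a dead exporter silences every other alert about that host. A sketch to append to the same group:

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        annotations:
          summary: "Target down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for 2 minutes"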

Cost Comparison: SaaS vs Self-Hosted

Let's be honest about the money:

| Expense | DataDog/New Relic | Self-Hosted (Elestio) |
| --- | --- | --- |
| License/Platform Fee | $100-200/month | $0 (open-source) |
| Infrastructure | Included | $15/month (2 CPU / 4GB RAM) |
| Per-Host Fees | $15-31/host | $0 |
| Total (5 hosts) | $175-355/month | $15/month |
| Annual Savings | - | $1,920-4,080 |

Running this on Elestio costs about $15/month for a 2 CPU / 4GB RAM instance. That handles monitoring for 10-20 servers easily. DataDog charges you per host, per metric, and per user. The bill gets stupid fast.

Troubleshooting Common Issues

Prometheus not scraping targets:

  • Check Status → Targets in Prometheus UI
  • Verify network connectivity from a container on the same Docker network: curl http://node-exporter:9100/metrics
  • Check firewall rules aren't blocking scrape ports

Grafana dashboard shows "No Data":

  • Verify Prometheus data source connection (Settings → Data Sources → Test)
  • Check Prometheus is successfully scraping: Status → Targets should show "UP"
  • Verify metric names in dashboard queries match your Prometheus metrics

Disk fills up quickly:

  • Prometheus defaults to 15 days of retention; the compose file above sets 30d via --storage.tsdb.retention.time, so lower that value if disk is tight
  • Configure metric relabeling to drop high-cardinality labels
  • Consider using remote storage for long-term retention
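
The last two points are both small prometheus.yml changes. remote_write is a top-level block; the endpoint below is a placeholder for whatever long-term store you run (Thanos, Mimir, and VictoriaMetrics all accept this protocol):

remote_write:
  - url: 'https://metrics-store.example.com/api/v1/write'   # placeholder endpoint

And dropping a noisy series before it's stored is a metric_relabel_configs block on the scrape job; the metric name here is just an example:

  - job_name: 'my-app'
    static_configs:
      - targets: ['172.17.0.1:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'   # example series to drop
        action: drop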

Alerts not firing:

  • Check Alertmanager is running and reachable
  • Verify alert rules syntax: promtool check rules alert.rules.yml
  • Check Prometheus logs for evaluation errors

Deploy on Elestio (The Easy Way)

If you don't want to manage this yourself:

  1. Create Grafana instance on Elestio
  2. Create Prometheus instance on Elestio
  3. Both come pre-configured with SSL, backups, and automatic updates
  4. Connect Grafana to Prometheus (Elestio provides internal URLs)
  5. Import dashboards and you're done

The whole thing takes maybe 10 minutes, and you get professional infrastructure without the ops overhead.

What You've Built

You now have:

  • Production-grade monitoring stack collecting system and application metrics
  • Real-time dashboards visualizing CPU, RAM, disk, network, and custom metrics
  • Proactive alerts that warn you before disasters happen
  • Complete control over your data (no vendor lock-in)
  • ~$2,000-4,000/year saved compared to SaaS alternatives

This setup scales from side projects to serious production workloads. Add more exporters for databases, web servers, or custom apps. Build dashboards for business metrics. Set up federated Prometheus for multi-datacenter monitoring.
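
Federation, for example, is just another scrape job on a central Prometheus that pulls series from the downstream ones. A sketch to add under scrape_configs on the central instance; the target hostname is a placeholder:

  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets: ['prometheus-dc2:9090']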

The infrastructure monitoring problem? Solved. Now go fix the bugs your dashboards are about to reveal.


Deploy Grafana on Elestio: Get Started
Deploy Prometheus on Elestio: Get Started

Thanks for reading ❤️