Prometheus & Grafana
Monitoring - Prometheus & Grafana
Table of Contents
- Introduction to Monitoring
- Prometheus
- Grafana
- Full Stack
- Practical Use Cases
- Best Practices
- Troubleshooting
- Resources
Introduction to Monitoring
Why monitor?
Monitoring is essential for:
- Detecting problems before they affect users
- Analyzing and optimizing performance
- Alerting the team when an incident occurs
- Making data-driven decisions
- Meeting SLAs (Service Level Agreements)
Types of monitoring
┌─────────────────────────────────────────────┐
│ Infrastructure Monitoring                   │
│ ├─ CPU, RAM, Disk, Network                  │
│ └─ System services                          │
├─────────────────────────────────────────────┤
│ Application Monitoring                      │
│ ├─ Business metrics (requests, errors)      │
│ └─ Performance (latency, throughput)        │
├─────────────────────────────────────────────┤
│ Log Monitoring                              │
│ ├─ Application logs                         │
│ └─ System logs                              │
└─────────────────────────────────────────────┘
The 4 Golden Signals (Google SRE)
- Latency: time taken to serve a request
- Traffic: demand placed on the system
- Errors: rate of failed requests
- Saturation: how full the system's resources are
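To make the four signals concrete, here is a minimal sketch that derives them from one window of request records. The `Request` type, the `golden_signals` function, and the CPU-based saturation figure are assumptions chosen for illustration; in practice these values come from PromQL queries rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request
    status: int        # HTTP status code


def golden_signals(requests, window_s, cpu_busy_s, cpu_capacity_s):
    """Summarize one observation window into the four golden signals."""
    durations = sorted(r.duration_s for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile: the value at rank ceil(0.95 * n).
    p95 = durations[(95 * n + 99) // 100 - 1] if n else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": p95,
        "traffic_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "saturation": cpu_busy_s / cpu_capacity_s,
    }


window = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 500), Request(0.4, 200)]
print(golden_signals(window, window_s=2, cpu_busy_s=30, cpu_capacity_s=60))
```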
Prometheus + Grafana architecture
┌──────────────┐    scrape     ┌──────────────┐
│   Targets    │ ◄──────────── │  Prometheus  │
│ (Exporters)  │               │    Server    │
└──────────────┘               └──────┬───────┘
                                      │
                                      │ query
                                      ▼
                               ┌──────────────┐
                               │   Grafana    │
                               │  Dashboard   │
                               └──────────────┘
Prometheus
What is Prometheus?
Prometheus is an open-source monitoring and alerting system originally built at SoundCloud.
Key features:
- Data model based on time series
- Powerful query language: PromQL
- Pull-based architecture (scraping)
- Automatic service discovery
- No dependency on distributed storage
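In the pull model, each scrape fetches a plain-text page of samples from the target. The sketch below parses a simplified subset of that text format into `(name, labels, value)` tuples. The function name `parse_exposition` is hypothetical, and the parser deliberately ignores timestamps and escaped characters in label values, so treat it as an illustration rather than a complete implementation.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a simplified subset of the text exposition format into samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:                     # unsupported syntax: skip the line
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


page = '''# HELP http_requests_total Total HTTP Requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
up 1
'''
for sample in parse_exposition(page):
    print(sample)
```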
Installing Prometheus
With Docker
# Start Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Open the web UI at:
# http://localhost:9090
With Docker Compose
docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  prometheus_data:
Native installation (Ubuntu/Debian)
# Create a system user
sudo useradd --no-create-home --shell /bin/false prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
# Extract the archive
tar -xvf prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64
# Copy the binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
# Create the directories
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
# Copy the configuration files
sudo cp -r consoles /etc/prometheus
sudo cp -r console_libraries /etc/prometheus
sudo cp prometheus.yml /etc/prometheus/
# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
# Create the systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
# Start Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus
Configuring Prometheus
prometheus.yml (basic configuration)
global:
  scrape_interval: 15s      # How often targets are scraped
  evaluation_interval: 15s  # How often rules are evaluated
  external_labels:
    cluster: 'production'
    region: 'eu-west-1'
# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
# Alerting and recording rules
rule_files:
  - 'alerts/*.yml'
  - 'rules/*.yml'
# Targets to scrape
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          env: 'production'
          role: 'web'
  # Custom application
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s
Advanced configuration with service discovery
scrape_configs:
  # Kubernetes automatic discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Replace the port in the address with the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
  # Docker discovery
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_monitoring]
        action: keep
        regex: true
  # Consul discovery
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []
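The address-rewriting relabel rule in the Kubernetes job is the least obvious part of this configuration. The sketch below imitates a single `replace` step in Python, assuming the usual semantics: source label values are joined with `;`, the regex must match the whole joined string, and `$1`-style references are expanded in the replacement. The function `relabel_replace` is a hypothetical helper, not part of Prometheus.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one 'replace' relabel step and return the updated label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)              # no match: labels are left unchanged
    # Translate Prometheus-style $1 references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
rewritten = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])
```

When the port annotation is missing, the joined string ends with `;` and the regex does not match, so the discovered address is kept as it was.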
PromQL - Query language
Basic syntax
# Plain metric
http_requests_total
# With labels
http_requests_total{job="api", status="200"}
# Comparison operators
http_requests_total > 100
http_requests_total{status="500"} > 0
# Range vector (last 5 minutes)
http_requests_total[5m]
# Rate (requests per second)
rate(http_requests_total[5m])
# Sum
sum(http_requests_total)
# Sum by label
sum by (status) (http_requests_total)
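The idea behind `rate()` can be sketched in a few lines of Python. This is a simplification under stated assumptions: it divides the counter increase by the time between the first and last sample and does not reproduce the window-boundary extrapolation that Prometheus applies. The name `simple_rate` is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return None                      # not enough points in the window
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (process restart): count from 0.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Scrapes every 15 s; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

Reset handling is the reason to apply `rate()` to counters instead of subtracting raw values: a restart would otherwise show up as a large negative jump.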
Common functions
# RATE - Per-second rate of change
rate(http_requests_total[5m])
# INCREASE - Increase over a period
increase(http_requests_total[1h])
# AVG - Average (apply rate() first, since this metric is a counter)
avg(rate(node_cpu_seconds_total[5m]))
# MAX/MIN
max(node_memory_MemAvailable_bytes)
min(node_memory_MemAvailable_bytes)
# HISTOGRAM - Percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# PREDICT_LINEAR - Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
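`predict_linear` fits a straight line through the samples in the range and extends it into the future. The sketch below shows the same idea with an ordinary least-squares fit; `predict_linear_sketch` and its `now` parameter are assumptions made for illustration and do not claim to match Prometheus' implementation detail for detail.

```python
def predict_linear_sketch(samples, seconds_ahead, now):
    """Fit a least-squares line to (timestamp, value) samples and
    evaluate it at now + seconds_ahead."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + seconds_ahead)


# Free space drops by 1 GB every 10 minutes over the last hour.
history = [(t, 10e9 - 1e9 * t / 600) for t in range(0, 3601, 600)]
projected = predict_linear_sketch(history, seconds_ahead=4 * 3600, now=3600)
print(projected < 0)  # a negative projection means the disk fills up within 4 hours
```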
Practical examples
# CPU usage as a percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Available memory as a percentage
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# HTTP requests per second
sum(rate(http_requests_total[5m])) by (status)
# HTTP error rate (%)
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Top 5 slowest endpoints (average latency = sum / count, since the metric is a histogram)
topk(5, sum by (endpoint) (rate(http_request_duration_seconds_sum[5m])) / sum by (endpoint) (rate(http_request_duration_seconds_count[5m])))
# Disk saturation prediction
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
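The P95 query above relies on how `histogram_quantile` estimates a value from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. A minimal sketch of that estimate follows, assuming `0 < q <= 1` and buckets sorted by upper bound with a final `+Inf` bucket; `estimate_quantile` is a hypothetical name.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound       # rank falls in +Inf: report the last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# 50 requests under 0.1 s, 90 under 0.5 s, 100 under 1 s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(estimate_quantile(0.95, buckets))
```

Because of the interpolation, the result is only as precise as the bucket boundaries, which is why bucket layout matters for latency alerts.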
Exporters
Exporters expose metrics in a format that Prometheus can scrape.
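As a rough illustration of what an exporter returns on its `/metrics` endpoint, the sketch below renders a small set of metrics as exposition text. The `render_metrics` function and the `queue_depth` metric are hypothetical, and label values are not escaped, so real exporters should rely on an official client library instead.

```python
def render_metrics(metrics):
    """Render {name: (help_text, metric_type, [(labels, value), ...])}
    as exposition text."""
    lines = []
    for name, (help_text, metric_type, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            series = f"{name}{{{label_text}}}" if label_text else name
            lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


print(render_metrics({
    "queue_depth": ("Jobs waiting in the queue", "gauge", [({"queue": "mail"}, 3)]),
}))
```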
Node Exporter (system metrics)
# Docker
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host
# Native installation
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create the system user referenced by the service below
sudo useradd --no-create-home --shell /bin/false node_exporter
# systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
Node Exporter metrics:
node_cpu_seconds_total            # CPU
node_memory_MemAvailable_bytes    # RAM
node_disk_io_time_seconds_total   # Disk I/O
node_network_receive_bytes_total  # Network
node_filesystem_avail_bytes       # Filesystem
Popular exporters
PostgreSQL Exporter
docker run -d \
  --name postgres-exporter \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://user:password@postgres:5432/dbname?sslmode=disable" \
  prometheuscommunity/postgres-exporter
MySQL Exporter
# Note: recent mysqld_exporter releases (0.15 and later) may no longer read
# DATA_SOURCE_NAME; check the exporter's README for the current connection options.
docker run -d \
  --name mysql-exporter \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="user:password@(mysql:3306)/" \
  prom/mysqld-exporter
Redis Exporter
docker run -d \
  --name redis-exporter \
  -p 9121:9121 \
  oliver006/redis_exporter \
  --redis.addr=redis://redis:6379
Nginx Exporter
docker run -d \
  --name nginx-exporter \
  -p 9113:9113 \
  nginx/nginx-prometheus-exporter:latest \
  --nginx.scrape-uri=http://nginx:8080/stub_status
Blackbox Exporter (external monitoring)
docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  -v $(pwd)/blackbox.yml:/config/blackbox.yml \
  prom/blackbox-exporter \
  --config.file=/config/blackbox.yml
blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      preferred_ip_protocol: "ip4"
  tcp_connect:
    prober: tcp
    timeout: 5s
Instrumenting your application
Python (Flask)
import time

from flask import Flask, Response, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP Requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP Request Latency',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    # Remember when the request started
    request.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(latency)
    return response

@app.route('/metrics')
def metrics():
    # Expose all registered metrics in the Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Node.js (Express)
const express = require('express');
const client = require('prom-client');

const app = express();

// Create a registry
const register = new client.Registry();

// Default metrics
client.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register]
});
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route'],
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestCounter.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    });
    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || req.path
    }, duration);
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
Go
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
AlertManager
AlertManager handles the alerts sent by Prometheus: it groups them, deduplicates them, and routes them to receivers.
Installation
# Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager
Configuring AlertManager
alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
# Custom templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Alert routing
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    # Warning alerts
    - match:
        severity: warning
      receiver: 'slack'
    # Per-team alerts
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: backend
      receiver: 'backend-team'
# Inhibition (avoid redundant alerts)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
# Receivers (alert destinations)
receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'frontend-team'
    slack_configs:
      - channel: '#frontend-alerts'
        send_resolved: true
  - name: 'backend-team'
    slack_configs:
      - channel: '#backend-alerts'
        send_resolved: true
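The order of the `routes` entries and the `continue` flag decide which receivers an alert reaches. The sketch below models that behaviour for a flat list of routes, assuming the usual rule that the first matching route wins unless it sets `continue: true`, and that the top-level receiver is used only when nothing matches. `route_alert` and `ROUTES` are hypothetical names mirroring the configuration above.

```python
def route_alert(labels, routes, default_receiver):
    """Return the receivers that a label set reaches in a flat routing tree."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break                    # first match wins unless continue is set
    return receivers or [default_receiver]


ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack"},
    {"match": {"team": "frontend"}, "receiver": "frontend-team"},
    {"match": {"team": "backend"}, "receiver": "backend-team"},
]

print(route_alert({"severity": "critical", "team": "backend"}, ROUTES, "default"))
print(route_alert({"severity": "warning", "team": "backend"}, ROUTES, "default"))
```

Under these assumptions a warning alert labelled `team: backend` stops at the `slack` route and never reaches `backend-team`, which is worth keeping in mind when ordering routes.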
Alerting rules
alerts/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been at {{ $value }}% for more than 5 minutes"
      # Low memory
      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"
          description: "Only {{ $value }}% of memory is available"
      # Disk almost full
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) * 100 < 10
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} has only {{ $value }}% free space"
      # Disk saturation prediction
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk will be full within 4 hours on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} will be full in about 4 hours"
alerts/app_alerts.yml
groups:
  - name: application_alerts
    interval: 15s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate"
          description: "The 5xx error rate is {{ $value }}%"
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }}s"
      # Service down
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Service {{ $labels.job }} is DOWN"
          description: "Service {{ $labels.job }} on {{ $labels.instance }} is unreachable"
      # Too many requests
      - alert: HighTraffic
        expr: sum(rate(http_requests_total[5m])) > 1000
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High traffic detected"
          description: "{{ $value }} requests per second"
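Each rule above uses `for:` to require that the condition holds for a while before the alert fires. The sketch below walks through that state machine, assuming the common behaviour that a single false evaluation resets the timer; `alert_states` is a hypothetical helper.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_true) evaluations to
    inactive / pending / firing states."""
    states = []
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None          # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# One evaluation per minute with a 2-minute "for" clause.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
print(alert_states(timeline, for_seconds=120))
```

The brief dip at t=120 sends the alert back to inactive, so it only fires at t=300. This is how `for:` filters out short spikes.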
Grafana
What is Grafana?
Grafana is an open-source platform for visualizing and analyzing metrics.
Key features:
- Interactive, customizable dashboards
- Support for multiple data sources
- Built-in alerting system
- Team and permission management
- Templating and variables
Installing Grafana
With Docker
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss
# Open the web UI
# URL: http://localhost:3000
# User: admin
# Password: admin
With Docker Compose (with Prometheus)
docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring
  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123  # Change this outside of local testing
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    restart: unless-stopped
    networks:
      - monitoring
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    networks:
      - monitoring
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
Native installation (Ubuntu/Debian)
# Add the Grafana repository (apt-key is deprecated, so the key goes in a keyring file)
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Install
sudo apt-get update
sudo apt-get install grafana
# Start
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
# Open the web UI
# URL: http://localhost:3000
# User: admin
# Password: admin
Configuring Grafana
Automatic provisioning
grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
Advanced configuration
grafana.ini
[server]
protocol = http
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/

[database]
type = postgres
host = postgres:5432
name = grafana
user = grafana
password = password

[auth]
disable_login_form = false

[auth.anonymous]
enabled = false

[smtp]
enabled = true
host = smtp.gmail.com:587
user = [email protected]
password = password
from_address = [email protected]
from_name = Grafana

# Legacy alerting settings; recent Grafana versions use [unified_alerting] instead
[alerting]
enabled = true
execute_alerts = true
Dashboards
Creating a dashboard
Via the web interface:
- Click "+" → "Dashboard"
- "Add new panel"
- Write the PromQL query
- Select the visualization type
- Save
Via JSON:
dashboard_node_exporter.json
{
  "dashboard": {
    "title": "Node Exporter Full",
    "tags": ["prometheus", "node-exporter"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yaxes": [
          {
            "format": "percent",
            "min": 0,
            "max": 100
          }
        ]
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 70, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      }
    ]
  }
}
Recommended dashboards
Community dashboard IDs:
# Node Exporter Full
Dashboard ID: 1860
# Docker Monitoring
Dashboard ID: 893
# Kubernetes Cluster Monitoring
Dashboard ID: 7249
# Nginx Metrics
Dashboard ID: 12708
# PostgreSQL Database
Dashboard ID: 9628
# MySQL Overview
Dashboard ID: 7362
Importing a dashboard:
- Go to "+" → "Import"
- Enter the dashboard ID
- Select the Prometheus data source
- "Import"
Dashboard variables
Creating dynamic variables:
Name: instance
Type: Query
Query: label_values(node_uname_info, instance)
Refresh: On Dashboard Load

Name: job
Type: Query
Query: label_values(up, job)
Using them in queries:
up{instance="$instance", job="$job"}
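Variable substitution is plain text replacement performed before the query is sent to Prometheus. Python's `string.Template` happens to use the same `$name` syntax, so it can illustrate the single-value case. The `interpolate` helper is hypothetical, and Grafana's real interpolation also handles multi-value variables and formatting options that this sketch does not model.

```python
from string import Template


def interpolate(query, variables):
    """Substitute $name placeholders in a query string."""
    return Template(query).substitute(variables)


print(interpolate(
    'up{instance="$instance", job="$job"}',
    {"instance": "node-exporter:9100", "job": "node"},
))
```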
Grafana alerts
Creating an alert
Via the interface:
- Edit a panel
- Go to the "Alert" tab
- "Create Alert"
- Define the conditions
Example alert:
Name: High CPU Alert
Condition:
  WHEN avg() OF query(A, 5m, now)
  IS ABOVE 80
Notifications:
  - Slack Channel
  - Email Team
Configuring notification channels
Slack
{
  "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
  "recipient": "#alerts",
  "username": "Grafana",
  "icon_emoji": ":grafana:",
  "mentionChannel": "here"
}
Email
Addresses: [email protected], [email protected]
Webhook
URL: https://api.example.com/webhook
HTTP Method: POST
Full monitoring stack
Complete Docker Compose file
docker-compose.monitoring.yml
version: '3.8'
services:
  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    networks:
      - monitoring
  # Grafana
  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin123}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - monitoring
  # Node Exporter
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    restart: unless-stopped
    networks:
      - monitoring
  # cAdvisor - Container metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    restart: unless-stopped
    networks:
      - monitoring
  # AlertManager
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    networks:
      - monitoring
  # Blackbox Exporter
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox/blackbox.yml
    command:
      - '--config.file=/etc/blackbox/blackbox.yml'
    restart: unless-stopped
    networks:
      - monitoring
  # PostgreSQL Exporter (if you run PostgreSQL)
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    ports:
      - "9187:9187"
    environment:
      - DATA_SOURCE_NAME=postgresql://user:password@postgres:5432/dbname?sslmode=disable
    restart: unless-stopped
    networks:
      - monitoring
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
Démarrage
# Lancer la stack complĂšte
docker-compose -f docker-compose.monitoring.yml up -d
# Vérifier les services
docker-compose -f docker-compose.monitoring.yml ps
# Voir les logs
docker-compose -f docker-compose.monitoring.yml logs -f
# ArrĂȘter la stack
docker-compose -f docker-compose.monitoring.yml down
Practical use cases
Monitoring a web application
# Requests per second
sum(rate(http_requests_total[5m]))
# Error rate (%)
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# P50, P95, P99 latency
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Slowest endpoints (average latency = sum / count, since the metric is a histogram)
topk(10, sum by (endpoint) (rate(http_request_duration_seconds_sum[5m])) / sum by (endpoint) (rate(http_request_duration_seconds_count[5m])))
Monitoring databases
# PostgreSQL - Active connections
pg_stat_activity_count
# PostgreSQL - Database size
pg_database_size_bytes
# PostgreSQL - Slow queries
rate(pg_stat_statements_mean_time_seconds[5m]) > 1
# MySQL - Queries per second
rate(mysql_global_status_queries[5m])
# MySQL - Connected threads
mysql_global_status_threads_connected
Monitoring Kubernetes
# Pods not in the Running phase
count(kube_pod_status_phase{phase!="Running"})
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per namespace
sum(container_memory_usage_bytes) by (namespace)
# Frequent restarts
rate(kube_pod_container_status_restarts_total[1h]) > 0
Monitoring the network
# Inbound bandwidth
rate(node_network_receive_bytes_total[5m])
# Outbound bandwidth
rate(node_network_transmit_bytes_total[5m])
# Dropped packets
rate(node_network_receive_drop_total[5m])
# TCP connections
node_netstat_Tcp_CurrEstab
Best practices
1. Metric naming
# Convention: type_component_unit
http_requests_total            # Good
http_request_duration_seconds  # Good
node_memory_usage_bytes        # Good
requests                       # Bad: too vague
http_req                       # Bad: abbreviation
2. Labels
# Good practice
http_requests_total{method="GET", status="200", endpoint="/api/users"}
# Avoid
http_requests_total{user_id="12345"}   # Bad: cardinality too high
http_requests_total{timestamp="..."}   # Bad: timestamps are already handled
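The cardinality warning is easy to quantify: in the worst case, one metric produces a time series for every combination of label values. The sketch below computes that upper bound; the function name and the example counts are assumptions chosen for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on the time series one metric can produce:
    the product of the distinct values of each label."""
    return prod(label_cardinalities.values())


bounded = {"method": 5, "status": 10, "endpoint": 50}
print(series_count(bounded))                          # bounded label sets stay small
print(series_count({**bounded, "user_id": 100_000}))  # one unbounded label multiplies everything
```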
3. Data retention
# Retention is set with Prometheus command-line flags, not under global: in prometheus.yml
# Keep 30 days of data
--storage.tsdb.retention.time=30d
# Size limit
--storage.tsdb.retention.size=50GB
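To choose retention values, it helps to estimate the disk space they imply. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The default of 2 bytes per sample and the example workload are assumptions; measure your own ingestion rate before relying on the result.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention time x ingestion rate x bytes per sample."""
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 150,000 active series scraped every 15 seconds.
samples_per_second = 150_000 / 15
estimate = needed_disk_bytes(30, samples_per_second)
print(f"{estimate / 1e9:.1f} GB for 30 days")
```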
4. Recording rules
For expensive queries, precompute the results:
groups:
  - name: cpu_recording_rules
    interval: 30s
    rules:
      - record: instance:node_cpu_utilization:rate5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
5. Dashboard organization
/dashboards
├── infrastructure/
│   ├── overview.json
│   ├── nodes.json
│   └── network.json
├── applications/
│   ├── api.json
│   ├── web.json
│   └── workers.json
└── databases/
    ├── postgresql.json
    └── redis.json
6. Effective alerts
# Good
- alert: HighErrorRate
  expr: rate(http_requests_total{status="500"}[5m]) > 0.05
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.instance }}"
    description: "Error rate is {{ $value }} req/s"
# Avoid (alert is too sensitive)
- alert: AnyError
  expr: http_requests_total{status="500"} > 0
  for: 1s
7. Security
# Enable authentication
# grafana.ini
[auth.basic]
enabled = true
# Enable HTTPS
[server]
protocol = https
cert_file = /etc/ssl/grafana.crt
cert_key = /etc/ssl/grafana.key
# Prometheus: basic auth when scraping a protected target
# prometheus.yml (inside a scrape_configs entry)
scrape_configs:
  - job_name: 'my-app'
    basic_auth:
      username: admin
      password: secret
Troubleshooting
Prometheus is not scraping its targets
# Check the configuration
promtool check config prometheus.yml
# Check the targets
curl http://localhost:9090/api/v1/targets
# Prometheus logs
docker logs prometheus
# Test scraping manually
curl http://target:9100/metrics
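The JSON returned by `/api/v1/targets` is verbose, so a short script can pull out just the failing targets. The sketch below works on an already-decoded response and uses only the `activeTargets` fields `labels`, `scrapeUrl`, `health`, and `lastError`. The sample payload is a trimmed, made-up example and `unhealthy_targets` is a hypothetical helper.

```python
import json

SAMPLE_RESPONSE = """
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "node", "instance": "node-exporter:9100"},
   "scrapeUrl": "http://node-exporter:9100/metrics",
   "health": "up", "lastError": ""},
  {"labels": {"job": "my-app", "instance": "app:8080"},
   "scrapeUrl": "http://app:8080/metrics",
   "health": "down", "lastError": "connection refused"}
]}}
"""


def unhealthy_targets(payload):
    """List (job, scrape_url, last_error) for targets whose health is not 'up'."""
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]


for job, url, error in unhealthy_targets(json.loads(SAMPLE_RESPONSE)):
    print(f"{job}: {url} -> {error}")
```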
Grafana cannot connect to Prometheus
# Check connectivity (assumes curl is available in the Grafana container)
docker exec grafana curl "http://prometheus:9090/api/v1/query?query=up"
# Check the logs
docker logs grafana
# Test the data source
# Settings → Data Sources → Prometheus → Save & Test
Missing metrics
# List all available metrics
{__name__=~".+"}
# Check whether a metric exists
count({__name__="http_requests_total"})
# Inspect the labels of a metric
http_requests_total
Alerts are not firing
# Check the rules
promtool check rules alerts/*.yml
# List active alerts
curl http://localhost:9090/api/v1/alerts
# Test an alert
promtool test rules test.yml
Degraded performance
# Check memory usage
docker stats prometheus
# Check the number of time series
curl http://localhost:9090/api/v1/status/tsdb
# Reduce retention
--storage.tsdb.retention.time=15d
# Increase resources
docker run -m 4g --cpus=2 prom/prometheus
Resources
Official documentation
- Prometheus Docs
- Grafana Docs
- PromQL Basics
Tools
- PromLens - Query builder
- Prometheus Playground
- Grafana Play
Community dashboards
- Grafana Dashboards
- Awesome Prometheus
Exporters
- Prometheus Exporters