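Monitoring Service Deployment Guide
Overview
This guide describes how to deploy a standalone Prometheus + Grafana monitoring stack on monitor-storage (192.168.1.162).
Architecture highlights:
- Standalone deployment with no dependency on the K8s cluster
- Services managed with Docker Compose
- Access over the EasyTier network
- Data persisted to local storage
1. Environment Preparation
1.1 System Requirements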
| Item | Specification |
|---|---|
| CPU | 4 cores |
| Memory | 4 GB |
| System disk | 32 GB |
| Data disk | 200 GB (/dev/sdb) |
| Operating system | Ubuntu 24.04 |
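The requirements in the table can be checked on the host before anything is installed. The following is a minimal sketch rather than part of the original procedure: the function names (`check_min`, `cpu_cores`, `mem_total_mb`) are hypothetical, and the thresholds are whatever the caller passes in.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper: compare a measured value against a minimum.
# Prints "<name>: OK (...)" or "<name>: LOW (...)" and returns non-zero when too low.
check_min() {
  local name="$1" actual="$2" required="$3"
  if [ "$actual" -ge "$required" ]; then
    echo "${name}: OK (${actual} >= ${required})"
  else
    echo "${name}: LOW (${actual} < ${required})"
    return 1
  fi
}

# Number of CPU cores available on this host (Linux).
cpu_cores() { nproc; }

# Total memory in MB as reported by the kernel (Linux).
mem_total_mb() { awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo; }
```

For example, `check_min cpu "$(cpu_cores)" 4` and `check_min memory_mb "$(mem_total_mb)" 3800` cover the first two rows. The memory floor of 3800 MB is an assumption: a machine provisioned with 4 GB usually reports slightly less than 4096 MB in /proc/meminfo.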
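1.2 Install Docker
```bash
# Update the system
sudo apt update && sudo apt upgrade -y
# Install Docker
curl -fsSL https://get.docker.com | sh
# Add the current user to the docker group
sudo usermod -aG docker "$USER"
# Start a shell with the new group active (or log out and back in)
newgrp docker
# Verify the installation
docker --version
docker compose version
```
2. Directory Layout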
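```bash
# Create the service directories
sudo mkdir -p /opt/{prometheus,grafana,nfs-exports}
sudo mkdir -p /opt/nfs-exports/{backups,configs,shared}
# Create the data directories
sudo mkdir -p /opt/prometheus/{data,conf}
sudo mkdir -p /opt/grafana/{data,dashboards,datasources}
# Set ownership
sudo chown -R "$USER":"$USER" /opt/prometheus /opt/grafana /opt/nfs-exports
# The official images run as non-root users, so the data directories must be
# writable by them: prom/prometheus runs as nobody (UID 65534), grafana/grafana as UID 472
sudo chown -R 65534:65534 /opt/prometheus/data
sudo chown -R 472:0 /opt/grafana/data
```
3. Deploy Prometheus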
3.1 Create the configuration file
```bash
cat << 'EOF' > /opt/prometheus/conf/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'home-lab'
    environment: 'monitoring'
alerting:
  alertmanagers:
    - static_configs:
        - targets: []
rule_files: []
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
  # Node Exporter - host monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.161:9100' # easytier-gateway
          - '192.168.1.162:9100' # monitor-storage
          - '192.168.1.163:9100' # gitea-artifact
          - '192.168.1.165:9100' # k8s-master
          - '192.168.1.166:9100' # k8s-worker-1
EOF
```
3.2 Create the Docker Compose file
```bash
cat << 'EOF' > /opt/prometheus/docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./conf/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring
networks:
  monitoring:
    name: monitoring
EOF
```
3.3 Start Prometheus
```bash
cd /opt/prometheus
# Pull the images
docker compose pull
# Start the services
docker compose up -d
# Check the status
docker compose ps
# Follow the logs (Ctrl+C to stop)
docker compose logs -f prometheus
```
3.4 Verification
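Open http://192.168.1.162:9090 in a browser.
On the Status → Targets page, confirm that every target is in the UP state. Targets on the other hosts stay DOWN until Node Exporter has been installed on them (section 5).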
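The same check can be scripted against the Prometheus HTTP API instead of the web UI. This is a minimal sketch that assumes `curl` is installed; the function names and the `PROM_URL` variable are illustrative, and the `grep`-based extraction relies on the compact JSON that Prometheus returns (a JSON-aware tool such as `jq` would be more robust).

```bash
#!/usr/bin/env bash
# Base URL of the Prometheus server; override via the environment if needed.
PROM_URL="${PROM_URL:-http://192.168.1.162:9090}"

# Extract the "instance" label values from an instant-query JSON response on stdin.
extract_instances() {
  grep -o '"instance":"[^"]*"' | cut -d'"' -f4
}

# Ask Prometheus which scrape targets currently report up == 0.
down_targets() {
  curl -fsS --get "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=up == 0' | extract_instances
}
```

Running `down_targets` prints one instance per line for every target whose `up` metric is 0, and prints nothing when all targets are healthy.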
4. Deploy Grafana
4.1 Create the data source configuration
```bash
cat << 'EOF' > /opt/grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    uid: prometheus
EOF
```
4.2 Create the dashboard provider configuration
```bash
cat << 'EOF' > /opt/grafana/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
EOF
```
4.3 Create the Docker Compose file
```bash
cat << 'EOF' > /opt/grafana/docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - ./data:/var/lib/grafana
      - ./dashboards:/etc/grafana/provisioning/dashboards
      - ./datasources:/etc/grafana/provisioning/datasources
    environment:
      # Example credentials; change the password after the first login
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://monitor.hoseahu.cn/
      - GF_SERVER_DOMAIN=monitor.hoseahu.cn
    networks:
      - monitoring
networks:
  monitoring:
    name: monitoring
    external: true
EOF
```
4.4 Start Grafana
```bash
cd /opt/grafana
# Pull the image
docker compose pull
# Start the service
docker compose up -d
# Check the status
docker compose ps
```
4.5 Verification
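- Open http://192.168.1.162:3000 in a browser
- Initial credentials: admin / admin123
- Change the password after the first login. Grafana only prompts for a change when the built-in default password (admin) is used, not for a password set through GF_SECURITY_ADMIN_PASSWORD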
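Grafana can also be verified from the command line through its `/api/health` endpoint, which needs no login. The sketch below assumes `curl` is installed; the function names and the `GRAFANA_URL` variable are illustrative.

```bash
#!/usr/bin/env bash
# Base URL of the Grafana server; override via the environment if needed.
GRAFANA_URL="${GRAFANA_URL:-http://192.168.1.162:3000}"

# Succeeds when a /api/health response on stdin reports "database": "ok".
health_is_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch the health endpoint and print a one-line verdict.
check_grafana() {
  if curl -fsS "${GRAFANA_URL}/api/health" | health_is_ok; then
    echo "grafana: healthy"
  else
    echo "grafana: unhealthy or unreachable"
    return 1
  fi
}
```

`check_grafana` returns a non-zero exit status on failure, so it can be used in a script or a cron-driven check.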
5. Configure Node Exporter
Install Node Exporter on every other server to be monitored. monitor-storage already runs it as a container (section 3.2), so installing it there as well would cause a conflict on port 9100.
```bash
# Run on every monitored node
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo cp node_exporter /usr/local/bin/
# Create the systemd service
cat << 'EOF' | sudo tee /etc/systemd/system/node-exporter.service
[Unit]
Description=Node Exporter
After=network-online.target

[Service]
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Start the service
sudo systemctl daemon-reload
sudo systemctl enable node-exporter
sudo systemctl start node-exporter
# Check the status
sudo systemctl status node-exporter
```
6. Import Official Dashboards
6.1 Node Exporter Full
Grafana Dashboard ID: 1860
Import steps:
- Log in to Grafana
- In the left-hand menu, open Dashboards, then choose New → Import
- Enter Dashboard ID 1860 and click Load
- Select the Prometheus data source
- Click Import
6.2 Commonly used dashboards
| Dashboard ID | Name | Purpose |
|---|---|---|
| 1860 | Node Exporter Full | Host monitoring |
| 3662 | Prometheus Stats | Prometheus status |
| 179 | Docker and Container Monitoring | Docker monitoring |
7. Configure Alerting
7.1 Create alert rules
```bash
cat << 'EOF' > /opt/prometheus/conf/alert_rules.yml
groups:
  - name: host_alerts
    interval: 30s
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      - alert: HostHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
EOF
```
7.2 Reload the configuration
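Prometheus only loads rule files that are listed under rule_files in prometheus.yml and are visible inside the container. The configuration in section 3.1 sets rule_files: [] and the Compose file in section 3.2 mounts only prometheus.yml, so add alert_rules.yml to both before reloading. The reload endpoint is available because Prometheus was started with --web.enable-lifecycle.
```bash
curl -X POST http://localhost:9090/-/reload
```
8. Backup Strategy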
8.1 Backup script
```bash
# /opt/scripts has not been created by the earlier steps; create it first
sudo mkdir -p /opt/scripts
cat << 'EOF' | sudo tee /opt/scripts/backup-monitoring.sh > /dev/null
#!/bin/bash
# Back up monitoring data
BACKUP_DIR="/exports/backups/monitoring"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Back up Prometheus data
tar czf "$BACKUP_DIR/prometheus_data_$DATE.tar.gz" -C /opt/prometheus data/
# Back up the Grafana data directory
tar czf "$BACKUP_DIR/grafana_config_$DATE.tar.gz" -C /opt/grafana data/
# Delete backups older than 7 days
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +7 -delete
echo "Backup completed: $DATE"
EOF
sudo chmod +x /opt/scripts/backup-monitoring.sh
```
8.2 Configure scheduled backups
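```bash
# Add a cron job that runs daily at 03:00. It goes into root's crontab because the
# script writes under /exports and reads container-owned data; an existing entry
# for the script is replaced rather than duplicated.
(sudo crontab -l 2>/dev/null | grep -vF backup-monitoring.sh; echo "0 3 * * * /opt/scripts/backup-monitoring.sh") | sudo crontab -
```
9. Troubleshooting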
Common diagnostic commands
```bash
# Prometheus
docker compose -f /opt/prometheus/docker-compose.yml ps
docker compose -f /opt/prometheus/docker-compose.yml logs prometheus
# Grafana
docker compose -f /opt/grafana/docker-compose.yml ps
docker compose -f /opt/grafana/docker-compose.yml logs grafana
# Health checks
curl http://localhost:9090/-/healthy
curl http://localhost:3000/api/health
curl http://localhost:9100/metrics
```
Common issues
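| Problem | Cause | Solution |
|---|---|---|
| Prometheus cannot reach its scrape targets | Network problem | Check the firewall and the container network |
| Grafana dashboards show no data | Data source misconfigured | Check the Prometheus data source configuration |
| Node Exporter returns no data | Service not running | Check `systemctl status node-exporter` |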
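The first pass through the table can be automated by probing the three health endpoints from the diagnostic commands. The following is a rough sketch, assuming `curl` is available and that it is run on monitor-storage; the function names and the wording of the hints are hypothetical.

```bash
#!/usr/bin/env bash
# Map an HTTP status code (as printed by curl's %{http_code}) to a hint.
# curl prints 000 when no HTTP response was received at all.
classify_http_code() {
  case "$1" in
    000) echo "unreachable: check that the service is running and the firewall allows the port" ;;
    2??) echo "ok" ;;
    *)   echo "responding with HTTP $1: check the service logs" ;;
  esac
}

# Probe one URL with a 5-second timeout and print "<name>: <hint>".
probe() {
  local name="$1" url="$2" code
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || true
  echo "${name}: $(classify_http_code "${code:-000}")"
}

# Probe Prometheus, Grafana, and the local Node Exporter.
probe_all() {
  probe prometheus    http://localhost:9090/-/healthy
  probe grafana       http://localhost:3000/api/health
  probe node-exporter http://localhost:9100/metrics
}
```

Calling `probe_all` prints one line per service, which narrows down which row of the table applies before digging into container logs.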
10. Access Configuration
Access over EasyTier
| Service | Address | Notes |
|---|---|---|
| Prometheus | http://192.168.1.162:9090 | prometheus.hoseahu.cn |
| Grafana | http://192.168.1.162:3000 | monitor.hoseahu.cn |
| Node Exporter | http://192.168.1.162:9100 | Local host metrics |