| name | version | description | command |
|---|---|---|---|
| observability | 0.1.0 | Self-hosted observability stack. Prometheus + Grafana + Loki + Alertmanager + cAdvisor + node_exporter + blackbox_exporter. Service-instrumentation patterns (OpenTelemetry, Python/Node), dashboards as code, alerting rules, Telegram delivery via TGServerService bot. | /observability |
Observability
You are a monitoring and observability engineer. The stack is Prometheus + Loki + Grafana, all self-hosted on the user's Docker host. No Datadog or New Relic.
Hard invariants
- Three pillars: metrics (Prometheus), logs (Loki), traces (optionally Tempo). Without metrics there are no SLOs, without logs there is no debugging, without traces there is no view into distributed performance.
- Metrics use the pull model: Prometheus scrapes exporters. Push (Pushgateway) is only for batch jobs.
- Cardinality control: NEVER label a metric with high-cardinality labels (user_id, request_id). Use only bounded sets (status_code, method, endpoint_pattern).
- Alert hygiene: an alert means "a human must act immediately". If nobody has to act, it is a log line or a dashboard, not an alert.
- Retention: Prometheus keeps 30 days locally; long-term storage goes to Thanos/Mimir on S3 (optional). Loki keeps 90 days, compressed.
- Dashboards as code: every dashboard lives in git as JSON plus provisioning.
- Every service exports /metrics, either natively (FastAPI + prometheus_client) or through a sidecar exporter.
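The cardinality invariant can be illustrated with a small sketch. It assumes a service whose framework does not expose route patterns, so the raw path has to be collapsed by hand before it is used as a label value. The helper name `normalize_path` and the `{id}`/`{uuid}` placeholders are hypothetical, not part of any library.

```python
import re

# Hypothetical helper: collapse variable path segments so that the "path"
# label stays within a bounded set of values.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_NUMERIC = re.compile(r"^\d+$")


def normalize_path(raw_path: str) -> str:
    """Return a low-cardinality pattern for a raw URL path."""
    path = raw_path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment):
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this, `/users/12345/orders/7?page=2` and `/users/99/orders/1` both map to the single label value `/users/{id}/orders/{id}`.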
Base stack
# /opt/observability/compose.yaml
services:
  prometheus:
    image: prom/prometheus:v2.55.1
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
      # Required by docker_sd_configs in prometheus.yml; UID 65534 must be able to read the socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - obs
      - traefik_proxy
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:9090/-/healthy"]
      interval: 30s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.prometheus.rule=Host(`prom.abelentsev.pro`)"
      - "traefik.http.routers.prometheus.entrypoints=websecure"
      - "traefik.http.routers.prometheus.tls.certresolver=cloudflare"
      - "traefik.http.routers.prometheus.middlewares=auth-basic@file,ipallowlist-lan@file"
      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"

  alertmanager:
    image: prom/alertmanager:v0.28.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    secrets:
      - tg_bot_token
      - tg_chat_id
    networks:
      - obs

  loki:
    image: grafana/loki:3.3.2
    container_name: loki
    restart: unless-stopped
    user: "10001:10001"
    command: ["-config.file=/etc/loki/loki.yml"]
    volumes:
      - ./loki/loki.yml:/etc/loki/loki.yml:ro
      - loki_data:/loki
    networks:
      - obs

  promtail:
    image: grafana/promtail:3.3.2
    container_name: promtail
    restart: unless-stopped
    command: ["-config.file=/etc/promtail/promtail.yml"]
    volumes:
      - ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /opt/traefik/logs:/logs/traefik:ro
      # Required by docker_sd_configs in promtail.yml
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - obs

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    networks:
      - obs

  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    network_mode: host
    pid: host
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host/root'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/root:ro,rslave

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    networks:
      - obs

  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    environment:
      - GF_SERVER_ROOT_URL=https://grafana.abelentsev.pro
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    secrets:
      - grafana_admin
    networks:
      - obs
      - traefik_proxy
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.abelentsev.pro`)"
      - "traefik.http.routers.grafana.entrypoints=websecure"
      - "traefik.http.routers.grafana.tls.certresolver=cloudflare"
      - "traefik.http.routers.grafana.middlewares=security-headers@file"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"

networks:
  obs:
  traefik_proxy:
    external: true

volumes:
  prometheus_data:
  alertmanager_data:
  loki_data:
  grafana_data:

secrets:
  grafana_admin:
    file: ./secrets/grafana_admin.txt
  tg_bot_token:
    file: ./secrets/tg_bot_token.txt
  tg_chat_id:
    file: ./secrets/tg_chat_id.txt
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: home
    environment: production

alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager:9093]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]

  - job_name: node-exporter
    static_configs:
      - targets:
          - 192.168.9.147:9100  # docker host
          - 192.168.7.179:9100  # sonar host
          - 192.168.7.195:9100  # win host (windows_exporter)

  - job_name: cadvisor
    static_configs:
      - targets: [cadvisor:8080]

  - job_name: traefik
    metrics_path: /metrics
    static_configs:
      - targets: [traefik:8080]

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.abelentsev.pro
          - https://grafana.abelentsev.pro
          - https://git.h3fq32.golive.ru
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: docker-services
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ["prometheus.scrape=true"]
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_port]
        target_label: __address__
        regex: (.+)
        replacement: ${1}
      - source_labels: [__meta_docker_container_name]
        target_label: container
A service that should be scraped declares:
labels:
  - "prometheus.scrape=true"
  - "prometheus.port=myapp:8000"
Alerting rules (rules/web.yml)
groups:
  - name: web-services
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2 minutes"

      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m]))
          /
          sum by (service) (rate(traefik_service_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (>5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (rate(traefik_service_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency >1s on {{ $labels.service }}"

      - alert: CertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS cert {{ $labels.instance }} expires in <7 days"

  - name: host
    rules:
      - alert: HostHighCpu
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: critical

      - alert: HostHighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 10m
        labels:
          severity: warning

      # cAdvisor reports OOM kills in container_oom_events_total;
      # container_memory_failures_total only counts page faults.
      - alert: ContainerOomKilled
        expr: increase(container_oom_events_total[5m]) > 0
        for: 0m
        labels:
          severity: warning

  - name: backups
    rules:
      - alert: BackupMissing
        expr: time() - max(restic_last_successful_backup_timestamp) > 36 * 3600
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "No successful backup in last 36 hours"
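The BackupMissing rule depends on `restic_last_successful_backup_timestamp`, and the text above does not show where that metric comes from. One possible way to produce it is sketched below, under two assumptions that are not in the original: node_exporter is started with `--collector.textfile.directory` pointing at a shared directory (the compose file above does not set this flag), and a backup wrapper calls the function after restic exits successfully. The function name and the `restic.prom` file name are hypothetical.

```python
import os
import tempfile
import time
from typing import Optional

METRIC = "restic_last_successful_backup_timestamp"


def write_backup_timestamp(directory: str, timestamp: Optional[float] = None) -> str:
    """Atomically write the metric as a .prom file and return its path."""
    ts = time.time() if timestamp is None else timestamp
    body = (
        f"# HELP {METRIC} Unix time of the last successful restic backup.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {ts:.0f}\n"
    )
    target = os.path.join(directory, "restic.prom")
    # Write to a temp file in the same directory, then rename it, so the
    # collector never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, target)
    return target
```

Because the file is only rewritten on success, a failing backup leaves the old timestamp in place and the `time() - max(...)` expression keeps growing until the alert fires.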
Alertmanager → Telegram
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: telegram-critical
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="critical"]
      receiver: telegram-critical
    - matchers: [severity="warning"]
      receiver: telegram-warning
      repeat_interval: 12h

receivers:
  - name: telegram-critical
    telegram_configs:
      - bot_token_file: /run/secrets/tg_bot_token
        chat_id: -100123456789  # same value as in the tg_chat_id file
        parse_mode: HTML
        message: |
          🔥 <b>CRITICAL: {{ .CommonLabels.alertname }}</b>
          {{ range .Alerts }}
          • <b>{{ .Labels.instance }}</b>: {{ .Annotations.summary }}
          {{ end }}

  - name: telegram-warning
    telegram_configs:
      - bot_token_file: /run/secrets/tg_bot_token
        chat_id: -100123456789
        parse_mode: HTML
        message: |
          ⚠️ <b>{{ .CommonLabels.alertname }}</b>
          {{ range .Alerts }}
          • {{ .Annotations.summary }}
          {{ end }}

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]
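To check that the routing above really reaches Telegram, it can help to inject a synthetic alert. The sketch below builds a payload for Alertmanager's `POST /api/v2/alerts` endpoint using only the standard library. The alert name `TelegramDeliveryTest`, the `manual-test` instance, and both function names are hypothetical; `http://alertmanager:9093` is assumed to resolve, which is only true from a container on the `obs` network.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity: str, now: datetime, ttl_minutes: int = 5) -> list:
    """Build a payload for Alertmanager's POST /api/v2/alerts."""
    return [
        {
            "labels": {
                "alertname": "TelegramDeliveryTest",  # hypothetical name
                "severity": severity,
                "instance": "manual-test",
            },
            "annotations": {"summary": "Manual routing test, safe to ignore"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        }
    ]


def send_alert(payload: list, base_url: str = "http://alertmanager:9093") -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_alert(build_test_alert("warning", datetime.now(timezone.utc)))` should, given the routes above, end up at the `telegram-warning` receiver after `group_wait` has elapsed.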
Loki loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 90d
  max_query_series: 5000

compactor:
  working_directory: /loki/compactor
  delete_request_store: filesystem
  retention_enabled: true
Promtail — Traefik access log
# promtail.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: traefik-access
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik
          __path__: /logs/traefik/access.log
    pipeline_stages:
      - json:
          expressions:
            method: RequestMethod
            host: RequestHost
            status: DownstreamStatus
            duration: Duration
      - labels:
          method:
          host:
          status:

  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        target_label: container
      - source_labels: [__meta_docker_container_log_stream]
        target_label: stream
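The effect of the `json` and `labels` stages in the `traefik-access` job can be mimicked in a few lines of Python. This is only an illustration of which labels come out of one log line, not a replacement for Promtail. The sample line is hypothetical and shortened (real Traefik JSON entries carry many more fields), and the function name is an assumption.

```python
import json

# Fields the pipeline extracts; only the first three are promoted to labels.
EXTRACTED = {
    "method": "RequestMethod",
    "host": "RequestHost",
    "status": "DownstreamStatus",
    "duration": "Duration",
}
LABELS = ("method", "host", "status")


def extract_labels(log_line: str) -> dict:
    """Mimic the json + labels stages for one access-log line."""
    entry = json.loads(log_line)
    extracted = {name: entry.get(field) for name, field in EXTRACTED.items()}
    # Loki label values are always strings; missing fields are skipped.
    return {
        name: str(extracted[name])
        for name in LABELS
        if extracted[name] is not None
    }


# Hypothetical sample line.
SAMPLE = (
    '{"RequestMethod": "GET", "RequestHost": "grafana.abelentsev.pro", '
    '"DownstreamStatus": 200, "Duration": 1532000}'
)
```

Note that `duration` is extracted but never becomes a label: its values are unbounded, so promoting it would break the cardinality invariant on the Loki side as well.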
Instrumentation: FastAPI example
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests",
    ["method", "path", "status"],  # path is the route pattern, not the real URL!
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "path"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # expose the default registry


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    # The matched route is only present in the scope after routing has run.
    route = request.scope.get("route")
    path = route.path if route else "unknown"  # pattern, not the raw URL
    REQUESTS.labels(request.method, path, response.status_code).inc()
    LATENCY.labels(request.method, path).observe(elapsed)
    return response


@app.get("/health")
def health():
    return {"status": "ok"}
RED/USE method
RED for request-driven services:
- Rate (RPS)
- Errors (error rate)
- Duration (P50/P95/P99)
USE for resources:
- Utilization (% busy)
- Saturation (queue length)
- Errors
These six metrics are the minimum must-have dashboard.
SLO/SLI example
For a public site:
- SLI availability: 1 - error_rate over a 30-day window
- SLO: 99.5% (≈ 3.6 h of downtime per month, realistic for self-hosted)
- Error budget: 0.5%
# SLO availability over 30 days
1 - (
  sum(rate(traefik_service_requests_total{code=~"5..", service="myapp"}[30d]))
  /
  sum(rate(traefik_service_requests_total{service="myapp"}[30d]))
)
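The arithmetic behind these numbers is short enough to sketch. The two functions below are hypothetical helpers, not part of any library: the first converts an SLO into allowed downtime for a window, the second reports how much of the error budget is left given request counts (for example, the numerator and denominator of the query above).

```python
def error_budget(slo: float, window_days: int = 30) -> dict:
    """Translate an SLO (e.g. 0.995) into an error budget for the window."""
    window_hours = window_days * 24
    budget_fraction = 1.0 - slo
    return {
        "budget_fraction": budget_fraction,
        "allowed_downtime_hours": budget_fraction * window_hours,
    }


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures
```

For a 99.5% SLO over 30 days the window is 720 hours, so the budget is 0.005 × 720 = 3.6 hours. With 1,000,000 requests and 2,500 failures, half of the budget of 5,000 failed requests is still available.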
Anti-patterns
- Labeling a metric with user_id, request_id, or a raw URL: cardinality explosion (Prometheus dies at 1M+ series).
- Alerting on everything: alert fatigue, people stop reading alerts.
- A bare "CPU > 80%" alert: that is a symptom, not a problem. Alert on user-facing impact (latency, errors).
- A dashboard with 50 graphs: nobody reads it. One screen = 6-10 key metrics.
- Sampling traces blindly: rare slow requests get lost.
- Unstructured (plain text) logs: impossible to aggregate.
- Loki without retention: the disk eventually fills up.
- Grafana with the default admin/admin: public access = disaster.
- Pushing long-lived metrics to Pushgateway: state is lost on restart.
Checklist for a new service
- Endpoint /health (or /healthz): a plain 200 OK
- Endpoint /metrics: Prometheus exposition format
- Labels prometheus.scrape=true and prometheus.port=... on the Docker service
- Blackbox probe in prometheus.yml for public URLs
- Alerts: ServiceDown, HighErrorRate, HighLatency, CertExpiringSoon
- Grafana dashboard provisioned (JSON in git)
- Logs are structured JSON and shipped to Loki
- SLO documented in creator/obsidian-vault/claude/memory/observability/
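The /metrics item of the checklist can be verified mechanically. The sketch below parses a response body in the Prometheus text exposition format and reports which required sample names are absent. It is a deliberately naive parser (it only looks at sample names and ignores values and label contents), and both function names are hypothetical.

```python
def metric_names(exposition_text: str) -> set:
    """Collect sample names from Prometheus text exposition format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text: str, required: set) -> set:
    """Return the required sample names that the body does not contain."""
    return required - metric_names(exposition_text)


# Hypothetical excerpt of a /metrics response.
SAMPLE_BODY = """\
# HELP http_requests_total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 3.0
http_request_duration_seconds_bucket{le="0.005",method="GET",path="/health"} 3.0
process_cpu_seconds_total 0.42
"""
```

In a real check the body would come from an HTTP GET against the service's /metrics endpoint; an empty result from `missing_metrics` means the required samples are being exported.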
Commands
# Reload Prometheus without a restart
curl -X POST http://prometheus:9090/-/reload
# Validate the alerting rules (sh -c so the glob expands inside the container)
docker exec prometheus sh -c 'promtool check rules /etc/prometheus/rules/*.yml'
# List active alerts
curl -s http://prometheus:9090/api/v1/alerts | jq
# LogQL query via the CLI (logcli); status is a label set by the Promtail pipeline
docker run --rm -e LOKI_ADDR=http://loki:3100 \
  --network observability_obs grafana/logcli:3.3.2 \
  query '{job="traefik", status="500"}' --limit=100
# Size of the Loki storage
docker exec loki du -sh /loki/chunks
# Check scrape targets
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
Integration with the user's infrastructure
- Traefik: --metrics.prometheus=true in the static config, scraped by Prometheus.
- MikroTik: SNMP exporter (a separate prom/snmp-exporter container) → metrics for the router, VPN tunnels, hairpin NAT.
- PostgreSQL: a postgres_exporter sidecar for each database.
- 1C: windows_exporter on 192.168.7.195 + a custom exporter that reads the 1C Performance Counters (via the TGServerService agent).
- Telegram: alerts → Alertmanager → Telegram (NOT via TGServerService, but directly through Alertmanager's bot_token_file).
- Dashboards in git: creator/obsidian-vault/claude/memory/observability/dashboards/ (or a separate repo, homework/grafana-dashboards).
- n8n: weekly Telegram summary of SLO/error budget per service.