Multi-Site Health Monitor

Overview

The Multi-Site Health Monitor skill automates continuous monitoring of 10-100+ websites with configurable health checks, intelligent alert routing, and automatic incident escalation. This production-grade monitoring solution integrates with Slack, PagerDuty, Datadog, Google Sheets, and WordPress to provide real-time visibility into your digital infrastructure.

Why This Matters

- Prevent Revenue Loss: Detect downtime in seconds, not hours
- Reduce Alert Fatigue: Smart thresholds and deduplication prevent notification overload
- Automate Incident Response: Auto-restart failed services, escalate to on-call teams
- Multi-Channel Alerts: Route critical issues to PagerDuty, warnings to Slack, metrics to Datadog
- Historical Analysis: Track uptime trends, identify patterns, generate compliance reports

Key Integrations

- Slack: Real-time alerts, incident channels, status dashboards
- PagerDuty: Automatic incident creation, on-call escalation, incident tracking
- Datadog: Metric ingestion, custom dashboards, anomaly detection
- Google Sheets: Automated reporting, SLA tracking, audit logs
- WordPress: Monitor plugin health, theme updates, core vulnerabilities
- AWS/Azure: Auto-restart EC2 instances, trigger Lambda functions, scale infrastructure

Quick Start

Try these example prompts immediately:

Example 1: Monitor 5 Critical Sites with Slack Alerts

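Monitor these sites every 5 minutes and alert via Slack on any failure:
- https://api.example.com/health
- https://app.example.com/status
- https://cdn.example.com/ping
- https://wordpress.example.com/wp-json/health
- https://db.example.com/check

Alert rules:
- Critical (page is down): Slack #incidents + PagerDuty
- Warning (slow response, > 3 seconds): Slack #alerts
- Info (certificate expires in < 30 days): Google Sheets log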

Example 2: Auto-Restart Failed Services

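Monitor https://payment-service.example.com/health every 2 minutes.
If it fails 3 consecutive times:
1. POST to https://restart-api.example.com/restart-payment-service
2. Send a PagerDuty incident: "Payment Service Down"
3. Notify Slack #critical-incidents
4. Log to Google Sheets with timestamp, error details, and restart status

Response timeout: 10 seconds
Expected response: HTTP 200 with {"status": "healthy"}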

Example 3: WordPress Multi-Site Monitoring

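Monitor the health and security of these WordPress sites:
- https://site1.example.com/wp-json/wp/v2/health-check
- https://site2.example.com/wp-json/wp/v2/health-check
- https://site3.example.com/wp-json/wp/v2/health-check

Checks:
- Core updates available (warning if more than 1 week out of date)
- Plugin vulnerabilities (critical if any are found)
- Database connection (critical if down)
- SSL certificate expiration (warning if fewer than 30 days remain)

Alert targets:
- Critical: PagerDuty + Slack #wordpress-critical
- Warning: Slack #wordpress-alerts
- Info: Google Sheets #monitoring-log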

Example 4: Performance Threshold Monitoring

Monitor https://api.example.com/metrics every 10 minutes.
Alert if:
- Response time > 2000ms (warning) or > 5000ms (critical)
- Error rate > 1% (warning) or > 5% (critical)
- CPU usage > 70% (warning) or > 90% (critical)
- Memory usage > 80% (warning) or > 95% (critical)

Send metrics to Datadog with tags: env:prod, service:api, team:backend
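
The warning and critical limits in this example map naturally onto a small lookup table. The following is a minimal Python sketch of that idea, not the skill's actual implementation. The metric keys (response_time_ms, error_rate_pct, cpu_pct, memory_pct) and the function names are hypothetical, and it assumes a value must strictly exceed a limit to count as a breach.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Limits copied from Example 4; the metric keys are hypothetical names.
THRESHOLDS: Dict[str, Threshold] = {
    "response_time_ms": Threshold(2000, 5000),
    "error_rate_pct": Threshold(1, 5),
    "cpu_pct": Threshold(70, 90),
    "memory_pct": Threshold(80, 95),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the value is within limits."""
    limits = THRESHOLDS[metric]
    if value > limits.critical:
        return "critical"
    if value > limits.warning:
        return "warning"
    return None


def evaluate(sample: Dict[str, float]) -> Dict[str, str]:
    """Map every breached metric in one sample to its severity."""
    breaches = {}
    for metric, value in sample.items():
        severity = classify(metric, value)
        if severity is not None:
            breaches[metric] = severity
    return breaches
```

Calling evaluate on one scraped sample returns only the metrics that need an alert, which keeps the routing step separate from the threshold logic.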

Capabilities

1. Multi-Protocol Health Checks

Monitor endpoints via:

- HTTP/HTTPS: GET, POST, HEAD requests with custom headers
- TCP: Port connectivity checks (e.g., database ports 3306, 5432)
- DNS: Domain resolution, DNS propagation verification
- SSL/TLS: Certificate validity, expiration warnings, chain verification
- Ping/ICMP: Basic connectivity for infrastructure nodes

Example: Monitor API health with custom authentication
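Endpoint: https://api.example.com/health
Method: POST
Headers:
  Authorization: Bearer YOUR_API_KEY
  User-Agent: MultiSiteMonitor/1.0.0
Expected status: 200
Expected body: {"status": "healthy", "version": "2.1.0"}
Timeout: 10 seconds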
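
For the TCP item in the list above, a port connectivity check can be approximated with the Python standard library alone. This is a sketch under assumptions: the function name check_tcp and the 10-second default timeout are illustrative choices, not part of the skill.

```python
import socket
import time
from typing import Optional, Tuple


def check_tcp(host: str, port: int, timeout: float = 10.0) -> Tuple[bool, Optional[float]]:
    """Try to open a TCP connection; return (is_up, connect_time_ms)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, elapsed_ms
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False, None
```

For instance, check_tcp("db.example.com", 5432) would probe a PostgreSQL port; the hostname is a placeholder.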

2. Intelligent Alert Routing

- Severity-Based Routing: Critical → PagerDuty + Slack + SMS, Warning → Slack only, Info → Sheets log
- Deduplication: Suppress duplicate alerts within 5-minute window
- Escalation Rules: Auto-escalate if critical issue unresolved for 30+ minutes
- Custom Thresholds: Define per-endpoint sensitivity (e.g., API endpoint stricter than blog)
- Quiet Hours: Suppress non-critical alerts during maintenance windows
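
Severity-based routing and deduplication can be combined in one small component. The sketch below is one possible shape, assuming alerts are identified by an (endpoint, severity) pair and that the 5-minute window is measured from the last alert actually sent. The class name AlertRouter and the channel labels are hypothetical.

```python
import time
from typing import Dict, List, Optional, Tuple

# Severity-to-channel mapping taken from the routing rule above.
ROUTES: Dict[str, List[str]] = {
    "critical": ["pagerduty", "slack", "sms"],
    "warning": ["slack"],
    "info": ["sheets"],
}
DEDUP_WINDOW_SECONDS = 300  # 5-minute suppression window


class AlertRouter:
    def __init__(self) -> None:
        # Timestamp of the last alert sent for each (endpoint, severity) pair.
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def route(self, endpoint: str, severity: str, now: Optional[float] = None) -> List[str]:
        """Return the channels to notify, or [] if the alert is a duplicate."""
        now = time.time() if now is None else now
        key = (endpoint, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return []
        self._last_sent[key] = now
        return list(ROUTES[severity])
```

Passing now explicitly makes the window easy to exercise in tests; in normal use it defaults to the current time.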

3. Automatic Incident Response

- Webhook Triggers: POST to custom endpoints (restart services, scale infrastructure)
- AWS Integration: Auto-restart EC2 instances, trigger Lambda functions
- Service Restart: Execute shell commands on remote servers via SSH
- Rollback Triggers: Revert deployments if health checks fail
- Notification Actions: Create tickets in Jira, GitHub Issues, or Linear
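
Automatic restarts are usually gated on several consecutive failed checks, so that a single timeout does not restart a healthy service. The sketch below assumes a threshold of three failures. The class name FailureTracker is hypothetical, and the restart action itself (webhook, SSH command, or AWS call) is left to the caller.

```python
class FailureTracker:
    """Counts consecutive failed checks and signals when to act."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result.

        Returns True only on the check that reaches the threshold, so the
        caller triggers the restart once per outage rather than on every
        subsequent failure.
        """
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold
```

A successful check resets the counter, so intermittent failures separated by healthy responses never reach the threshold.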

4. Performance Metrics & Trending

- Response Time Tracking: Detect slowdowns before they become critical
- Uptime Calculation: Real-time SLA tracking (99.9%, 99.95%, 99.99%)
- Error Rate Monitoring: Track HTTP 4xx, 5xx, timeout errors
- Datadog Integration: Send custom metrics for dashboards and alerts
- Historical Reporting: Generate monthly uptime reports, SLA compliance docs
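
The SLA targets above translate into concrete downtime budgets. As a rough illustration, the following sketch computes uptime from check counts and the allowed downtime for a target. It assumes a 30-day reporting period and that every check carries equal weight; the function names are hypothetical.

```python
def uptime_percent(successful_checks: int, total_checks: int) -> float:
    """Share of checks that succeeded, as a percentage."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an SLA target over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0
```

Over 30 days (43,200 minutes), a 99.9% target allows about 43.2 minutes of downtime, 99.95% about 21.6 minutes, and 99.99% about 4.3 minutes.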

5. WordPress-Specific Monitoring

- Core Updates: Alert when WordPress core updates available
- Plugin Vulnerabilities: Check against WordPress vulnerability database
- Theme Security: Monitor for outdated or vulnerable themes
- Database Health: Monitor wp_options, table integrity, query performance
- User Activity: Track suspicious login attempts, new admin accounts
- Backup Verification: Confirm backups complete successfully

6. Compliance & Audit Logging

- Google Sheets Integration: Automatic logging of all checks, alerts, actions
- Audit Trail: Who triggered what, when, and what happened
- SLA Reports: Monthly/quarterly compliance reports (99.9% uptime proof)
- Change Tracking: Document all configuration changes with timestamps
- Export Formats: CSV, JSON, PDF for compliance submissions

Configuration

Required Environment Variables

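# Slack notifications
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
export SLACK_CHANNEL="#incidents"  # or #alerts, #monitoring, etc.

# PagerDuty incident creation
export PAGERDUTY_API_KEY="YOUR_PAGERDUTY_API_KEY"
export PAGERDUTY_SERVICE_ID="YOUR_SERVICE_ID"

# Datadog metric ingestion
export DATADOG_API_KEY="YOUR_DATADOG_API_KEY"
export DATADOG_APP_KEY="YOUR_DATADOG_APP_KEY"
export DATADOG_SITE="datadoghq.com"  # or datadoghq.eu

# Google Sheets logging
export GOOGLE_SHEETS_ID="YOUR_SPREADSHEET_ID"
export GOOGLE_SERVICE_ACCOUNT_JSON="/path/to/service-account.json"

# AWS auto-restart (optional)
export AWS_ACCESS_KEY_ID="YOUR_AWS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET"
export AWS_REGION="us-east-1"

# SSH remote service restart (optional)
export SSH_PRIVATE_KEY="/path/to/private/key"
export SSH_USER="deploy"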

Configuration File Format (YAML)

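# monitors.yaml
monitors:
  - name: Production API
    url: https://api.example.com/health
    interval: 300  # seconds
    timeout: 10
    method: GET
    expected_status: 200
    expected_body_contains: "healthy"
    alert_rules:
      critical:
        - slack_channel: "#critical-incidents"
        - pagerduty_severity: critical
      warning:
        - slack_channel: "#alerts"
    auto_restart:
      enabled: true
      command: systemctl restart api-service
      max_retries: 3
      retry_delay: 60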

Setup Instructions

1. Create monitoring config: Save the YAML above as monitors.yaml
2. Set environment variables: Source a .env file containing all required API keys
3. Initialize Google Sheets: Create a spreadsheet and share it with the service account email
4. Test endpoints: Run multi-site-health-monitor --validate to verify all URLs respond
5. Deploy: Run as a systemd service or Docker container for continuous monitoring

Example Outputs

Slack Alert (Critical)

CODEBLOCK7

Google Sheets Log Entry

CODEBLOCK8

PagerDuty Incident

CODEBLOCK9

Datadog Metrics Sent

multi_site_monitor.health_check.response_time:245ms (tags: service:api, env:prod)
multi_site_monitor.health_check.status:200 (tags: service:api, env:prod)
multi_site_monitor.health_check.availability:99.87 (tags: service:api, env:prod)
multi_site_monitor.auto_restart.attempts:1 (tags: service:api, env:prod)

Tips & Best Practices

1. Optimal Check Intervals

- Critical APIs: 60-120 seconds (detects issues in 2-4 minutes)
- Standard Services: 300 seconds (5 minutes, good balance)
- Non-Critical Endpoints: 600-900 seconds (10-15 minutes, reduces noise)
- Batch Jobs: 1800+ seconds (30+ minutes, less frequent monitoring)

2. Threshold Tuning

- Start Conservative: Begin with loose thresholds and tighten them over 2 weeks
- Account for Variance: Set response time thresholds at 2-3x the baseline response time
- Error Rate: 0.1-1% warning, 5%+ critical (adjust per service SLA)
- Test Thresholds: Deliberately fail endpoints to verify that alert routing works
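
One way to apply the 2-3x guidance is to derive thresholds from recent response-time samples. The sketch below makes two assumptions that are not stated in this document: it uses the median as the baseline (to dampen outliers), and it maps 2x to the warning level and 3x to the critical level. The function name is hypothetical.

```python
import statistics
from typing import Dict, Sequence


def suggest_thresholds(
    samples_ms: Sequence[float],
    warning_factor: float = 2.0,
    critical_factor: float = 3.0,
) -> Dict[str, float]:
    """Derive response-time thresholds from observed samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    # Median baseline: a single slow outlier does not inflate the thresholds.
    baseline = statistics.median(samples_ms)
    return {
        "baseline_ms": baseline,
        "warning_ms": baseline * warning_factor,
        "critical_ms": baseline * critical_factor,
    }
```

Re-running this on fresh samples during the two-week tuning period gives a simple, repeatable basis for tightening the limits.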

3. Reducing Alert Fatigue

- Deduplication: Suppress identical alerts within 5-minute window
- Smart Escalation: Only escalate if issue persists >30 minutes
- Quiet Hours: Disable non-critical alerts 2am-6am (adjust per timezone)
- Severity Mapping: Not everything is critical; use warning/info for minor issues
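
The quiet-hours rule reduces to a time-window test that critical alerts bypass. A minimal sketch follows, assuming the timestamp passed in is already expressed in the team's local timezone; the function names are hypothetical, and the wrap-around branch handles windows that cross midnight (for example 22:00 to 06:00).

```python
from datetime import datetime, time

QUIET_START = time(2, 0)  # 02:00, from the quiet-hours tip above
QUIET_END = time(6, 0)    # 06:00


def in_quiet_hours(now: datetime, start: time = QUIET_START, end: time = QUIET_END) -> bool:
    """True when now falls inside the half-open window [start, end)."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return current >= start or current < end


def should_notify(severity: str, now: datetime) -> bool:
    """Critical alerts always go out; others are held during quiet hours."""
    return severity == "critical" or not in_quiet_hours(now)
```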

4. WordPress-Specific Best Practices

- Check Core Updates Weekly: Set interval to 604,800 seconds (7 days)
- Monitor Plugin Health Daily: Check for vulnerabilities, outdated plugins
- Database Backups: Verify backup completion status in health endpoint
- Staging Environment: Monitor staging sites separately to catch issues before production
- Custom Health Endpoints: Create /wp-json/custom/health returning comprehensive data

5. Cost Optimization

- Batch Checks: Group 5-10 checks into single HTTP request where possible
- Datadog Sampling: Send detailed metrics every 5 minutes, summary every hour
- Google Sheets: Batch writes (max 100 rows per request) to reduce API calls
- PagerDuty: Use deduplication to avoid triggering duplicate incidents
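
Batching log rows before writing them is straightforward to sketch. The helper below only splits buffered rows into groups of at most 100, matching the limit in the Google Sheets tip above; the actual write call is omitted because it depends on the client library in use. The function name is hypothetical.

```python
from typing import Iterator, List, Sequence


def batched(rows: Sequence[list], batch_size: int = 100) -> Iterator[List[list]]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])
```

With 250 buffered rows this produces three write requests (100, 100, and 50 rows) instead of 250 single-row calls.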

6. Security Hardening

- API Key Rotation: Rotate all API keys monthly
- VPC Monitoring: Monitor internal endpoints from private subnets only
- IP Whitelisting: Restrict health check endpoints

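Multi-Site Health Monitor

Overview

The Multi-Site Health Monitor skill automates continuous monitoring of 10-100+ websites with configurable health checks, intelligent alert routing, and automatic incident escalation. This production-grade monitoring solution integrates with Slack, PagerDuty, Datadog, Google Sheets, and WordPress to provide real-time visibility into your digital infrastructure.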

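Why This Matters

- Prevent Revenue Loss: Detect downtime in seconds, not hours
- Reduce Alert Fatigue: Smart thresholds and deduplication prevent notification overload
- Automate Incident Response: Auto-restart failed services, escalate to on-call teams
- Multi-Channel Alerts: Route critical issues to PagerDuty, warnings to Slack, metrics to Datadog
- Historical Analysis: Track uptime trends, identify patterns, generate compliance reports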

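Key Integrations

- Slack: Real-time alerts, incident channels, status dashboards
- PagerDuty: Automatic incident creation, on-call escalation, incident tracking
- Datadog: Metric ingestion, custom dashboards, anomaly detection
- Google Sheets: Automated reporting, SLA tracking, audit logs
- WordPress: Monitor plugin health, theme updates, core vulnerabilities
- AWS/Azure: Auto-restart EC2 instances, trigger Lambda functions, scale infrastructure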

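Quick Start

Try these example prompts immediately:

Example 1: Monitor 5 Critical Sites with Slack Alerts

Monitor these sites every 5 minutes and alert via Slack on any failure:
- https://api.example.com/health
- https://app.example.com/status
- https://cdn.example.com/ping
- https://wordpress.example.com/wp-json/health
- https://db.example.com/check

Alert rules:
- Critical (page is down): Slack #incidents + PagerDuty
- Warning (slow response, > 3 seconds): Slack #alerts
- Info (certificate expires in < 30 days): Google Sheets log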

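Example 2: Auto-Restart Failed Services

Monitor https://payment-service.example.com/health every 2 minutes.
If it fails 3 consecutive times:
1. POST to https://restart-api.example.com/restart-payment-service
2. Send a PagerDuty incident: "Payment Service Down"
3. Notify Slack #critical-incidents
4. Log to Google Sheets with timestamp, error details, and restart status

Response timeout: 10 seconds
Expected response: HTTP 200 with {"status": "healthy"}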

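Example 3: WordPress Multi-Site Monitoring

Monitor the health and security of these WordPress sites:
- https://site1.example.com/wp-json/wp/v2/health-check
- https://site2.example.com/wp-json/wp/v2/health-check
- https://site3.example.com/wp-json/wp/v2/health-check

Checks:
- Core updates available (warning if more than 1 week out of date)
- Plugin vulnerabilities (critical if any are found)
- Database connection (critical if down)
- SSL certificate expiration (warning if fewer than 30 days remain)

Alert targets:
- Critical: PagerDuty + Slack #wordpress-critical
- Warning: Slack #wordpress-alerts
- Info: Google Sheets #monitoring-log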

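Example 4: Performance Threshold Monitoring

Monitor https://api.example.com/metrics every 10 minutes.
Alert if:
- Response time > 2000ms (warning) or > 5000ms (critical)
- Error rate > 1% (warning) or > 5% (critical)
- CPU usage > 70% (warning) or > 90% (critical)
- Memory usage > 80% (warning) or > 95% (critical)

Send metrics to Datadog with tags: env:prod, service:api, team:backend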

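Capabilities

1. Multi-Protocol Health Checks

Monitor endpoints via:

- HTTP/HTTPS: GET, POST, HEAD requests with custom headers
- TCP: Port connectivity checks (e.g., database ports 3306, 5432)
- DNS: Domain resolution, DNS propagation verification
- SSL/TLS: Certificate validity, expiration warnings, chain verification
- Ping/ICMP: Basic connectivity for infrastructure nodes

Example: Monitor API health with custom authentication

Endpoint: https://api.example.com/health
Method: POST
Headers:
  Authorization: Bearer YOUR_API_KEY
  User-Agent: MultiSiteMonitor/1.0.0
Expected status: 200
Expected body: {"status": "healthy", "version": "2.1.0"}
Timeout: 10 seconds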

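2. Intelligent Alert Routing

- Severity-Based Routing: Critical → PagerDuty + Slack + SMS, Warning → Slack only, Info → Sheets log
- Deduplication: Suppress duplicate alerts within 5-minute window
- Escalation Rules: Auto-escalate if critical issue unresolved for 30+ minutes
- Custom Thresholds: Define per-endpoint sensitivity (e.g., API endpoint stricter than blog)
- Quiet Hours: Suppress non-critical alerts during maintenance windows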

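3. Automatic Incident Response

- Webhook Triggers: POST to custom endpoints (restart services, scale infrastructure)
- AWS Integration: Auto-restart EC2 instances, trigger Lambda functions
- Service Restart: Execute shell commands on remote servers via SSH
- Rollback Triggers: Revert deployments if health checks fail
- Notification Actions: Create tickets in Jira, GitHub Issues, or Linear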

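4. Performance Metrics & Trending

- Response Time Tracking: Detect slowdowns before they become critical
- Uptime Calculation: Real-time SLA tracking (99.9%, 99.95%, 99.99%)
- Error Rate Monitoring: Track HTTP 4xx, 5xx, timeout errors
- Datadog Integration: Send custom metrics for dashboards and alerts
- Historical Reporting: Generate monthly uptime reports, SLA compliance docs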

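5. WordPress-Specific Monitoring

- Core Updates: Alert when WordPress core updates available
- Plugin Vulnerabilities: Check against WordPress vulnerability database
- Theme Security: Monitor for outdated or vulnerable themes
- Database Health: Monitor wp_options, table integrity, query performance
- User Activity: Track suspicious login attempts, new admin accounts
- Backup Verification: Confirm backups complete successfully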

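6. Compliance & Audit Logging

- Google Sheets Integration: Automatic logging of all checks, alerts, actions
- Audit Trail: Who triggered what, when, and what happened
- SLA Reports: Monthly/quarterly compliance reports (99.9% uptime proof)
- Change Tracking: Document all configuration changes with timestamps
- Export Formats: CSV, JSON, PDF for compliance submissions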

Configuration

Required Environment Variables

# Slack notifications
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
export SLACK_CHANNEL="#incidents"  # or #alerts, #monitoring, etc.

# PagerDuty incident creation
export PAGERDUTY_API_KEY="YOUR_PAGERDUTY_API_KEY"
export PAGERDUTY_SERVICE_ID="YOUR_SERVICE_ID"

# Datadog metric ingestion
export DATADOG_API_KEY="YOUR_DATADOG_API_KEY"
export DATADOG_APP_KEY="YOUR_DATADOG_APP_KEY"
export DATADOG_SITE="datadoghq.com"  # or datadoghq.eu

# Google Sheets logging
export GOOGLE_SHEETS_ID="YOUR_SPREADSHEET_ID"
export GOOGLE_SERVICE_ACCOUNT_JSON="/path/to/service-account.json"

# AWS auto-restart (optional)
export AWS_ACCESS_KEY_ID="YOUR_AWS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET"
export AWS_REGION="us-east-1"

# SSH remote service restart (optional)
export SSH_PRIVATE_KEY="/path/to/private/key"
export SSH_USER="deploy"

Configuration File Format (YAML)

# monitors.yaml
monitors:
  - name: Production API
    url: https://api.example.com/health
    interval: 300  # seconds
    timeout: 10
    method: GET
    expected_status: 200
    expected_body_contains: "healthy"
    alert_rules:
      critical:
        - slack_channel: "#critical-incidents"
        - pagerduty_severity: critical
      warning:
        - slack_channel: "#alerts"
    auto_restart:
      enabled: true
      command: systemctl restart api-service
      max_retries: 3
      retry_delay: 60
