Multi-Site Health Monitor
Overview
The Multi-Site Health Monitor skill automates continuous monitoring of 10-100+ websites with configurable health checks, intelligent alert routing, and automatic incident escalation. This production-grade monitoring solution integrates with Slack, PagerDuty, Datadog, Google Sheets, and WordPress to provide real-time visibility into your digital infrastructure.
Why This Matters
- Prevent Revenue Loss: Detect downtime in seconds, not hours
- Reduce Alert Fatigue: Smart thresholds and deduplication prevent notification overload
- Automate Incident Response: Auto-restart failed services, escalate to on-call teams
- Multi-Channel Alerts: Route critical issues to PagerDuty, warnings to Slack, metrics to Datadog
- Historical Analysis: Track uptime trends, identify patterns, generate compliance reports
Key Integrations
- Slack: Real-time alerts, incident channels, status dashboards
- PagerDuty: Automatic incident creation, on-call escalation, incident tracking
- Datadog: Metric ingestion, custom dashboards, anomaly detection
- Google Sheets: Automated reporting, SLA tracking, audit logs
- WordPress: Monitor plugin health, theme updates, core vulnerabilities
- AWS/Azure: Auto-restart EC2 instances, trigger Lambda functions, scale infrastructure
Quick Start
Try these example prompts immediately:
Example 1: Monitor 5 Critical Sites with Slack Alerts
CODEBLOCK0
Example 2: Auto-Restart Failed Services
CODEBLOCK1
Example 3: WordPress Multi-Site Monitoring
CODEBLOCK2
Example 4: Performance Threshold Monitoring
Monitor https://api.example.com/metrics every 10 minutes.
Alert if:
- Response time > 2000ms (warning) or > 5000ms (critical)
- Error rate > 1% (warning) or > 5% (critical)
- CPU usage > 70% (warning) or > 90% (critical)
- Memory usage > 80% (warning) or > 95% (critical)
Send metrics to Datadog with tags: env:prod, service:api, team:backend
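The warning and critical limits above can be expressed as a small lookup table. The following is a minimal sketch, not the skill's actual implementation; the metric key names and the `classify` helper are hypothetical names chosen for illustration.

```python
from typing import Optional

# Thresholds from Example 4: metric -> (warning limit, critical limit).
# The key names are assumptions, not identifiers used by the skill.
THRESHOLDS = {
    "response_time_ms": (2000, 5000),
    "error_rate_pct": (1, 5),
    "cpu_pct": (70, 90),
    "memory_pct": (80, 95),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the value is within limits."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


print(classify("response_time_ms", 2500))  # warning
print(classify("cpu_pct", 95))             # critical
```

Checking the critical limit first means a value above both limits is reported once, at the higher severity.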
Capabilities
1. Multi-Protocol Health Checks
Monitor endpoints via:
- HTTP/HTTPS: GET, POST, HEAD requests with custom headers
- TCP: Port connectivity checks (e.g., database ports 3306, 5432)
- DNS: Domain resolution, DNS propagation verification
- SSL/TLS: Certificate validity, expiration warnings, chain verification
- Ping/ICMP: Basic connectivity for infrastructure nodes
Example: Monitor API health with custom authentication
CODEBLOCK4
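For the SSL/TLS check, an expiration warning comes down to date arithmetic on the certificate's notAfter field. The sketch below uses Python's standard `ssl.cert_time_to_seconds`; the function names and the 30-day default are assumptions for illustration, not the skill's own code.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def expiry_warning(not_after: str, warn_days: int = 30,
                   now: Optional[float] = None) -> bool:
    """True when fewer than `warn_days` days remain (assumed default: 30)."""
    return days_until_expiry(not_after, now) < warn_days
```

Passing `now` explicitly keeps the calculation testable; in a live check it would be left as `None` so the current time is used.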
2. Intelligent Alert Routing
- Severity-Based Routing: Critical → PagerDuty + Slack + SMS, Warning → Slack only, Info → Sheets log
- Deduplication: Suppress duplicate alerts within 5-minute window
- Escalation Rules: Auto-escalate if critical issue unresolved for 30+ minutes
- Custom Thresholds: Define per-endpoint sensitivity (e.g., API endpoint stricter than blog)
- Quiet Hours: Suppress non-critical alerts during maintenance windows
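Severity-based routing and the 5-minute deduplication window can be combined in one small router. This is a sketch under stated assumptions: the `AlertRouter` class, the channel names, and the choice to key duplicates on (endpoint, severity) are all hypothetical.

```python
from typing import Dict, List, Tuple

# Severity -> destinations, following the routing rule listed above.
ROUTES = {
    "critical": ["pagerduty", "slack", "sms"],
    "warning": ["slack"],
    "info": ["sheets"],
}


class AlertRouter:
    """Routes alerts by severity and suppresses repeats inside a time window."""

    def __init__(self, dedup_window_s: float = 300.0) -> None:
        self.dedup_window_s = dedup_window_s
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def route(self, endpoint: str, severity: str, now: float) -> List[str]:
        """Return the destinations to notify, or [] for a suppressed duplicate."""
        key = (endpoint, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window_s:
            return []
        self._last_sent[key] = now
        return ROUTES[severity]


router = AlertRouter()
print(router.route("api", "critical", now=0.0))    # ['pagerduty', 'slack', 'sms']
print(router.route("api", "critical", now=120.0))  # []
```

Keying on severity as well as endpoint lets a warning that worsens to critical still get through inside the window.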
3. Automatic Incident Response
- Webhook Triggers: POST to custom endpoints (restart services, scale infrastructure)
- AWS Integration: Auto-restart EC2 instances, trigger Lambda functions
- Service Restart: Execute shell commands on remote servers via SSH
- Rollback Triggers: Revert deployments if health checks fail
- Notification Actions: Create tickets in Jira, GitHub Issues, or Linear
4. Performance Metrics & Trending
- Response Time Tracking: Detect slowdowns before they become critical
- Uptime Calculation: Real-time SLA tracking (99.9%, 99.95%, 99.99%)
- Error Rate Monitoring: Track HTTP 4xx, 5xx, timeout errors
- Datadog Integration: Send custom metrics for dashboards and alerts
- Historical Reporting: Generate monthly uptime reports, SLA compliance docs
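To make the SLA tiers concrete: over a 30-day period, 99.9% uptime leaves about 43.2 minutes of allowed downtime. The sketch below shows that arithmetic; the helper names and the 30-day period are assumptions, and the skill may compute uptime differently.

```python
def allowed_downtime_minutes(sla_pct: float, period_days: int = 30) -> float:
    """Downtime budget in minutes for a given SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_pct) / 100


def uptime_pct(successful_checks: int, total_checks: int) -> float:
    """Uptime as the share of health checks that succeeded."""
    if total_checks == 0:
        raise ValueError("no checks recorded")
    return 100 * successful_checks / total_checks


for sla in (99.9, 99.95, 99.99):
    print(sla, round(allowed_downtime_minutes(sla), 2))
```

Check-based uptime is only as fine-grained as the check interval, so a 5-minute interval cannot distinguish a 10-second blip from a 4-minute outage.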
5. WordPress-Specific Monitoring
- Core Updates: Alert when WordPress core updates available
- Plugin Vulnerabilities: Check against WordPress vulnerability database
- Theme Security: Monitor for outdated or vulnerable themes
- Database Health: Monitor wp_options, table integrity, query performance
- User Activity: Track suspicious login attempts, new admin accounts
- Backup Verification: Confirm backups complete successfully
6. Compliance & Audit Logging
- Google Sheets Integration: Automatic logging of all checks, alerts, actions
- Audit Trail: Who triggered what, when, and what happened
- SLA Reports: Monthly/quarterly compliance reports (99.9% uptime proof)
- Change Tracking: Document all configuration changes with timestamps
- Export Formats: CSV, JSON, PDF for compliance submissions
Configuration
Required Environment Variables
CODEBLOCK5
Configuration File Format (YAML)
CODEBLOCK6
Setup Instructions
1. Create monitoring config: Save the YAML above as INLINECODE0
2. Set environment variables: Source the .env file with all required API keys
3. Initialize Google Sheets: Create a spreadsheet and share it with the service account email
4. Test endpoints: Run multi-site-health-monitor --validate to verify all URLs respond
5. Deploy: Run as a systemd service or Docker container for continuous monitoring
Example Outputs
Slack Alert (Critical)
CODEBLOCK7
Google Sheets Log Entry
CODEBLOCK8
PagerDuty Incident
CODEBLOCK9
Datadog Metrics Sent
multi_site_monitor.health_check.response_time:245ms (tags: service:api, env:prod)
multi_site_monitor.health_check.status:200 (tags: service:api, env:prod)
multi_site_monitor.health_check.availability:99.87 (tags: service:api, env:prod)
multi_site_monitor.auto_restart.attempts:1 (tags: service:api, env:prod)
Tips & Best Practices
1. Optimal Check Intervals
- Critical APIs: 60-120 seconds (detects issues in 2-4 minutes)
- Standard Services: 300 seconds (5 minutes, good balance)
- Non-Critical Endpoints: 600-900 seconds (10-15 minutes, reduces noise)
- Batch Jobs: 1800+ seconds (30+ minutes, less frequent monitoring)
2. Threshold Tuning
- Start Conservative: Begin with loose thresholds, tighten over 2 weeks
- Account for Variance: Set response time thresholds 2-3x slower than baseline
- Error Rate: 0.1-1% warning, 5%+ critical (adjust per service SLA)
- Test Thresholds: Deliberately fail endpoints to verify alert routing works
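One way to apply the 2-3x guidance is to derive thresholds from observed response times. This is a sketch only: using the median as the baseline, and mapping 2x to warning and 3x to critical, are assumptions rather than rules stated by the skill.

```python
import statistics
from typing import Dict, Sequence


def suggest_thresholds(samples_ms: Sequence[float],
                       warn_factor: float = 2.0,
                       crit_factor: float = 3.0) -> Dict[str, float]:
    """Suggest response-time thresholds as multiples of the median baseline."""
    baseline = statistics.median(samples_ms)
    return {
        "baseline_ms": baseline,
        "warning_ms": baseline * warn_factor,
        "critical_ms": baseline * crit_factor,
    }


print(suggest_thresholds([200, 250, 300]))
```

The median is used because a handful of slow outliers would otherwise inflate the baseline and loosen the thresholds.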
3. Reducing Alert Fatigue
- Deduplication: Suppress identical alerts within 5-minute window
- Smart Escalation: Only escalate if issue persists >30 minutes
- Quiet Hours: Disable non-critical alerts 2am-6am (adjust per timezone)
- Severity Mapping: Not everything is critical; use warning/info for minor issues
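A quiet-hours rule amounts to a time-window test that never applies to critical alerts. The sketch below assumes the 2am-6am window mentioned above and a hypothetical `should_suppress` helper; it also handles windows that wrap past midnight.

```python
from datetime import time


def should_suppress(severity: str, local_time: time,
                    start: time = time(2, 0), end: time = time(6, 0)) -> bool:
    """True when a non-critical alert falls inside the quiet-hours window."""
    if severity == "critical":
        return False  # critical alerts are always delivered
    if start <= end:
        return start <= local_time < end
    # Window wraps past midnight, e.g. 22:00-06:00.
    return local_time >= start or local_time < end


print(should_suppress("warning", time(3, 30)))   # True
print(should_suppress("critical", time(3, 30)))  # False
```

`local_time` should already be converted to the timezone the window is defined in.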
4. WordPress-Specific Best Practices
- Check Core Updates Weekly: Set interval to 604,800 seconds (7 days)
- Monitor Plugin Health Daily: Check for vulnerabilities, outdated plugins
- Database Backups: Verify backup completion status in health endpoint
- Staging Environment: Monitor staging sites separately to catch issues before production
- Custom Health Endpoints: Create /wp-json/custom/health returning comprehensive data
5. Cost Optimization
- Batch Checks: Group 5-10 checks into single HTTP request where possible
- Datadog Sampling: Send detailed metrics every 5 minutes, summary every hour
- Google Sheets: Batch writes (max 100 rows per request) to reduce API calls
- PagerDuty: Use deduplication to avoid triggering duplicate incidents
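Batching log rows before writing them to Google Sheets can be sketched as simple chunking. The `batched_rows` helper is hypothetical, and the call that actually sends each batch to the Sheets API is deliberately left out.

```python
from typing import Iterator, List, Sequence


def batched_rows(rows: Sequence[list], batch_size: int = 100) -> Iterator[List[list]]:
    """Yield rows in chunks of at most `batch_size` (100 per the tip above)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


rows = [[i, "ok"] for i in range(250)]
print([len(batch) for batch in batched_rows(rows)])  # [100, 100, 50]
```

With 250 pending rows this needs three write requests instead of 250.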
6. Security Hardening
- API Key Rotation: Rotate all API keys monthly
- VPC Monitoring: Monitor internal endpoints from private subnets only
- IP Whitelisting: Restrict health check endpoints
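Multi-Site Health Monitor
Overview
The Multi-Site Health Monitor skill automates continuous monitoring of 10-100+ websites with configurable health checks, intelligent alert routing, and automatic incident escalation. This production-grade monitoring solution integrates with Slack, PagerDuty, Datadog, Google Sheets, and WordPress to provide real-time visibility into your digital infrastructure.
Why This Matters
- Prevent Revenue Loss: Detect downtime in seconds, not hours
- Reduce Alert Fatigue: Smart thresholds and deduplication prevent notification overload
- Automate Incident Response: Auto-restart failed services, escalate to on-call teams
- Multi-Channel Alerts: Route critical issues to PagerDuty, warnings to Slack, metrics to Datadog
- Historical Analysis: Track uptime trends, identify patterns, generate compliance reports
Key Integrations
- Slack: Real-time alerts, incident channels, status dashboards
- PagerDuty: Automatic incident creation, on-call escalation, incident tracking
- Datadog: Metric ingestion, custom dashboards, anomaly detection
- Google Sheets: Automated reporting, SLA tracking, audit logs
- WordPress: Monitor plugin health, theme updates, core vulnerabilities
- AWS/Azure: Auto-restart EC2 instances, trigger Lambda functions, scale infrastructure
Quick Start
Try these example prompts immediately: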
Example 1: Monitor 5 Critical Sites with Slack Alerts
Monitor these sites every 5 minutes and alert via Slack on any failure:
- https://api.example.com/health
- https://app.example.com/status
- https://cdn.example.com/ping
- https://wordpress.example.com/wp-json/health
- https://db.example.com/check
Alert rules:
- Critical (page down): Slack #incidents + PagerDuty
- Warning (slow response >3 seconds): Slack #alerts
- Info (certificate expires in <30 days): Google Sheets log
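Example 2: Auto-Restart Failed Services
Monitor https://payment-service.example.com/health every 2 minutes.
If it fails 3 consecutive times:
1. POST to https://restart-api.example.com/restart-payment-service
2. Send a "Payment Service Down" incident to PagerDuty
3. Notify Slack #critical-incidents
4. Log to Google Sheets with timestamp, error details, and restart status
Response timeout: 10 seconds
Expected response: HTTP 200 with {"status": "healthy"}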
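The "3 consecutive failures" rule can be tracked with a counter that resets on any healthy check. This is a minimal sketch with a hypothetical `FailureTracker` class; the restart webhook call itself is left out. With a 2-minute interval, the third failed check lands roughly 4 to 6 minutes after the outage begins.

```python
class FailureTracker:
    """Counts consecutive failed checks and signals when the threshold is hit."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when the restart should fire."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.consecutive_failures == self.threshold


tracker = FailureTracker()
for healthy in (False, False, False):
    fire = tracker.record(healthy)
print(fire)  # True
```

Firing only on the check that reaches the threshold avoids re-triggering the restart on every later failure while the service is still down.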
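Example 3: WordPress Multi-Site Monitoring
Monitor the health and security of these WordPress sites:
- https://site1.example.com/wp-json/wp/v2/health-check
- https://site2.example.com/wp-json/wp/v2/health-check
- https://site3.example.com/wp-json/wp/v2/health-check
Checks:
- Core updates available (warning if not updated for more than 1 week)
- Plugin vulnerabilities (critical if any are found)
- Database connectivity (critical if disconnected)
- SSL certificate expiration (warning if fewer than 30 days remain)
Alert targets:
- Critical: PagerDuty + Slack #wordpress-critical
- Warning: Slack #wordpress-alerts
- Info: Google Sheets #monitoring-log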
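Example 4: Performance Threshold Monitoring
Monitor https://api.example.com/metrics every 10 minutes.
Alert if:
- Response time > 2000ms (warning) or > 5000ms (critical)
- Error rate > 1% (warning) or > 5% (critical)
- CPU usage > 70% (warning) or > 90% (critical)
- Memory usage > 80% (warning) or > 95% (critical)
Send metrics to Datadog with tags: env:prod, service:api, team:backend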
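Capabilities
1. Multi-Protocol Health Checks
Monitor endpoints via:
- HTTP/HTTPS: GET, POST, HEAD requests with custom headers
- TCP: Port connectivity checks (e.g., database ports 3306, 5432)
- DNS: Domain resolution, DNS propagation verification
- SSL/TLS: Certificate validity, expiration warnings, chain verification
- Ping/ICMP: Basic connectivity for infrastructure nodes
Example: Monitor API health with custom authentication
Endpoint: https://api.example.com/health
Method: POST
Headers:
  Authorization: Bearer YOUR_API_KEY
  User-Agent: MultiSiteMonitor/1.0.0
Expected status: 200
Expected response body: {"status": "healthy", "version": "2.1.0"}
Timeout: 10 seconds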
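Checking a response against the expected status and body can be sketched as a pure function, separate from the HTTP request. The `response_is_healthy` helper is hypothetical, and treating the expected body as a set of fields that must match (extra fields allowed) is an assumption.

```python
import json
from typing import Any, Dict, Optional


def response_is_healthy(status_code: int, body: str,
                        expected_status: int = 200,
                        expected_body: Optional[Dict[str, Any]] = None) -> bool:
    """Compare a response with the expected status and JSON body fields."""
    if status_code != expected_status:
        return False
    if expected_body is None:
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # a non-JSON body cannot match the expected fields
    return isinstance(payload, dict) and all(
        payload.get(key) == value for key, value in expected_body.items()
    )


print(response_is_healthy(
    200,
    '{"status": "healthy", "version": "2.1.0"}',
    expected_body={"status": "healthy"},
))  # True
```

Matching only the listed fields keeps the check stable when the endpoint adds new keys, such as a build timestamp.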
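2. Intelligent Alert Routing
- Severity-Based Routing: Critical → PagerDuty + Slack + SMS, Warning → Slack only, Info → Sheets log
- Deduplication: Suppress duplicate alerts within 5-minute window
- Escalation Rules: Auto-escalate if critical issue unresolved for 30+ minutes
- Custom Thresholds: Define per-endpoint sensitivity (e.g., API endpoint stricter than blog)
- Quiet Hours: Suppress non-critical alerts during maintenance windows
3. Automatic Incident Response
- Webhook Triggers: POST to custom endpoints (restart services, scale infrastructure)
- AWS Integration: Auto-restart EC2 instances, trigger Lambda functions
- Service Restart: Execute shell commands on remote servers via SSH
- Rollback Triggers: Revert deployments if health checks fail
- Notification Actions: Create tickets in Jira, GitHub Issues, or Linear
4. Performance Metrics & Trending
- Response Time Tracking: Detect slowdowns before they become critical
- Uptime Calculation: Real-time SLA tracking (99.9%, 99.95%, 99.99%)
- Error Rate Monitoring: Track HTTP 4xx, 5xx, timeout errors
- Datadog Integration: Send custom metrics for dashboards and alerts
- Historical Reporting: Generate monthly uptime reports, SLA compliance docs
5. WordPress-Specific Monitoring
- Core Updates: Alert when WordPress core updates available
- Plugin Vulnerabilities: Check against WordPress vulnerability database
- Theme Security: Monitor for outdated or vulnerable themes
- Database Health: Monitor wp_options, table integrity, query performance
- User Activity: Track suspicious login attempts, new admin accounts
- Backup Verification: Confirm backups complete successfully
6. Compliance & Audit Logging
- Google Sheets Integration: Automatic logging of all checks, alerts, actions
- Audit Trail: Who triggered what, when, and what happened
- SLA Reports: Monthly/quarterly compliance reports (99.9% uptime proof)
- Change Tracking: Document all configuration changes with timestamps
- Export Formats: CSV, JSON, PDF for compliance submissions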
Configuration
Required Environment Variables
# Slack notifications
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
export SLACK_CHANNEL="#incidents"  # or #alerts, #monitoring, etc.
# PagerDuty incident creation
export PAGERDUTY_API_KEY=YOUR_PAGERDUTY_API_KEY
export PAGERDUTY_SERVICE_ID=YOUR_SERVICE_ID
# Datadog metric ingestion
export DATADOG_API_KEY=YOUR_DATADOG_API_KEY
export DATADOG_APP_KEY=YOUR_DATADOG_APP_KEY
export DATADOG_SITE=datadoghq.com  # or datadoghq.eu
# Google Sheets logging
export GOOGLE_SHEETS_ID=YOUR_SPREADSHEET_ID
export GOOGLE_SERVICE_ACCOUNT_JSON=/path/to/service-account.json
# AWS auto-restart (optional)
export AWS_ACCESS_KEY_ID=YOUR_AWS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET
export AWS_REGION=us-east-1
# SSH remote service restart (optional)
export SSH_PRIVATE_KEY=/path/to/private/key
export SSH_USER=deploy
Configuration File Format (YAML)
# monitors.yaml
monitors:
  - name: Production API
    url: https://api.example.com/health
    interval: 300  # seconds
    timeout: 10
    method: GET
    expected_status: 200
    expected_body_contains: healthy
    alert_rules:
      critical:
        - slack_channel: "#critical-incidents"
        - pagerduty_severity: critical
      warning:
        - slack_channel: "#alerts"
    auto_restart:
      enabled: true
      command: systemctl restart api-service
      max_retries: 3
      retry_delay: 60
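Once the YAML is loaded into plain dictionaries by a parser of your choice, each monitor entry can be validated before use. The sketch below is an assumption about how that might look: the `Monitor` dataclass, its defaults, and the rule that `timeout` must be shorter than `interval` are illustrative, not the skill's documented behavior.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class Monitor:
    name: str
    url: str
    interval: int = 300          # seconds between checks
    timeout: int = 10            # seconds before a check counts as failed
    method: str = "GET"
    expected_status: int = 200


def parse_monitor(raw: Dict[str, Any]) -> Monitor:
    """Build a Monitor from one entry of the `monitors` list, rejecting bad values."""
    for key in ("name", "url"):
        if not raw.get(key):
            raise ValueError(f"monitor is missing required field: {key}")
    known = {f.name for f in fields(Monitor)}
    # Keys this sketch does not model (alert_rules, auto_restart, ...) are ignored.
    monitor = Monitor(**{k: v for k, v in raw.items() if k in known})
    if not monitor.url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL scheme: {monitor.url}")
    if monitor.timeout >= monitor.interval:
        raise ValueError("timeout must be shorter than interval")
    return monitor


monitor = parse_monitor({
    "name": "Production API",
    "url": "https://api.example.com/health",
    "interval": 300,
    "timeout": 10,
})
print(monitor.method, monitor.expected_status)  # GET 200
```

Validating at load time surfaces a missing URL or an oversized timeout before the first check runs.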