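After you create monitoring and alarms in AWS CloudWatch, you can combine an SNS topic with a Lambda function to push alarm notifications to an alert group chat such as DingTalk, WeCom, or Feishu. This article walks through how to push CloudWatch alarms to an alert group, using Feishu as the example.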
If you already have an alert group, skip this step.
Copy the bot's webhook address for later use:
https://open.feishu.cn/open-apis/bot/v2/hook/xxx
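Before wiring up AWS, it can help to confirm that the webhook accepts messages. The following is a minimal sketch using only the Python standard library; the helper names build_text_message and send_message are illustrative, and the "xxx" token is the placeholder from above, not a working address.

```python
import json
import urllib.request

# Placeholder webhook address; replace "xxx" with the real token.
WEBHOOK_URL = "https://open.feishu.cn/open-apis/bot/v2/hook/xxx"


def build_text_message(text):
    """Build the JSON body for a plain-text Feishu bot message."""
    return {"msg_type": "text", "content": {"text": text}}


def send_message(webhook_url, payload, timeout=5):
    """POST the payload to the webhook and return the decoded JSON reply."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send_message(WEBHOOK_URL, build_text_message("webhook test")) with a real webhook address should make the text appear in the alert group.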
Choose a Python runtime for the Lambda function.
import json

# requests is not part of the Lambda Python runtime;
# bundle it in the deployment package or attach it as a layer
import requests

# Webhook address of the Feishu bot
WEBHOOK_URL = "https://open.feishu.cn/open-apis/bot/v2/hook/xxx"


def build_card(title, content, template):
    """Build a Feishu interactive card; template sets the header color."""
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True},
            "header": {
                "template": template,
                "title": {"tag": "plain_text", "content": title},
            },
            "elements": [
                {"tag": "div", "text": {"tag": "lark_md", "content": content}},
            ],
        },
    }


def lambda_handler(event, context):
    message = event['Records'][0]['Sns']
    timestamp = message['Timestamp']
    subject = message.get('Subject') or ''
    # SNS delivers the CloudWatch alarm as a JSON string; parse it before use
    sns_message = message['Message']
    if isinstance(sns_message, str):
        sns_message = json.loads(sns_message)
    trigger = sns_message['Trigger']

    # Pick the card title and header color from the state in the subject
    if "ALARM" in subject:
        title, template = '[AI Production] Alarm!!', 'red'
    elif "OK" in subject:
        title, template = '[AI Production] Recovered!', 'green'
    else:
        title, template = '[AI Production] Abnormal alarm state', 'red'

    content = "\n".join([
        "**[Details]**",
        "**Time**: {}".format(timestamp),
        "**Subject**: {}".format(subject),
        "**State**: {} => {}".format(sns_message['OldStateValue'],
                                     sns_message['NewStateValue']),
        "**AWS Region**: {}".format(sns_message['Region']),
        "**Monitored resource**: {}".format(trigger['Namespace']),
        "**Metric**: {}".format(trigger['MetricName']),
        "**Alarm name**: {}".format(sns_message['AlarmName']),
        # AlarmDescription can be null when the alarm has no description
        "**Alarm description**: {}".format(sns_message.get('AlarmDescription') or ''),
        "**Alarm details**: {}".format(sns_message['NewStateReason']),
    ])

    try:
        response = requests.post(
            WEBHOOK_URL,
            json=build_card(title, content, template),
            timeout=5)
        print(response)
        print(response.json())
    except Exception as e:
        print("err=" + str(e))
        # Exception objects are not JSON serializable, so return the text
        return {"error": str(e)}
The address the request is posted to is the Feishu bot webhook address copied earlier.
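Rather than hardcoding the webhook address in the script, one option is to read it from a Lambda environment variable. This is a minimal sketch; the variable name FEISHU_WEBHOOK_URL and the helper get_webhook_url are assumptions made for illustration, not something the original script defines.

```python
import os

# Hypothetical variable name; set it in the function's environment variables.
WEBHOOK_ENV_VAR = "FEISHU_WEBHOOK_URL"


def get_webhook_url(environ=None):
    """Return the webhook URL from the environment, failing loudly if unset."""
    environ = os.environ if environ is None else environ
    url = environ.get(WEBHOOK_ENV_VAR, "").strip()
    if not url:
        raise RuntimeError(WEBHOOK_ENV_VAR + " is not set")
    return url
```

Keeping the address out of the code means the same script can serve several alert groups, and the token does not end up in version control.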
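For the structure of the event that SNS forwards to the Lambda function after receiving a CloudWatch alarm, see the link below. Note that in a real delivery the "Message" field is a JSON-encoded string; the sample shows it expanded into an object for readability.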
https://docs.aws.amazon.com/zh_cn/lambda/latest/dg/with-sns.html
{
  "Records": [
    {
      "EventVersion": "1.0",
      "EventSubscriptionArn": "arn:aws:sns:us-east-1:123456789012:sns-lambda:21be56ed-a058-49f5-8c98-aedd2564c486",
      "EventSource": "aws:sns",
      "Sns": {
        "SignatureVersion": "1",
        "Timestamp": "2019-01-02T12:45:07.000Z",
        "Signature": "tcc6faL2yUC6dgZdmrwh1Y4cGa/ebXEkAi6RibDsvpi+tE/1+82j...65r==",
        "SigningCertUrl": "https://sns.us-east-1.amazonaws.com/SimpleNotificationService-ac565b8b1a6c5d002d285f9598aa1d9b.pem",
        "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e",
        "Message": {
          "AlarmName": "test-cpu-alarm",
          "AlarmDescription": "Created from EC2 Console",
          "AWSAccountId": "382497278384",
          "AlarmConfigurationUpdatedTimestamp": "2023-05-08T14:25:48.891+0000",
          "NewStateValue": "ALARM",
          "NewStateReason": "Threshold Crossed: 1 datapoint [10.513693396388868 (08/05/23 15:04:00)] was greater than or equal to the threshold (10.0).",
          "StateChangeTime": "2023-05-08T15:09:11.111+0000",
          "Region": "Asia Pacific (Singapore)",
          "AlarmArn": "arn:aws:cloudwatch:ap-southeast-1:382497278384:alarm:test-cpu-alarm",
          "OldStateValue": "OK",
          "OKActions": [],
          "AlarmActions": [
            "arn:aws:sns:ap-southeast-1:382497278384:test_ec2_alarm_topic"
          ],
          "InsufficientDataActions": [],
          "Trigger": {
            "MetricName": "CPUUtilization",
            "Namespace": "AWS/EC2",
            "StatisticType": "Statistic",
            "Statistic": "AVERAGE",
            "Unit": null,
            "Dimensions": [
              {
                "value": "i-0e6969aba46132ce4",
                "name": "InstanceId"
              }
            ],
            "Period": 300,
            "EvaluationPeriods": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "Threshold": 10,
            "TreatMissingData": "",
            "EvaluateLowSampleCountPercentile": ""
          }
        },
        "MessageAttributes": {
          "Test": {
            "Type": "String",
            "Value": "TestString"
          },
          "TestBinary": {
            "Type": "Binary",
            "Value": "TestBinary"
          }
        },
        "Type": "Notification",
        "UnsubscribeUrl": "https://sns.us-east-1.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-1:123456789012:test-lambda:21be56ed-a058-49f5-8c98-aedd2564c486",
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:sns-lambda",
        "Subject": "TestInvoke"
      }
    }
  ]
}
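To test against something closer to a real delivery, the alarm payload can be wrapped as a JSON string the way SNS sends it. The sketch below is illustrative only: make_sns_event and parse_alarm are hypothetical helpers, the envelope is trimmed to the fields the handler reads, the subject text is an example value, and reusing StateChangeTime as the SNS timestamp is a simplification.

```python
import json


def make_sns_event(alarm, subject, topic_arn):
    """Wrap an alarm dict in a minimal SNS-to-Lambda event envelope."""
    return {
        "Records": [
            {
                "EventSource": "aws:sns",
                "Sns": {
                    "Type": "Notification",
                    "TopicArn": topic_arn,
                    "Subject": subject,
                    # Simplification: reuse the alarm's state change time
                    "Timestamp": alarm["StateChangeTime"],
                    # Real SNS deliveries carry the alarm as a JSON string
                    "Message": json.dumps(alarm),
                },
            }
        ]
    }


def parse_alarm(event):
    """Return the alarm dict from the first SNS record of an event."""
    raw = event["Records"][0]["Sns"]["Message"]
    return json.loads(raw) if isinstance(raw, str) else raw


alarm = {
    "AlarmName": "test-cpu-alarm",
    "NewStateValue": "ALARM",
    "OldStateValue": "OK",
    "StateChangeTime": "2023-05-08T15:09:11.111+0000",
    "Trigger": {"MetricName": "CPUUtilization", "Namespace": "AWS/EC2"},
}
event = make_sns_event(
    alarm,
    subject='ALARM: "test-cpu-alarm" in Asia Pacific (Singapore)',
    topic_arn="arn:aws:sns:ap-southeast-1:382497278384:test_ec2_alarm_topic",
)
parsed = parse_alarm(event)
print(parsed["AlarmName"], parsed["OldStateValue"], "=>", parsed["NewStateValue"])
# prints: test-cpu-alarm OK => ALARM
```

Because parse_alarm accepts both a string and an already-expanded object, the same handler logic works with the sample event above and with live SNS traffic.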
Send a test event:
If the message is delivered successfully, the Python script works.
For the subscription protocol, choose AWS Lambda, then select the Lambda function created above as the endpoint.
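The same subscription can also be created through the API. The following is a sketch under assumptions: it expects boto3-style SNS and Lambda clients to be passed in, the statement id "allow-sns-invoke" is a made-up value, and the helper names are illustrative. When the subscription is created through the API, the function also needs a resource policy statement that lets SNS invoke it.

```python
def subscription_requests(topic_arn, function_arn):
    """Build the keyword arguments for the two API calls involved."""
    subscribe = {
        "TopicArn": topic_arn,
        "Protocol": "lambda",
        "Endpoint": function_arn,
    }
    # SNS must be allowed to invoke the function
    permission = {
        "FunctionName": function_arn,
        "StatementId": "allow-sns-invoke",  # hypothetical statement id
        "Action": "lambda:InvokeFunction",
        "Principal": "sns.amazonaws.com",
        "SourceArn": topic_arn,
    }
    return subscribe, permission


def subscribe_lambda(sns_client, lambda_client, topic_arn, function_arn):
    """Grant SNS invoke permission, then subscribe the function to the topic."""
    subscribe, permission = subscription_requests(topic_arn, function_arn)
    lambda_client.add_permission(**permission)
    return sns_client.subscribe(**subscribe)["SubscriptionArn"]
```

With boto3 installed, the clients would come from boto3.client("sns") and boto3.client("lambda").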
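Up to this point, events from the SNS subscription are pushed to the Lambda function, which runs the Python script, but there is no event source yet.
You need to create alarms on monitoring metrics, set their trigger rules, and associate them with the SNS topic.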
1. EC2
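Choose to create an alarm and configure it to send a notification to an SNS topic when the alarm fires; select the topic created above.
Configure the alarm metric and threshold. Common metrics include CPU utilization, memory utilization, and disk utilization. Then set the percentage threshold; exceeding it triggers the alarm.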
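The console settings above correspond roughly to the arguments of the CloudWatch put_metric_alarm API. The sketch below only builds those arguments; the function name, the alarm naming scheme, and the default threshold of 80 percent are assumptions for illustration. Note that EC2 publishes CPU utilization by default, while memory and disk utilization require the CloudWatch agent on the instance.

```python
def cpu_alarm_params(instance_id, topic_arn, threshold=80.0):
    """Build put_metric_alarm arguments for an EC2 CPU utilization alarm."""
    return {
        "AlarmName": "cpu-high-" + instance_id,  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint
        "EvaluationPeriods": 1,    # datapoints that must breach
        "Threshold": threshold,    # percent
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # notify when the alarm fires
        "OKActions": [topic_arn],     # notify when it recovers
    }
```

With boto3 installed, the alarm could be created with boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(instance_id, topic_arn)). Setting OKActions as well as AlarmActions matters here: without it, no recovery notification reaches the topic.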
Switch to the Logs & events tab and create an alarm.
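For the primary instance, mainly monitor CPU utilization (you can also monitor write latency, read latency, write throughput, read throughput, and similar metrics). For a replica, in addition to CPU you can monitor the replication lag metric:
The configuration above triggers an alarm when a replication lag of 5 seconds persists for 5 minutes.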
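The "threshold sustained over a duration" rule can be pictured with a small model. This is a simplified sketch, not CloudWatch's actual evaluation logic: it assumes one datapoint per minute, five evaluation periods, a greater-than-or-equal comparison, and no missing data, and the lag values are made up.

```python
def should_alarm(datapoints, threshold, periods):
    """Return True if the last `periods` datapoints all reach the threshold."""
    if periods <= 0 or len(datapoints) < periods:
        return False
    return all(value >= threshold for value in datapoints[-periods:])


# Replica lag in seconds, one datapoint per minute (made-up values)
lag = [0.4, 1.2, 5.3, 6.1, 7.8, 5.0, 9.4]
print(should_alarm(lag, threshold=5, periods=5))       # True
print(should_alarm(lag[:-1], threshold=5, periods=5))  # False
```

A single spike does not fire the alarm in this model; the lag has to stay at or above the threshold for every datapoint in the window.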
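Select the CPU utilization metric for the cluster or for an individual node, then configure a threshold. When the threshold is exceeded, a notification is sent to the specified SNS topic, which invokes the Lambda function to post the alarm to the Feishu bot.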
https://aws.amazon.com/cn/blogs/china/enterprise-wechat-and-dingtalk-receiving-amazon-cloudwatch-alarms/
https://docs.aws.amazon.com/zh_cn/lambda/latest/dg/with-sns.html
https://blog.51cto.com/wutengfei/4361109
https://blog.51cto.com/wutengfei/4556905
This article is shared from the PersistentCoder WeChat official account.