Cloud Virtual Machine (CVM) Fault Self-Healing

Last updated: 2024-10-30 16:05:02


Background

The self-healing mechanism helps improve service stability by recovering quickly from anomalies.

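Fault Self-Healing Scenarios

Self-healing scenario: Fast recovery from ping unreachable
Self-healing checklist:
  Detect a ping-unreachable alarm, a server reboot, or a similar event on the machine
  Detect that the machine has returned to normal
  Invoke a script to check the machine status and restore the machine's business processes

Self-healing scenario: Other event-related scenarios
Self-healing checklist:
  Packet loss caused by exceeding the connection limit
  Packet loss caused by exceeding the outbound public bandwidth limit
  Read-only disk
  Kernel fault
  Out of memory (OOM)
  Instance rebooted
  Ping unreachable
  Machine reboot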




Fault Self-Healing Workflow





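Automatic Configuration of the Self-Healing Cloud Function and Alarm Events

Step 1: Set the self-healing configuration parameters

File name: config.json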
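{
    "//": "The cloud function name and event rule name can be left at their default values.",
    "scf_Name": "game_auto_recovery",
    "RuleName": "game_auto_recovery",
    "//": "UIN of the root account used for the self-healing configuration.",
    "uin": 2159973417,
    "//": "Secret ID and secret key used to run the script.",
    "secretID": "xxxxxxxx",
    "secretKey": "xxxxxxxxx",
    "//": "Working directory for the self-healing command.",
    "work_Directory": "/data/home/user00",
    "//": "User name that runs the self-healing command.",
    "work_User": "user00",
    "//": "Self-healing command to run.",
    "run_Command": "nohup /home/user00/my_program.sh >/home/user00/my_test.log 2>&1 &",
    "//": "Servers covered by self-healing (comma-separated instance IDs).",
    "cvm_List": "ins-k2xxx,ins-4q9xxx"
}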
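
Before running the setup script, it can help to sanity-check config.json. The following is a minimal sketch and not part of the original tooling: the helper name validate_config and the specific checks (including the "ins-" prefix test) are assumptions for illustration. The key names are the ones used in config.json above, and an empty cvm_List is accepted because the setup script treats it as "no instance filter".

```python
# Keys used by config.json in this document.
REQUIRED_KEYS = [
    "scf_Name", "RuleName", "uin", "secretID", "secretKey",
    "work_Directory", "work_User", "run_Command", "cvm_List",
]


def validate_config(config):
    """Return a list of problems found; an empty list means the config looks usable.

    Hypothetical helper: the checks below are illustrative assumptions.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append("missing key: %s" % key)

    # cvm_List is a comma-separated string of instance IDs and may be empty.
    cvm_list = config.get("cvm_List", "")
    if not isinstance(cvm_list, str):
        problems.append("cvm_List must be a comma-separated string")
        return problems
    for instance_id in (part.strip() for part in cvm_list.split(",")):
        # Assumption: CVM instance IDs start with "ins-", as in the sample values.
        if instance_id and not instance_id.startswith("ins-"):
            problems.append("unexpected instance ID: %s" % instance_id)
    return problems


example = {key: "x" for key in REQUIRED_KEYS}
example["cvm_List"] = "ins-k2xxx,ins-4q9xxx"
print(validate_config(example))  # []
```

To check a real file, load it with json.load and pass the resulting dictionary to validate_config.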

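Step 2: Create the fault self-healing configuration script

In the same directory as config.json, create a file with the following script content. It can be named create_auto_recovery.py.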
# -*- coding: utf8 -*-
import hashlib, hmac, json, os, sys, time, base64
from datetime import datetime, timedelta
if sys.version_info[0] <= 2:
    from httplib import HTTPSConnection
else:
    from http.client import HTTPSConnection

# Compute an HMAC-SHA256 digest of msg with the given key
def sign(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

# Compute the signature and build the Authorization header value
def authen_content(timestamp, date, secret_id, secret_key, payload, service, host, algorithm, action, params):
    # ************* Step 1: Build the canonical request string *************
    http_request_method = "POST"
    canonical_uri = "/"
    canonical_querystring = ""
    ct = "application/json; charset=utf-8"
    canonical_headers = "content-type:%s\nhost:%s\nx-tc-action:%s\n" % (ct, host, action.lower())
    signed_headers = "content-type;host;x-tc-action"
    hashed_request_payload = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    canonical_request = (http_request_method + "\n" +
                         canonical_uri + "\n" +
                         canonical_querystring + "\n" +
                         canonical_headers + "\n" +
                         signed_headers + "\n" +
                         hashed_request_payload)

    # ************* Step 2: Build the string to sign *************
    credential_scope = date + "/" + service + "/" + "tc3_request"
    hashed_canonical_request = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    string_to_sign = (algorithm + "\n" +
                      str(timestamp) + "\n" +
                      credential_scope + "\n" +
                      hashed_canonical_request)

    # ************* Step 3: Compute the signature *************
    secret_date = sign(("TC3" + secret_key).encode("utf-8"), date)
    secret_service = sign(secret_date, service)
    secret_signing = sign(secret_service, "tc3_request")
    signature = hmac.new(secret_signing, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    # ************* Step 4: Build the Authorization value *************
    authorization = (algorithm + " " +
                     "Credential=" + secret_id + "/" + credential_scope + ", " +
                     "SignedHeaders=" + signed_headers + ", " +
                     "Signature=" + signature)
    return authorization


# Create the cloud function
def CreateFunction(timestamp, date, secretID, secretKey, scf_Name, work_Directory, work_User, run_Command):
    token = ""
    service = "scf"
    host = "%s.tencentcloudapi.com" % service
    endpoint = "https://" + host
    region = "ap-guangzhou"
    version = "2018-04-16"
    action = "CreateFunction"
    # Request body: creates the function from a demo template and passes the
    # self-healing settings to it as environment variables
    payload = "{\"FunctionName\":\"%s\",\"Code\":{\"DemoId\":\"demo-951rub7u\"},\"Handler\":\"index.main_handler\",\"Description\":\"Fault self-healing\",\"MemorySize\":256,\"Timeout\":100,\"Environment\":{\"Variables\":[{\"Key\":\"secretID\",\"Value\":\"%s\"},{\"Key\":\"secretKey\",\"Value\":\"%s\"},{\"Key\":\"work_Directory\",\"Value\":\"%s\"},{\"Key\":\"work_User\",\"Value\":\"%s\"},{\"Key\":\"run_Command\",\"Value\":\"%s\"}]},\"Runtime\":\"Python3.6\",\"Namespace\":\"default\"}" % (scf_Name, secretID, secretKey, work_Directory, work_User, run_Command)

    params = json.loads(payload)
    algorithm = "TC3-HMAC-SHA256"

    authorization = authen_content(timestamp, date, secretID, secretKey, payload, service, host, algorithm, action, params)

    headers = {
        "Authorization": authorization,
        "Content-Type": "application/json; charset=utf-8",
        "Host": host,
        "X-TC-Action": action,
        "X-TC-Timestamp": timestamp,
        "X-TC-Version": version
    }
    if region:
        headers["X-TC-Region"] = region
    if token:
        headers["X-TC-Token"] = token

    try:
        req = HTTPSConnection(host)
        req.request("POST", "/", headers=headers, body=payload.encode("utf-8"))
        resp = req.getresponse()
        return resp.read()
    except Exception as err:
        # On a request failure the error is printed and None is returned
        print(err)


# Create the event rule
def CreateRule(timestamp, date, secretID, secretKey, RuleName, cvm_List):
    token = ""
    service = "eb"
    host = "%s.tencentcloudapi.com" % service
    endpoint = "https://" + host
    region = "ap-guangzhou"
    version = "2021-04-16"
    action = "CreateRule"
    cvmlist = ''
    cvmlists = cvm_List.split(',')

    # Default pattern: match reboot and ping-unreachable events with no instance filter
    params = {"Enable": True, "Description": "Fault self-healing",
              "EventPattern": "{\n \"source\": \"cvm.cloud.tencent\",\n \"type\": [\n \"cvm:ErrorEvent:GuestReboot\",\n \"cvm:ErrorEvent:PingUnreachable\"\n ]\n}",
              "EventBusId": "eb-df1713l8", "RuleName": "%s" % RuleName}
    if cvm_List != "":
        # Build the quoted, comma-separated instance list for the "subject" filter
        for i in range(len(cvmlists)):
            if i < len(cvmlists)-1:
                cvmlist = cvmlist + "\n \"%s\"" % cvmlists[i] + ","
            else:
                cvmlist = cvmlist + "\n \"%s\"" % cvmlists[i]

        params = {"Enable": True, "Description": "Fault self-healing",
                  "EventPattern": "{\n \"source\": \"cvm.cloud.tencent\",\n \"type\": [\n \"cvm:ErrorEvent:GuestReboot\",\n \"cvm:ErrorEvent:PingUnreachable\"\n ],\n \"subject\": [%s\n ]\n}" % cvmlist,
                  "EventBusId": "eb-df1713l8", "RuleName": "%s" % RuleName}
    payload = json.dumps(params)
    algorithm = "TC3-HMAC-SHA256"
    authorization = authen_content(timestamp, date, secretID, secretKey, payload, service, host, algorithm, action, params)

    headers = {
        "Authorization": authorization,
        "Content-Type": "application/json; charset=utf-8",
        "Host": host,
        "X-TC-Action": action,
        "X-TC-Timestamp": timestamp,
        "X-TC-Version": version
    }
    if region:
        headers["X-TC-Region"] = region
    if token:
        headers["X-TC-Token"] = token

    try:
        req = HTTPSConnection(host)
        req.request("POST", "/", headers=headers, body=payload.encode("utf-8"))
        resp = req.getresponse()
        return resp.read()
    except Exception as err:
        # On a request failure the error is printed and None is returned
        print(err)

# Create the rule target
def CreateTarget(timestamp, date, secretID, secretKey, uin, scf_Name, RuleId):
    token = ""
    service = "eb"
    host = "%s.tencentcloudapi.com" % service
    endpoint = "https://" + host
    region = "ap-guangzhou"
    version = "2021-04-16"
    action = "CreateTarget"
    # Request body: points the rule at the $LATEST version of the cloud function
    payload = "{\"EventBusId\":\"eb-df1713l8\",\"Type\":\"scf\",\"TargetDescription\":{\"ResourceDescription\":\"qcs::scf:ap-guangzhou:uin/%s:namespace/default/function/%s/$LATEST\"},\"RuleId\":\"%s\"}" % (
        uin, scf_Name, RuleId)
    params = json.loads(payload)
    algorithm = "TC3-HMAC-SHA256"

    authorization = authen_content(timestamp, date, secretID, secretKey, payload, service, host, algorithm, action, params)

    headers = {
        "Authorization": authorization,
        "Content-Type": "application/json; charset=utf-8",
        "Host": host,
        "X-TC-Action": action,
        "X-TC-Timestamp": timestamp,
        "X-TC-Version": version
    }
    if region:
        headers["X-TC-Region"] = region
    if token:
        headers["X-TC-Token"] = token

    try:
        req = HTTPSConnection(host)
        req.request("POST", "/", headers=headers, body=payload.encode("utf-8"))
        resp = req.getresponse()
        return resp.read()
    except Exception as err:
        # On a request failure the error is printed and None is returned
        print(err)


if __name__ == '__main__':
    print("Starting fault self-healing configuration...")
    timestamp = int(time.time())
    date = datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%d")

    # Load the configuration
    with open('config.json') as file:
        config = json.load(file)

    uin = config['uin']
    secretID = config['secretID']
    secretKey = config['secretKey']
    work_Directory = config['work_Directory']
    work_User = config['work_User']
    run_Command = config['run_Command']
    scf_Name = config['scf_Name']
    RuleName = config['RuleName']
    cvm_List = config['cvm_List']

    # Create the cloud function (the Create* helpers return None if the request fails)
    result_func = CreateFunction(timestamp, date, secretID, secretKey, scf_Name, work_Directory, work_User, run_Command)
    if result_func is not None and 'Error' not in result_func.decode('utf-8'):
        print("The fault self-healing cloud function is initializing; this is expected to take about 60 seconds.")
        time.sleep(60)

    # Create the event rule
    RuleNameGet = CreateRule(timestamp, date, secretID, secretKey, RuleName, cvm_List)
    if RuleNameGet is None or 'Error' in RuleNameGet.decode('utf-8'):
        # Without a RuleId the rule target cannot be created, so stop here
        sys.exit("Failed to create the event rule: %s" % RuleNameGet)
    RuleId = json.loads(RuleNameGet.decode('utf-8'))['Response']['RuleId']
    print("Configuring the server event rule; this is expected to take about 10 seconds.")

    # Create the event rule target
    result_target = CreateTarget(timestamp, date, secretID, secretKey, uin, scf_Name, RuleId)
    if result_target is not None and 'Error' not in result_target.decode('utf-8'):
        print("Fault self-healing has been configured. For custom configuration, see https://cloud.tencent.com/document/product/248/109573.")
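
CreateRule assembles the EventPattern JSON by string concatenation. As an alternative sketch, the same pattern can be built as a dictionary and serialized with json.dumps, which avoids hand-written escaping. The function name build_event_pattern is hypothetical; the source, type, and subject values are taken from the script above. Unlike the script, this sketch also strips whitespace around instance IDs, which is an added assumption.

```python
import json


def build_event_pattern(cvm_list):
    """Build the EventPattern string for the rule (hypothetical helper).

    cvm_list is the comma-separated instance ID string from config.json.
    When it is empty, no "subject" filter is added, as in the script.
    """
    pattern = {
        "source": "cvm.cloud.tencent",
        "type": [
            "cvm:ErrorEvent:GuestReboot",
            "cvm:ErrorEvent:PingUnreachable",
        ],
    }
    # Assumption: surrounding whitespace and empty entries are dropped.
    instance_ids = [part.strip() for part in cvm_list.split(",") if part.strip()]
    if instance_ids:
        pattern["subject"] = instance_ids
    return json.dumps(pattern, indent=2)


print(build_event_pattern("ins-k2xxx,ins-4q9xxx"))
```

The returned string could be passed as the "EventPattern" value in the request parameters in place of the concatenated literal.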

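Step 3: Run the script to create the self-healing configuration

On a CVM instance, place config.json and the self-healing script (named create_auto_recovery.py) in the same directory, then run create_auto_recovery.py to create the self-healing configuration automatically.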




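Manual Configuration of the Self-Healing Cloud Function and Alarm Events

Step 1: Configure the cloud function

2. Select the region and namespace for the function service, for example the Guangzhou region and the default namespace, then click Create.
3. Create the function. Choose Create from template and search for "故障" (fault) to find the cloud function template. For details, see Create Function. Refer to the figure below for the specific parameters:



4. Set the cloud function timeout. Go to Function Management > Function Configuration, click Edit, and set the timeout under Environment Configuration.
5. The cloud function has 4 environment variables. Set them to the values you need to pass.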
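
Inside a cloud function, settings like these are typically read from the process environment. The sketch below is an illustration under stated assumptions: the template's actual code is not shown in this document, the helper name load_settings is hypothetical, and the variable names are the five that the automatic setup script in this document passes. The console template may expose a different set, so adjust ENV_KEYS to match what you see there.

```python
import os

# Assumed names: the variables the automatic setup script in this document
# passes to the function. Adjust to match what the console actually shows.
ENV_KEYS = ["secretID", "secretKey", "work_Directory", "work_User", "run_Command"]


def load_settings(environ=None):
    """Return (settings, missing): values that are set, and names that are not."""
    environ = os.environ if environ is None else environ
    # Treat empty strings the same as unset variables.
    settings = {key: environ[key] for key in ENV_KEYS if environ.get(key)}
    missing = [key for key in ENV_KEYS if key not in settings]
    return settings, missing


settings, missing = load_settings()
if missing:
    print("Unset environment variables: %s" % ", ".join(missing))
```

Checking for missing values up front makes a misconfigured function fail with a clear message instead of partway through a recovery attempt.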




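Step 2: Configure the alarm event rule

2. Select the event bus and region, for example the Guangzhou region and the default event bus.
3. Click Create and fill in the information as prompted on the page, as shown in the figure below. For the procedure, see Create Event Rule.
Event matching: select Cloud Virtual Machine as the cloud service type and ping unreachable as the event type.
Event target: select cloud function as the trigger method, then select the target cloud function you configured.
Trigger method: select Cloud Function (SCF).
Namespace: select the namespace in which the cloud function was created. The default is default.
Version and alias: select the version of your cloud function. The default is "$LATEST".
Batch delivery: select Disabled.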
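
To make the rule's effect concrete, the sketch below shows a simplified matcher: an event matches when every field named in the pattern has a value that the pattern lists. This is an illustration only and not the event service's actual matching engine; the function name matches and the sample event are assumptions. The source and type strings are the ones used by the automatic setup script in this document.

```python
def matches(pattern, event):
    """Return True if every field in the pattern accepts the event's value."""
    for field, allowed in pattern.items():
        # A pattern value may be a single string or a list of accepted strings.
        accepted = allowed if isinstance(allowed, list) else [allowed]
        if event.get(field) not in accepted:
            return False
    return True


# Pattern equivalent to choosing CVM and "ping unreachable" in the console.
pattern = {
    "source": "cvm.cloud.tencent",
    "type": ["cvm:ErrorEvent:PingUnreachable"],
}
# Hypothetical event for one of the sample instances.
event = {
    "source": "cvm.cloud.tencent",
    "type": "cvm:ErrorEvent:PingUnreachable",
    "subject": "ins-k2xxx",
}
print(matches(pattern, event))  # True
```

Fields that the pattern does not mention, such as subject here, do not affect the result, which is why a rule without an instance filter applies to every instance.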




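Step 3: Configure account permissions

If the account reports the following error, the current account lacks the required permission and the permission must be added.
"Error": {"Code": "ResourceNotFound.RoleNotFound", "Message": "The specified role `TAT_QCSLinkedRoleInCommand` is not found, please add the role for your account."}
To add the permission: go to the CAM console > Roles. The corresponding product checks whether the authorized role exists and, if it does not, guides you through creating it. For example, see the figure below:



Note:
If the prompt shown in the figure above does not appear, click here to add the required permission.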
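
When scripting against the API, the error code can be checked programmatically rather than by searching the response text. The following is a minimal sketch, assuming the error object sits under a top-level Response key in the same way as the RuleId lookup in the setup script; the helper name get_error_code is hypothetical.

```python
import json


def get_error_code(raw_response):
    """Return the error code from a JSON response body, or None if there is no error."""
    if isinstance(raw_response, bytes):
        raw_response = raw_response.decode("utf-8")
    body = json.loads(raw_response)
    error = body.get("Response", {}).get("Error")
    return error.get("Code") if error else None


# Sample body built from the error shown above (the Response wrapper is assumed).
sample = json.dumps({"Response": {"Error": {
    "Code": "ResourceNotFound.RoleNotFound",
    "Message": "The specified role `TAT_QCSLinkedRoleInCommand` is not found, "
               "please add the role for your account.",
}}})

if get_error_code(sample) == "ResourceNotFound.RoleNotFound":
    print("Missing role: add it in the CAM console before retrying.")
```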

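Fault Self-Healing Drill Procedure

1. Go to Tencent Cloud Smart Advisor > Chaos Engineering > Drill Management and add a server shutdown fault action. For the creation steps, see Console Quick Start.
2. Run the server fault drill to simulate the server going down and rebooting.



3. Fault drill alarms and event triggering.
View the alarm information on the Tencent Cloud Observability Platform > Alarm Management > Alarm History page. After the server reboots, there are 2 alarm events: one triggered when the server goes down, and one triggered after it recovers.



4. Cloud function deployment result. Go to the Serverless > Function Service page, open the function created in Step 1, and check the result in the function code.



5. Cloud function execution result for the fault. View the function's run logs in Log Query.



6. Actual self-healing result on the server. After the server starts, the script is launched automatically.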




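Expected Benefits

Operation: Server exception event > start the game process with an O&M tool or manually > check the process status
Estimated time: 20 to 30 minutes

Operation: Server exception event > automatically launch and check the process
Estimated time: 5 minutes

Efficiency gain: about 75% of the time saved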
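
The figure of about 75% can be checked against the times above: it corresponds to the 20-minute lower bound of the manual process, and the saving is larger against the 30-minute upper bound. A quick calculation (the function name time_saved is only for illustration):

```python
def time_saved(manual_minutes, automated_minutes):
    """Fraction of time saved by automation relative to the manual process."""
    return (manual_minutes - automated_minutes) / float(manual_minutes)


for manual in (20, 30):
    print("%d min manual -> %.0f%% saved" % (manual, 100 * time_saved(manual, 5)))
# 20 min manual -> 75% saved
# 30 min manual -> 83% saved
```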

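Cost Calculation

For the cloud function pay-as-you-go pricing, see the SCF basic capability product pricing.



Actual cost calculation. You can estimate the usage of each billable item based on your business needs and calculate the cost with the Price Calculator.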