Principles of Chaos Engineering

顾翔

Published 2022-04-04 13:28:57

Included in the column: 啄木鸟软件测试

Original text

PRINCIPLES OF CHAOS ENGINEERING

https://principlesofchaos.org/

Last Update: 2019 March (changes)

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on the heels of these benefits: How much confidence we can have in the complex systems that we put into production?

Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.

CHAOS IN PRACTICE

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
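
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the toy service, the 1% baseline error rate, the fallback flag, the seeds, and the 2% tolerance are all hypothetical, and steady state is taken to be the request success rate.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupResult:
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests


def handle_request(rng: random.Random, dependency_down: bool, has_fallback: bool) -> bool:
    """Toy service: fails on a baseline error, or when the dependency is down with no fallback."""
    if rng.random() < 0.01:  # hypothetical 1% baseline error rate
        return False
    if dependency_down and not has_fallback:
        return False
    return True


def run_group(seed: int, requests: int, dependency_down: bool, has_fallback: bool) -> GroupResult:
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    successes = sum(handle_request(rng, dependency_down, has_fallback) for _ in range(requests))
    return GroupResult(requests, successes)


def run_experiment(has_fallback: bool, tolerance: float = 0.02) -> bool:
    """Return True if the steady-state hypothesis survives the experiment."""
    # Step 1: steady state is the success rate.
    # Step 2: hypothesis is that both groups stay within `tolerance` of each other.
    control = run_group(seed=1, requests=10_000, dependency_down=False, has_fallback=has_fallback)
    # Step 3: introduce the variable (dependency outage) in the experimental group only.
    experiment = run_group(seed=2, requests=10_000, dependency_down=True, has_fallback=has_fallback)
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    return abs(control.success_rate - experiment.success_rate) <= tolerance
```

In this toy setup, `run_experiment(has_fallback=True)` leaves the hypothesis standing, while `run_experiment(has_fallback=False)` disproves it and points at the missing fallback as the target for improvement.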

ADVANCED PRINCIPLES

The following principles describe an ideal application of Chaos Engineering, applied to the processes of experimentation described above. The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.

Build a Hypothesis around Steady State Behavior

Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
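
As a rough illustration of the metrics named above, the sketch below derives throughput, error rate, and latency percentiles from a short window of request records. The `Request` record, the field names, and the choice of the nearest-rank percentile method are assumptions made for this example; real monitoring systems may compute percentiles differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def steady_state(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize one observation window as a proxy for steady state."""
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }
```

Note that every value here is observable from outside the system, which matches the advice to measure output rather than internal attributes.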

Vary Real-world Events

Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
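
One way to picture the prioritization described above is a small catalogue of candidate events that can be ordered by either criterion. The impact scores and yearly frequencies below are invented for illustration only; in practice they would come from incident history and team judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    kind: str                # "hardware", "software", or "non-failure"
    impact: int              # hypothetical score, 1 (minor) to 5 (severe)
    yearly_frequency: float  # hypothetical estimate of occurrences per year


EVENTS = [
    ChaosEvent("server dies", "hardware", impact=4, yearly_frequency=12),
    ChaosEvent("malformed response", "software", impact=3, yearly_frequency=50),
    ChaosEvent("traffic spike", "non-failure", impact=5, yearly_frequency=4),
    ChaosEvent("scaling event", "non-failure", impact=2, yearly_frequency=100),
]


def by_impact(events: list[ChaosEvent]) -> list[ChaosEvent]:
    """Order events from highest to lowest potential impact."""
    return sorted(events, key=lambda e: e.impact, reverse=True)


def by_frequency(events: list[ChaosEvent]) -> list[ChaosEvent]:
    """Order events from most to least frequent."""
    return sorted(events, key=lambda e: e.yearly_frequency, reverse=True)
```

With these made-up numbers the two orderings disagree: the traffic spike ranks first by impact, the scaling event first by frequency.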

Run Experiments in Production

Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic.
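
Sampling real traffic into a control group and an experimental group can be done with a stable hash of some request identifier, so that the same request always lands in the same group. The sketch below assumes a string `request_id` is available and that 1% of traffic per group is acceptable; both the function name and the percentages are hypothetical.

```python
import hashlib


def assign_group(request_id: str, experiment_percent: float = 1.0,
                 control_percent: float = 1.0) -> str:
    """Map a request ID to 'experiment', 'control', or 'untouched' using a stable hash."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Reduce the hash to a bucket in [0, 100) with 0.01 resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100
    if bucket < experiment_percent:
        return "experiment"
    if bucket < experiment_percent + control_percent:
        return "control"
    return "untouched"
```

Because the assignment depends only on the identifier, it needs no shared state between servers, and the two groups are drawn from the same live traffic.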

Automate Experiments to Run Continuously

Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
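
A minimal sketch of what orchestration plus analysis might look like: each experiment is modeled as a callable that returns whether steady state held, and a runner executes the whole suite repeatedly and records which rounds failed. The `run_suite` name and the callable interface are assumptions for this example; a real setup would add scheduling, reporting, and alerting.

```python
from typing import Callable, Dict, List

Experiment = Callable[[], bool]  # returns True when steady state held


def run_suite(experiments: Dict[str, Experiment], rounds: int) -> Dict[str, List[int]]:
    """Run every experiment for `rounds` rounds; return the rounds in which each one failed."""
    failures: Dict[str, List[int]] = {name: [] for name in experiments}
    for round_number in range(1, rounds + 1):
        for name, experiment in experiments.items():
            try:
                held = experiment()
            except Exception:
                held = False  # a crashing experiment counts as a failed hypothesis
            if not held:
                failures[name].append(round_number)
    return failures
```

An experiment that only fails in some rounds still shows up in the result, which is the kind of finding that occasional manual runs tend to miss.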

Minimize Blast Radius

Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained.
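
One possible containment mechanism is an automatic abort check that stops an experiment once the experimental group's error rate drifts too far above the control group's. The sketch below is an assumption-laden example: the `BlastRadiusGuard` class, the 5-percentage-point threshold, and the minimum sample size are hypothetical values, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class BlastRadiusGuard:
    max_error_rate_delta: float = 0.05  # hypothetical abort threshold
    min_samples: int = 100              # hypothetical minimum before the check applies

    def should_abort(self, control_errors: int, control_total: int,
                     experiment_errors: int, experiment_total: int) -> bool:
        """Return True when the experiment should be halted."""
        if min(control_total, experiment_total) < self.min_samples:
            return False  # too little data to compare the groups
        control_rate = control_errors / control_total
        experiment_rate = experiment_errors / experiment_total
        return experiment_rate - control_rate > self.max_error_rate_delta
```

Ignoring small samples avoids aborting on noise, but it also delays the reaction to a severe failure, so a stricter design might pair this check with a hard cap on the number of affected requests.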

Chaos Engineering is a powerful practice that is already changing how software is designed and engineered at some of the largest-scale operations in the world. Where other practices address velocity and flexibility, Chaos specifically tackles systemic uncertainty in these distributed systems. The Principles of Chaos provide confidence to innovate quickly at massive scales and give customers the high quality experiences they deserve.

Join the ongoing discussion of the Principles of Chaos and their application in the Chaos Community.

Translation (for reference only)

Last updated: March 2019 (changes)

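Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production.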

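The development of large-scale distributed software systems is changing the rules of the game for software engineering. As an industry, we are quick to adopt practices that increase the flexibility of development and the speed of deployment. On the heels of these benefits comes an urgent question: how much confidence can we have in the complex systems we put into production?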

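Even when every individual service in a distributed system is functioning properly, the interactions between those services can lead to unpredictable outcomes. These unpredictable outcomes, combined with rare but disruptive real-world events that affect production environments, make such distributed systems inherently chaotic.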

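We need to identify weaknesses before they show up as system-wide, aberrant behavior. Systemic weaknesses can take the form of: improper fallback settings when a service is unavailable; retry storms caused by improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; and so on. We must proactively address the most significant weaknesses before they affect our customers in production. We need a way to manage the chaos inherent in these systems, to take advantage of increasing flexibility and velocity, and to have confidence in our production deployments despite the complexity they represent.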

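An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.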

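Chaos in Practice

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as facilitating experiments that uncover systemic weaknesses. These experiments follow four steps: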

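1. Start by defining "steady state" as some measurable output of a system that indicates normal behavior.

2. Hypothesize that this steady state will continue in both the control group and the experimental group.

3. Introduce variables that reflect real-world events, such as server crashes, hard drive failures, and severed network connections.

4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.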

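The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we have a target for improvement before that behavior manifests in the system at large.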

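Advanced Principles

The following principles describe an ideal application of Chaos Engineering to the experimentation process described above. The degree to which these principles are pursued correlates strongly with the confidence we can have in a distributed system at scale.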

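Build a Hypothesis around Steady State Behavior

Focus on the measurable output of a system rather than on its internal attributes. Measurements of that output over a short period of time serve as a proxy for the system's steady state. The overall system's throughput, error rates, latency percentiles, and so on can all be metrics of interest that represent steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.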

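Vary Real-world Events

Chaos variables reflect real-world events. Prioritize events by potential impact or by estimated frequency. Consider events that correspond to hardware failures such as servers dying, software failures such as malformed responses, and non-failure events such as a traffic spike or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.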

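Run Experiments in Production

Systems behave differently depending on environment and traffic patterns. Because the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both the authenticity of the way the system is exercised and the relevance to the currently deployed system, Chaos strongly prefers to experiment directly on production traffic.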

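Automate Experiments to Run Continuously

Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.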

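Minimize Blast Radius

Experimenting in production can cause unnecessary pain for customers. While some short-term negative impact must be allowed for, it is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimized and contained.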

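Chaos Engineering is a powerful practice that is already changing how software is designed and engineered at some of the largest-scale operations in the world. Where other practices address velocity and flexibility, Chaos specifically tackles systemic uncertainty in these distributed systems. The Principles of Chaos provide the confidence to innovate quickly at massive scale and give customers the high-quality experiences they deserve.

Join the ongoing discussion of the Principles of Chaos and their application in the Chaos Community.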

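This article is part of the Tencent Cloud self-media sharing program and is shared from a WeChat official account.

Originally published: 2022-03-04. In case of infringement, contact cloudcommunity@tencent.com for removal.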

This article is shared from the WeChat official account 软件测试培训.
