云原生应用需要处理 云中很容易出现瞬时故障。原因在以下文档 暂时性故障处理[1] 中有具体说明。
任何环境、任何平台或操作系统以及任何类型的应用程序都会发生暂时性故障。 在本地基础结构上运行的解决方案中,应用程序及其组件的性能与可用性通常是通过昂贵但通常很少使用的硬件冗余来维持的,并且组件与资源的位置互相靠近。 尽管此方法可以大大减少故障,但可能仍会导致暂时性故障,甚至是不可预见的事件(例如外部电源或网络问题或其他灾难性的状况)造成的中断。
云托管(包括私有云系统)可以通过跨许多商品计算节点使用共享资源、冗余、自动故障转移和动态资源分配,提供更高的整体可用性。 但是,这些环境的性质意味着更可能发生暂时性故障。 原因包括:
在许多情况下,恢复和切换是在云内部完成的。如果调用者等待一段时间,然后重试,那么它很有可能会成功。因此,建议[2]在应用程序中加入重试等提高弹性的机制。
Dapr 的诞生是为了减轻开发人员开发云原生应用程序的负担。应用程序开发人员很自然地会想,“我想知道 Dapr 是否会处理与弹性相关的问题。”
即将发布的Dapr 1.7 版本 有一个组织良好的 [提案] 跨所有构建块的弹性策略[3],这篇文章目的是总结 前面介绍介绍的问题,以及为做Dapr 技术选型的同仁提供参考,此外,我在写作的另一个目的也是希望这将是你对 Dapr 的弹性产生兴趣的机会。
Dapr 当前版本是1.6,Dapr 运行时在边车(sidecar)调用端的实现是是上面这个提案要清理的主题,调用者的可恢复性需要利用Kubernetes的各种恢复功能等基础设施来保证。
调用模式分为服务到服务调用和组件调用。
如果服务调用失败,每次调用重试[4]的回退间隔是 1 秒,最多重试三次。 通过 gRPC 连接目标 sidecar 的超时时间为5秒。
组件调用取决于每个实现
至此,大家看到了问题吧,上面这个提案就是要解决下面这几个问题。
上述提案[3]讨论了解决问题的方向以及如何进行。
虽然提高弹性的机制是非常好的,但它也包括风险,比如通过重试重复写入是典型的,假如你的接口不是幂等的。提案里面考虑到两个阶段来实施,或许第一阶段在1.7版本发布后会有很多反馈。预计到这一点,阶段 1 将重点放在基本功能上。
阶段1
在第 1 阶段,建议使用以下自定义资源:请参阅 https://github.com/dapr/dapr/issues/3586
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
name: daprResiliency
# Like in the Subscriptions CRD, scopes lists the Dapr App IDs that this
# configuration applies to.
scopes:
- app1
- app2
spec:
#------------------------------------------------------------------------------
# PHASE 1: Basic policy definition and applying policies to building blocks
#------------------------------------------------------------------------------
policies:
# Timeouts are simple named durations.
timeouts:
general: 5s
important: 60s
largeResponse: 10s
# Retries are named templates for and are instantiated for life of the operation.
retries:
general: {} # Sane defaults
pubsubRetry:
policy: constant
duration: 5s
maxRetries: 10
retryForever:
policy: exponential
maxInterval: 15s
maxRetries: 0 # Retry indefinitely
important:
policy: constant
duration: 5s
maxRetries: 30
someOperation:
policy: exponential
maxInterval: 15s
largeResponse:
policy: constant
duration: 5s
maxRetries: 3
# Circuit breakers are automatically instantiated per component, service endpoint, and application route.
# using these settings as a template. See logic below under `buildingBlocks`.
# Circuit breakers maintain counters that can live as long as the Dapr sidecar.
circuitBreakers:
general: {} # Sane defaults
pubsubCB:
maxRequests: 1
interval: 8s
timeout: 45s
trip: consecutiveFailures > 8
# This section specifies default policies for:
# * service invocation
# * requests to components
# * events sent to routes
buildingBlocks:
services:
appB:
timeout: general
retry: general
# Circuit breakers for services are scoped per endpoint (e.g. hostname + port).
# When a breaker is tripped, that route is removed from load balancing for the configured `timeout` duration.
circuitBreaker: general
actors:
myActorType:
timeout: general
retry: general
# Circuit breakers for actors are scoped by type, id, or both.
# When a breaker is tripped, that type or id is removed from the placement table for the configured `timeout` duration.
circuitBreaker: general
circuitBreakerScope: both
circuitBreakerCacheSize: 5000
components:
# For state stores, policies apply to saving and retrieving state.
# Watching, which is not implemented yet, is out of scope.
statestore1:
timeout: general
retry: general
# Circuit breakers for components are scoped per component configuration/instance (e.g. redis1).
# When this breaker is tripped, all interaction to that component is prevented for the configured `timeout` duration.
circuitBreaker: general
# For Pub/Sub, policies apply only to publishing.
# Subscribing/consuming is handled by routes.
pubsub1:
retry: pubsubRetry
circuitBreaker: pubsubCB
pubsub2:
retry: pubsubRetry
circuitBreaker: pubsubCB
# Routes represent the application's URI/paths that receive incoming events from both
# PubSub and Input binding components. The route values correspond to the value configured
# in the Subscription configuration or programmatic call. For input bindings, this is
# configured in the component metadata via #3566.
routes:
'dsstatus.v3':
timeout: general
retry: general
circuitBreaker: general
#------------------------------------------------------------------------------
# PHASE 2: Overriding policies for specific Dapr API operations
#------------------------------------------------------------------------------
apis:
invoke:
- match: appId == "appB"
# Nested rules: Prevent duplicative checks in rules.
# Its likely that controler-gen does not support this
# but apiextensionsv1.JSON can be used as workaround.
rules:
- match:
request.method == "GET" &&
request.metadata.count > 1000
timeout: largeResponse
retry: largeResponse
- match:
request.path == "/someOperation"
retry: someOperation
publish:
- match: |
event.type == "important.event.v1"
timeout: important
retry: important
# subscribe is when Dapr attempts to deliver a pubsub event to an application route.
subscribe:
- match: |
event.type == "important.event.v1"
timeout: important
retry: retryForever
Default resiliency.yml
that could be installed by dapr init
. Might include commented out examples for Redis.
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
name: daprResiliency
scopes:
spec:
policies:
# Timeouts are simple named durations.
timeouts:
general: 5s
important: 60s
largeResponse: 10s
# Retries are named templates for and are instantiated for life of the operation.
retries:
general: {} # Sane defaults
# Circuit breakers are automatically instantiated per component, service endpoint, and application route.
# using these settings as a template. See logic below under `buildingBlocks`.
# Circuit breakers maintain counters that can live as long as the Dapr sidecar.
circuitBreakers:
general: {} # Sane defaults
# This section specifies default policies for:
# * service invocation
# * requests to components
# * events sent to routes
buildingBlocks:
services:
actors:
components:
# Routes represent the application's URI/paths that receive incoming events from both
# PubSub and Input binding components. The route values correspond to the value configured
# in the Subscription declarative or programmatic configurations.
routes:
apis:
invoke:
publish:
subscribe:
定义超时、重试和断路器策略,并将它们分配给构成构建块的服务和组件。您可以根据您的目的创建策略,例如确定超时之前的时间以及固定/成倍增加的重试间隔。
目前,阶段 1 的目标是在计划于2022/3发布的 Dapr v1.7 中发布。当然,您必须支持每个组件包才能使用此自定义资源,但这是重要的第一步。第二阶段根据第 1 阶段的反馈,此时计划使用特定于 API 的策略定义的覆盖。
基于此,如果你现在使用 Dapr,我认为你应该注意以下几点。
从Dapr 1.0 发布以来的每个版本都在不断改进Dapr 的各项功能,Dapr 大约每两个月进行一次次要版本升级,这个月将发布1.7 版本,具体参见 https://github.com/dapr/dapr/issues/4170。
[2]可靠性模式: https://docs.microsoft.com/zh-cn/azure/architecture/framework/resiliency/reliability-patterns
[3][提案] 跨所有构建块的弹性策略: https://github.com/dapr/docs/issues/2059