Opentelemetry——Sampling

方亮

发布于 2024-05-24 19:23:37

4430

文章被收录于专栏：方亮方亮

大纲

*Sampling*
*Terminology*
- *Head Sampling*
*Tail Sampling*
*Support*
- *Collector*
- *Language SDKs*

Sampling

采样

Learn about sampling, and the different sampling options available in OpenTelemetry. 了解采样以及 OpenTelemetry 中提供的不同采样选项。

With distributed tracing, you observe requests as they move from one service to another in a distributed system. It’s superbly practical for a number of reasons, such as understanding your service connections and diagnosing latency issues, among many other benefits. 使用分布式跟踪，您可以观察请求在分布式系统中从一个服务到另一个服务的传递情况。出于多种原因，这对于理解服务连接和诊断延迟问题非常实用。

However, if the majority of all your requests are successful 200s and finish without unacceptable latency or errors, do you really need all that data? Here’s the thing—you don’t always need a ton of data to find the right insights. You just need the right sampling of data. 但是，如果您的所有请求中的大多数都在 200 秒内成功完成并且没有出现不可接受的延迟或错误，那么您真的需要所有这些数据吗？事情是这样的——你并不总是需要大量数据才能找到正确的洞察力。您只需要正确的数据采样。

The idea behind sampling is to control the spans you send to your observability backend, resulting in lower ingest costs. Different organizations will have their own reasons for not just why they want to sample, but also what they want to sample. You might want to customize your sampling strategy to: 采样的核心思想是控制您发送到可观测后端的Span，从而降低摄取成本。不同的组织不仅有自己的采样原因，而且也有自己想要采样的对象。您可能需要自定义采样策略以实现以下目标：

Manage costs: If you have a high volume of telemetry, you risk incurring heavy charges from a telemetry backend vendor or cloud provider to export and store every span. 管理成本：如果您有大量遥测数据，则可能会因导出和存储每个Span数据，而被遥测后端供应商或云提供商收取高额费用。
Focus on interesting traces: For example, your frontend team may only want to see traces with specific user attributes. 关注有趣的Trace：例如，您的前端团队可能只想查看具有特定用户属性的Trace。
Filter out noise: For example, you may want to filter out health checks. 过滤噪音：例如，您可能想要过滤掉健康检查。

Terminology

术语

It’s important to use consistent terminology when discussing sampling. A trace or span is considered “sampled” or “not sampled”: 在讨论采样时使用一致的术语非常重要。Trace或Span被视为“已采样”或“未采样”：

Sampled: A trace or span is processed and exported. Because it is chosen by the sampler as a representative of the population, it is considered “sampled”. 已采样：处理并导出Trace或Span。因为它是由采样器选择作为总体的代表，所以它被认为是“已采样”。
Not sampled: A trace or span is not processed or exported. Because it is not chosen by the sampler, it is considered “not sampled”. 未采样：未处理或未导出的Trace或Span。因为它不是由采样器选择的，所以被认为是“未采样”。

Sometimes, the definitions of these terms get mixed up. You may find someone state that they are “sampling out data” or that data not processed or exported is considered “sampled”. These are incorrect statements. 有时，这些术语的定义会混淆。您可能会发现有人声称他们正在“采样数据”，比如未处理或导出的数据被视为“采样”。这些都是不正确的说法。

Head Sampling

头部采样

Head sampling is a sampling technique used to make a sampling decision as early as possible. A decision to sample or drop a span or trace is not made by inspecting the trace as a whole. 头部采样是一种用于尽早做出采样决定的采样技术。对Span或Trace进行采样或删除的决定不是通过检查整个Trace来做出的。

For example, the most common form of head sampling is Consistent Probability Sampling. It may also be referred to as Deterministic Sampling. In this case, a sampling decision is made based on the trace ID and a desired percentage of traces to sample. This ensures that whole traces are sampled - no missing spans - at a consistent rate, such as 5% of all traces. 例如，最常见的头部采样形式是一致概率采样。它也可以称为确定性采样。在这种情况下，采样决策是根据Trace ID和所需的采样Trace百分比做出的。这可确保对整个Trace保持一致的速率（例如所有Trace的 5%）进行采样，且不会丢失Span。

The upsides to head sampling are: 头部采样的优点是：

Easy to understand 容易明白
Easy to configure 易于配置
Efficient 高效的
Can be done at any point in the trace collection pipeline 可以在Trace收集流程中的任何阶段完成

The primary downside to head sampling is that it is not possible make a sampling decision based on data in the entire trace. This means that head sampling is effective as a blunt instrument, but is wholly insufficient for sampling strategies that must take whole-system information into account. For example, it is not possible to use head sampling to ensure that all traces with an error within them are sampled. For this, you need Tail Sampling. 头部采样的主要缺点是无法根据整个Tace中的数据做出采样决策。这意味着首部采样在某种程度上会有效，但对于必须考虑整个系统信息的采样策略来说完全不够。例如，不可能使用头部采样来确保对其中有错误的所有Trace进行采样。为此，您需要尾部采样。

Tail Sampling

尾部取样

Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail Sampling gives you the option to sample your traces based on specific criteria derived from different parts of a trace, which isn’t an option with Head Sampling. 尾部采样是通过考虑Trace内的全部或大部分Span来决定对Trace哪些地方的Span进行采样的方法。尾部采样允许您根据Trace的不同部分派生出的特定条件对Trace进行采样，而头部采样则无法提供此选项。

Some examples of how you can use Tail Sampling include: 使用尾部采样的一些示例包括：

Always sampling traces that contain an error 始终对包含错误的Trace进行采样
Sampling traces based on overall latency 基于总体延迟的的Trace采样
Sampling traces based on the presence or value of specific attributes on one or more spans in a trace; for example, sampling more traces originating from a newly deployed service 根据Trace中一个或多个Span上特定属性的存在与否或值对Trace进行采样；例如，对源自新部署的服务的更多Trace进行采样
Applying different sampling rates to traces based on certain criteria 根据某些条件对Trace采用不同的采样率

As you can see, tail sampling allows for a much higher degree of sophistication. For larger systems that must sample telemetry, it is almost always necessary to use Tail Sampling to balance data volume with usefulness of that data. 正如您所看到的，尾部采样可以实现更高程度的复杂性。对于必须对Telemetry数据进行采样的大型系统，几乎总是需要使用尾部采样来平衡数据量和数据的有用性。

There are three primary downsides to tail sampling today: 目前尾部采样存在三个主要缺点：

Tail sampling can be difficult to implement. Depending on the kind of sampling techniques available to you, it is not always a “set and forget” kind of thing. As your systems change, so too will your sampling strategies. For a large and sophisticated distributed system, rules that implement sampling strategies can also be large and sophisticated. 尾部采样可能很难实现。根据您可用的采样技术类型，它并不总是“设置后忘记”的事情。随着您的系统发生变化，您的采样策略也会发生变化。对于大型且复杂的分布式系统，实现采样策略的规则也可能庞大且复杂。
Tail sampling can be difficult to operate. The component(s) that implement tail sampling must be stateful systems that can accept and store a large amount of data. Depending on traffic patterns, this can require dozens or even hundreds of nodes that all utilize resources differently. Furthermore, a tail sampler may need to “fall back” to less computationally-intensive sampling techniques if it is unable to keep up with the volume of data it is receiving. Because of these factors, it is critical to monitor tail sampling components to ensure that they have the resources they need to make the correct sampling decisions. 尾部取样可能很难操作。实现尾部采样的组件必须是可以接受和存储大量数据的有状态系统。根据流量模式，这可能需要数十个甚至数百个节点，这些节点都以不同的方式利用资源。此外，如果尾部采样器无法跟上正在接收的数据量，则可能需要“回退”到计算强度较低的采样技术。由于这些因素，监控尾部采样组件以确保它们拥有做出正确采样决策所需的资源至关重要。
Tail samplers often end up being in the domain of vendor-specific technology today. If you’re using a paid vendor for Observability, the most effective tail sampling options available to you may be limited to what the vendor offers. 如今，尾部采样器通常最终属于特定于供应商的技术领域。如果您使用付费供应商来实现可观测性，那么您可用的最有效的尾部采样选项可能仅限于该供应商提供的选项。

Finally, for some systems, tail sampling may be used in conjunction with Head Sampling. For example, a set of services that produce an extremely high volume of trace data may first use head sampling to only sample a small percentage of traces, and then later in the telemetry pipeline use tail sampling to make more sophisticated sampling decisions before exporting to a backend. This is often done in the interest of protecting the telemetry pipeline from being overloaded. 最后，对于某些系统，尾部采样可以与头部采样结合使用。例如，生成大量跟踪数据的一组服务可能首先使用头部采样仅对一小部分跟踪进行采样，然后在Telemetry管道中使用尾部采样做出更复杂的采样决策，然后再导出到后端。这样做通常是为了防止Telemetry管道过载。

Support

Collector

The OpenTelemetry Collector includes the following sampling processors: OpenTelemetry Collector 包括以下采样处理器：

Probabilistic Sampling Processor 概率采样处理器
Tail Sampling Processor 尾部采样处理器

Language SDKs

For the individual language specific implementations of the OpenTelemetry API & SDK you will find support for sampling at the respective documentation pages: 对于 OpenTelemetry API 和 SDK 的各个语言特定实现，您可以在相应的文档页面找到采样支持：