Observability Platform - Technical Selection Analysis

Original

行者深蓝

Published 2023-12-06 18:07:53

Included in the column: 云原生应用工坊 (Cloud Native Application Workshop)

Observability

Image Reference : https://mp.weixin.qq.com/s/nAF3lv-qZprLWvOdvSbYXg

Observability refers to the extent to which a system's internal states can be inferred from its external outputs. In control theory, where the term originates, observability and controllability are dual concepts.

In modern software systems and cloud computing, observability plays an increasingly important role in ensuring the reliability, performance, and security of applications and infrastructure. As software systems become more complex, with widespread adoption of microservices and increasing reliance on distributed architectures, the importance of observability becomes more pronounced.

Observability mainly includes the following aspects:

Logs: Logs are records of system events collected during system operation, including errors, warnings, and information. Logs can provide detailed information about the internal state of the system, such as system startup and shutdown, resource usage, errors, and exceptions.
Metrics: Metrics are statistical data about system performance collected during operation, such as CPU usage, memory usage, and network traffic. Metrics provide an overview of the system's operational status, such as overall health and performance bottlenecks.
Tracing: Tracing is the full-path tracking of requests and responses in the system, which can help analyze performance bottlenecks, errors, and other issues.
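To make the three signal types concrete, here is a minimal sketch using only the Python standard library. The in-memory sinks (`LOGS`, `METRICS`, `SPANS`) and the function names are hypothetical and chosen for illustration; a real deployment would export this data to a backend rather than keep it in process memory.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks (assumption): stand-ins for real backends.
LOGS = []
METRICS = Counter()
SPANS = []


def log_event(level, message, **fields):
    """Log: a timestamped record of one discrete event."""
    record = {"ts": time.time(), "level": level, "message": message, **fields}
    LOGS.append(json.dumps(record))


def count(name, value=1):
    """Metric: a numeric value aggregated over time."""
    METRICS[name] += value


@contextmanager
def span(name, trace_id, parent_id=None):
    """Trace: one timed step of a request, linked to its parent by ID."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        # The span is recorded when the step ends, so children finish first.
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(user):
    """Emit all three signals for a single (simulated) request."""
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id) as root_id:
        count("requests_total")
        with span("load_profile", trace_id, parent_id=root_id):
            log_event("INFO", "profile loaded", user=user, trace_id=trace_id)
    return trace_id
```

Calling `handle_request("alice")` leaves one log line, one counter increment, and two spans that share a trace ID. Carrying the trace ID in the log record is what lets a log line be correlated with the request path it belongs to.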

Observability tools help system administrators and developers collect and analyze the above data, thus improving understanding and control of the system.

Applications of observability include:

Troubleshooting: Observability tools can help quickly locate and resolve system faults.
Performance Optimization: Observability tools can help identify performance bottlenecks for optimization.
Security Monitoring: Observability tools can help monitor the security status of the system to prevent security incidents.

Evolution of Monitoring

From the era of monolithic applications to the era of microservices, the dimensions of monitoring data (metrics, logs, traces) have evolved as follows:

Monolithic Applications

In the era of monolithic applications, applications were typically deployed as a single unit on a server. Monitoring data therefore came from essentially one source: server-level metrics such as CPU, memory, and network usage.

SOA Applications

In the SOA era, applications were split into multiple independent services, each of which could be developed, deployed, and managed independently. Monitoring data therefore became more complex, because the resource usage and performance metrics of each service had to be tracked.

Microservice Applications

In the era of microservices, applications are split into even finer-grained services, each typically responsible for a specific business function. Monitoring data therefore covers even more ground: the resource usage, performance metrics, and request traces of each microservice.

The dimensions of metrics, logs, and tracing in different eras are summarized as follows:

Era	Metrics	Logs	Tracing
Monolithic	Server resource usage, etc.	Application logs	None
SOA	Service resource usage, etc.	Service logs	Service call links
Microservices	Microservice resource usage, etc.	Microservice logs	Microservice call links

As application architectures evolve, the dimensions of monitoring data have grown increasingly extensive, posing higher demands on the design and implementation of monitoring systems. Monitoring systems must be capable of collecting, storing, and analyzing monitoring data from various sources and dimensions, providing comprehensive support for application maintenance.

Resource Monitoring vs. Application Observability

Traditional resource-focused monitoring primarily addresses the operational status of systems, including overall health and performance bottlenecks. Traditional resource monitoring typically uses metrics to measure system status, such as CPU usage, memory usage, network traffic, etc.

Application observability, on the other hand, focuses not only on system status but also on application business logic and data. Application observability typically uses logs, tracing, and other technologies to collect and analyze data produced during application runtime.

Differences

Aspect	Traditional Resource Monitoring	Application Observability
Focus	System operational status	Application status, business logic, data
Data Sources	Metrics	Metrics, logs, tracing

Relationship

Traditional resource monitoring is a part of application observability. Application observability needs to collect and analyze system status metrics, often provided by traditional resource monitoring.

Scope

Traditional resource monitoring is typically limited to the system level, such as servers, containers, databases, etc. Application observability can extend to the application level, including business logic, data, etc.

In summary, resource monitoring and application observability are related but distinct concepts. Traditional resource monitoring is a part of application observability, providing a foundation for it. Application observability can extend to the application level, supporting analysis of business logic and data.

System Monitoring vs. Application Observability

System monitoring primarily focuses on the operational status of systems, including overall health and performance bottlenecks. System monitoring typically uses metrics to measure system status, such as CPU usage, memory usage, network traffic, etc.

Application observability, in contrast, focuses not only on system status but also on application business logic and data. Application observability typically uses logs, tracing, and other technologies to collect and analyze data produced during application runtime.

The differences between system monitoring and application observability can be summarized as follows:

Aspect	System Monitoring	Application Observability
Analysis Purpose	Fault localization, performance optimization	Fault localization, performance optimization, business logic analysis, data understanding
Monitoring Metrics	CPU, Memory, Usage, Load	SLOs, SLIs, Time measurements, Event measurements, Availability

For example, SLOs are the service level objectives of an application, SLIs are the measured indicators against which those objectives are evaluated, time and event measurements help analyze business logic, and availability shows how much of the time the application actually serves its users.
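The relationship between an SLI and an SLO can be sketched in a few lines. The function names and the request counts below are assumptions made for illustration; the article does not prescribe a particular formula, and this is the common "good events over total events" form.

```python
def availability_sli(good_events, total_events):
    """SLI: the measured fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the failure rate the SLO tolerates: 1 - slo_target.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        return 0.0  # a 100% target leaves no budget at all
    spent = 1.0 - sli
    return 1.0 - spent / budget


def meets_slo(sli, slo_target):
    """True when the measured indicator satisfies the objective."""
    return sli >= slo_target


# Hypothetical numbers: 100,000 requests, 50 of which failed, against a 99.9% SLO.
sli = availability_sli(good_events=99_950, total_events=100_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
```

With these numbers the service meets its objective and has used about half of its error budget. The SLO is the target and the SLI is the measurement compared against it.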

Suggestions for addressing the evolution of monitoring data include:

Adopting distributed monitoring systems to handle the growth of monitoring data.
Using data analysis techniques to extract valuable information from monitoring data, enhancing efficiency and effectiveness.
Employing automation tools to reduce manual intervention and improve monitoring automation.

Evolution of Monitoring Data Storage Methods

As application architectures evolve, the methods for storing monitoring data have also changed. In the era of monolithic applications, file storage was sufficient for monitoring data needs. In the SOA and microservices eras, distributed stores such as time-series databases (TSDB) and NoSQL databases are required. In the future, as monitoring data volumes and analytical demands grow, emerging database technologies such as graph databases will play an increasingly important role in monitoring data storage.

Storage Comparison

Storage Method	Data Model	Storage Efficiency	Query Efficiency	Suitable Data Types	Applicable Scenarios	Characteristics and Limitations
File Storage	Unstructured	Low	Low	All	Simple Data Storage	Complex Data Management, Poor Scalability
SQLDB	Relational	High	High	Structured	Data Analysis	Poor at Storing Unstructured Data, Limited Horizontal Scaling
TSDB	Time-Series	High	High	Time-Series Data	Monitoring Metrics	Poor at Storing Unstructured Data, Limited Data Types Supported
NoSQL	Non-Relational	High	Low to High	All	Diverse Data Storage	Flexible Data Model, Less Efficient Queries than Relational Databases
Row Database	Row	High	High	Structured	Log Data	Flexible Data Model
Column Database	Column	High	High	Unstructured	Link Tracing Data	Flexible Data Model
Graph Database	Graph	High	High	Relational Data	Application Topology	Flexible Data Model

Monitoring System Technology Selection

Monitoring System	Metric Data	Log Data	Link Tracing Data
Nagios	File Storage	File Storage	Not Supported
Zabbix	SQLDB	SQLDB	Not Supported
Prometheus	TSDB	Not Supported	Not Supported
Observability Platform	TSDB	NoSQL	NoSQL/Graph Database

Selection Recommendations

Metric Data: TSDB is the best choice for storing metric data due to its high performance, reliability, and scalability.
Log Data: NoSQL databases are best for storing log data, offering flexible storage structures and high scalability.
Link Tracing Data: NoSQL and graph databases are well suited to link tracing data, which is highly relational (call chains and service dependencies) and must be both stored and analyzed.
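The recommendations above amount to routing each signal type to a different kind of store. A minimal sketch of that routing follows; the `Signal` enum, the backend labels, and the `Router` class are hypothetical names, and the lists stand in for real database clients.

```python
from enum import Enum


class Signal(Enum):
    METRIC = "metric"
    LOG = "log"
    TRACE = "trace"


# Mapping taken from the recommendations above; the labels are illustrative.
STORAGE_FOR_SIGNAL = {
    Signal.METRIC: "tsdb",
    Signal.LOG: "nosql",
    Signal.TRACE: "nosql_or_graph",
}


class Router:
    """Sends each record to the storage class recommended for its signal type."""

    def __init__(self):
        # One in-memory list per backend (assumption: placeholder for a client).
        self.backends = {name: [] for name in set(STORAGE_FOR_SIGNAL.values())}

    def write(self, signal, record):
        backend = STORAGE_FOR_SIGNAL[signal]
        self.backends[backend].append(record)
        return backend


router = Router()
router.write(Signal.METRIC, {"name": "cpu_usage", "value": 0.42})
router.write(Signal.LOG, {"level": "ERROR", "message": "timeout"})
router.write(Signal.TRACE, {"trace_id": "abc", "span_id": "1"})
```

Keeping the mapping in one table means a later change of backend, for example moving logs to a columnar store, touches a single line.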

Advantages of Column and Graph Databases

Column and graph databases have become mainstream choices due to their storage efficiency and scalability. In the realm of AI-assisted monitoring (AIGC), vector databases play a crucial role.

Building an Open Source Observability Platform

Combine different software components to build an observability platform tailored to specific needs.

Open Source Observability Platform Software Combinations

Data Storage: Columnar, graph, and vector databases, such as ClickHouse, Neo4j, and VectorDB.
Metric Data Collection: Tools like OpenTelemetry, Prometheus.
Visualization: Tools like Grafana.
Alerting: Tools like AlertManager.
Fault Diagnosis: Tools like DeepFlow.

Components

ClickHouse: Columnar database for storing metric, log, and link tracing data.
Neo4j: Graph database for storing complex link topologies and dependencies.
VectorDB: Vector database for AI engine analysis.
PromQL and LogQL: Query languages for Prometheus and Loki, respectively.
OpenTelemetry: Standard for generating, collecting, and exporting telemetry data, including link tracing data.
Grafana: Visualization tool.
AlertManager: Alerting system.
DeepFlow: Fault diagnosis tool.
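The "link topologies and dependencies" that a graph database would hold can be derived from trace spans. The sketch below builds that topology in plain Python; the span fields (`span_id`, `parent_id`, `service`) and the service names are assumptions for illustration, and no Neo4j API is used.

```python
from collections import defaultdict


def build_topology(spans):
    """Return {caller_service: {callee_service: call_count}} from parent links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(lambda: defaultdict(int))
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        if parent is None:
            continue  # root span, or parent missing from this batch
        if parent["service"] != s["service"]:
            # Only cross-service calls become edges in the topology.
            edges[parent["service"]][s["service"]] += 1
    return {caller: dict(callees) for caller, callees in edges.items()}


def downstream(topology, service):
    """All services reachable from `service`, found by iterative depth-first search."""
    seen, stack = set(), [service]
    while stack:
        current = stack.pop()
        for callee in topology.get(current, {}):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


# Hypothetical spans from a single request.
spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway"},
    {"span_id": "2", "parent_id": "1", "service": "orders"},
    {"span_id": "3", "parent_id": "2", "service": "payments"},
    {"span_id": "4", "parent_id": "2", "service": "inventory"},
]
topology = build_topology(spans)
affected = downstream(topology, "gateway")
```

A reachability query such as `downstream` is the kind of question a graph database answers natively. It is useful for estimating which services a failing dependency can affect.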

References

Open Source Observability Platform Solutions: https://cloud.tencent.com/developer/article/2363793
Open Source Observability Platform Solutions - Operations Manual: https://cloud.tencent.com/developer/article/2363815

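Original content statement: This article is published in the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.

Open source software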
