Components and Common Terminology in Apache Airflow

DevOps云学堂
Published 2022-12-29 17:13:43
Part of the column: DevOps持续集成 (DevOps Continuous Integration)

Components in Apache Airflow

Airflow's functionality comes from the interaction of its components. The architecture can vary with the use case, so a deployment can scale flexibly from a single machine to an entire cluster. The graphic shows a multi-node architecture with several machines.

[Figure: multi-node Apache Airflow architecture with several machines]

A scheduler, together with its attached executor, tracks and triggers the stored workflows. The scheduler keeps track of which task can be executed next, while the executor selects a worker and handles the subsequent communication with it. Since Apache Airflow 2.0, multiple schedulers can run at the same time, which reduces latency when the number of tasks is particularly large.
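The division of labor between scheduler, executor, and worker can be illustrated with a toy model built only on the Python standard library. This is a sketch of the idea, not Airflow's implementation: `WORKFLOW`, `run_task`, and `run_workflow` are hypothetical names, the "scheduler" is a `graphlib.TopologicalSorter` that reports which tasks are ready, and the "executor" is a thread pool that hands each ready task to a worker thread.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_task(name):
    """Stand-in for a worker executing the stored command of one task."""
    return name


def run_workflow(workflow, max_workers=2):
    """Run all tasks in dependency order and return them in completion order."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:  # executor role
        while sorter.is_active():
            # Scheduler role: find out which tasks can be executed next.
            for name in sorter.get_ready():
                running.add(pool.submit(run_task, name))
            # Wait until at least one worker reports back.
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = future.result()
                finished.append(name)
                sorter.done(name)  # unblocks the successors of this task
    return finished


print(run_workflow(WORKFLOW))
```

In this toy model the only state is the in-memory sorter; a real deployment keeps that state in the metadata database, which is what allows several schedulers to cooperate.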

As soon as a workflow starts, a worker takes over the execution of the stored commands. For special requirements such as RAM or GPU, workers with specific environments can be selected.

The web server provides a graphical interface for easy user interaction. This component runs separately and can be omitted if required, although its monitoring functions are widely used in day-to-day operation.

Among other things, the metadata database securely stores statistics about workflow runs and connection data for external databases.

With this setup, Airflow can execute its data processes reliably. Because workflows are written in Python, it is easy to define what should run in a workflow and how. Before creating the first workflows, you should be familiar with a few terms.

Important terminology in Apache Airflow

The term DAG (Directed Acyclic Graph) comes up constantly in connection with Apache Airflow. It is the form in which Airflow represents a workflow internally. DAG is used synonymously with workflow and is probably the most central term in Airflow. Accordingly, a DAG run denotes a workflow run, and the workflow files are stored in the DAG bag. The accompanying graphic shows such a DAG, a schematic of a simple Extract-Transform-Load (ETL) workflow.
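What "directed" and "acyclic" mean for an ETL workflow can be checked with the standard-library `graphlib` module. This is a minimal sketch that does not use Airflow itself; the helper `execution_order` and the dictionaries `etl` and `cyclic` are hypothetical names chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid task order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"not a DAG, cycle found: {err.args[1]}") from err


# Each task maps to the set of tasks that must finish before it.
etl = {"transform": {"extract"}, "load": {"transform"}}
print(execution_order(etl))  # ['extract', 'transform', 'load']

# Making extract depend on load closes a cycle, so no valid order exists.
cyclic = {"extract": {"load"}, "transform": {"extract"}, "load": {"transform"}}
try:
    execution_order(cyclic)
except ValueError as err:
    print(err)
```

The edges are directed because each dependency points from a predecessor to its successor, and the graph must be acyclic because a cycle would leave every task in it waiting on another one forever.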

In Python, related tasks are combined into a DAG. Programmatically, the DAG serves as a container that keeps together the tasks, their order, and information about their execution (interval, start time, retries in case of errors, and so on). By defining the relations between tasks (predecessor, successor, parallel), even complex workflows can be modeled. A DAG can have several start and end tasks, and conditional branching is possible as well. The only restriction is that cycles are not allowed.

Within a DAG, tasks can be formulated either as operators or as sensors. Operators execute the actual commands, whereas a sensor pauses execution until a certain event occurs. Numerous community contributions specialize both basic types for particular applications. Plug-and-play operators are essential for easy integration with Amazon Web Services, Google Cloud Platform, and Microsoft Azure, among many others. The specializations range from the simple BashOperator for executing Bash commands to the GoogleCloudStorageToBigQueryOperator. The long list of available operators can be found in the GitHub repository.
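The waiting behavior of a sensor can be sketched as a simple polling loop. This is a toy model in plain Python, not Airflow's sensor implementation: `wait_for` and `file_has_arrived` are hypothetical helpers, and only the parameter names `poke_interval` and `timeout` are borrowed from the arguments that Airflow sensors accept.

```python
import time


def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Block until condition() is true; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poke_interval)  # wait before checking ("poking") again


calls = {"count": 0}


def file_has_arrived():
    """Hypothetical event check that succeeds on the third poll."""
    calls["count"] += 1
    return calls["count"] >= 3


wait_for(file_has_arrived, poke_interval=0.01)
print("event occurred after", calls["count"], "polls")
```

An operator, by contrast, would correspond to a function that simply does its work and returns; downstream tasks start only once the sensor's wait has ended.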

In the web interface, DAGs are represented graphically. In the graph view (upper figure), the tasks and their relationships are clearly visible. The border color of each task indicates its state in the selected workflow run. The tree view (following graphic) also displays past runs. Here, too, the intuitive color scheme flags possible errors directly at the affected task, and the log files can be opened with just two clicks. Monitoring and troubleshooting are definitely among Airflow's strengths.

Whether for a machine learning workflow or an ETL process, Airflow is always worth a look.

Originally published on 2022-11-22. Shared from the DevOps云学堂 WeChat official account.
