EMR supports a variety of cluster types and corresponding application scenarios, and defines five node types. The node types supported, the number of nodes deployed, and the services deployed vary by cluster type and application scenario. You can choose a cluster type and application scenario that fit your business requirements when creating a cluster.
Cluster Type Description
Hadoop Cluster
| Scenarios | Description | Node Deployment Description |
| --- | --- | --- |
| Default scenario | Based on open-source Hadoop and its ecosystem components, provides big data solutions for scenarios such as massive data storage, offline/real-time data analysis, streaming computation, and machine learning. | Master Node: the management node, ensuring normal scheduling of the cluster; mainly deploys processes such as NameNode, ResourceManager, and HMaster. Quantity: 1 in non-HA mode, 2 in HA mode. Note: when the deployed components include Kudu, the cluster supports HA mode only, with 3 Master nodes.<br>Core Node: a compute and storage node; all HDFS data is stored on Core nodes, so to ensure data security, Core nodes cannot be scaled in after being scaled out. Mainly deploys processes such as DataNode, NodeManager, and RegionServer. Quantity: ≥2 in non-HA mode, ≥3 in HA mode.<br>Task Node: a pure compute node that stores no data; the data it computes comes from Core nodes and COS, so Task nodes are often used as elastic nodes that can be scaled out or in at any time. Mainly deploys processes such as NodeManager and PrestoWork. The number of Task nodes can be changed at any time for elastic scaling, with a minimum of 0.<br>Common Node: provides data-sharing synchronization and high-availability fault tolerance for the Master nodes of an HA cluster; mainly deploys distributed coordination components such as ZooKeeper and JournalNode. Quantity: 0 in non-HA mode, ≥3 in HA mode.<br>Router Node: shares the load of the Master nodes or serves as the task submission machine for the cluster; can be scaled out or in at any time. Mainly deploys the Hadoop packages and can optionally deploy software such as Hive, Hue, and Spark. The number of Router nodes can be changed at any time, with a minimum of 0. |
| ZooKeeper | Suitable for building distributed, highly available coordination services in large clusters. | Common Node: mainly deploys the distributed coordination component ZooKeeper; the number of deployed nodes must be odd, with a minimum of 3 Common nodes, and only HA mode is supported. |
| HBase | Suitable for storing massive unstructured or semi-structured data; provides a highly reliable, high-performance, column-oriented, scalable distributed storage system for real-time data reads and writes. | Master Node: the management node, ensuring normal scheduling of the cluster; mainly deploys processes such as NameNode, ResourceManager, and HMaster. Quantity: 1 in non-HA mode, 2 in HA mode.<br>Core Node: a compute and storage node; all HDFS data is stored on Core nodes, so to ensure data security, Core nodes cannot be scaled in after being scaled out. Mainly deploys processes such as DataNode, NodeManager, and RegionServer. Quantity: ≥2 in non-HA mode, ≥3 in HA mode.<br>Task Node: a pure compute node that stores no data; the data it computes comes from Core nodes and COS, so Task nodes are often used as elastic nodes that can be scaled out or in at any time. Mainly deploys processes such as NodeManager. The number of Task nodes can be changed at any time for elastic scaling, with a minimum of 0.<br>Common Node: provides data-sharing synchronization and high-availability fault tolerance for the Master nodes of an HA cluster; mainly deploys distributed coordination components such as ZooKeeper and JournalNode. Quantity: 0 in non-HA mode, ≥3 in HA mode.<br>Router Node: shares the load of the Master nodes or serves as the task submission machine for the cluster; can be scaled out or in at any time. The number of Router nodes can be changed at any time, with a minimum of 0. |
| Presto | Provides an open-source distributed SQL query engine suited to interactive analytical queries, supporting fast query analysis over massive data. | Master Node: the management node, ensuring normal scheduling of the cluster; mainly deploys processes such as NameNode and ResourceManager. Quantity: 1 in non-HA mode, 2 in HA mode.<br>Core Node: a compute and storage node; all HDFS data is stored on Core nodes, so to ensure data security, Core nodes cannot be scaled in after being scaled out. Mainly deploys processes such as DataNode and NodeManager. Quantity: ≥2 in non-HA mode, ≥3 in HA mode.<br>Task Node: a pure compute node that stores no data; the data it computes comes from Core nodes and COS, so Task nodes are often used as elastic nodes that can be scaled out or in at any time. Mainly deploys processes such as NodeManager and PrestoWork. The number of Task nodes can be changed at any time for elastic scaling, with a minimum of 0.<br>Common Node: provides data-sharing synchronization and high-availability fault tolerance for the Master nodes of an HA cluster; mainly deploys distributed coordination components such as ZooKeeper and JournalNode. Quantity: 0 in non-HA mode, ≥3 in HA mode.<br>Router Node: shares the load of the Master nodes or serves as the task submission machine for the cluster; can be scaled out or in at any time. The number of Router nodes can be changed at any time, with a minimum of 0. |
| Kudu | Provides a distributed, scalable columnar storage manager supporting random reads/writes and OLAP analysis, for processing rapidly changing data. | Master Node: the management node, ensuring normal scheduling of the cluster; mainly deploys processes such as NameNode and ResourceManager. Quantity: 1 in non-HA mode, 2 in HA mode.<br>Core Node: a compute and storage node; all HDFS data is stored on Core nodes, so to ensure data security, Core nodes cannot be scaled in after being scaled out. Quantity: ≥2 in non-HA mode, ≥3 in HA mode.<br>Task Node: a pure compute node that stores no data; the data it computes comes from Core nodes and COS, so Task nodes are often used as elastic nodes that can be scaled out or in at any time. The number of Task nodes can be changed at any time for elastic scaling, with a minimum of 0.<br>Common Node: provides data-sharing synchronization and high-availability fault tolerance for the Master nodes of an HA cluster; mainly deploys distributed coordination components such as ZooKeeper and JournalNode. Quantity: 0 in non-HA mode, ≥3 in HA mode.<br>Router Node: shares the load of the Master nodes or serves as the task submission machine for the cluster; can be scaled out or in at any time. The number of Router nodes can be changed at any time, with a minimum of 0. |
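The minimum node counts in the table above can be summarized programmatically. The following is a minimal sketch, not part of any EMR API: the function name, parameters, and return structure are illustrative, and the values encode only the rules stated in the Hadoop cluster table (including the Kudu HA-only constraint).

```python
def hadoop_min_nodes(ha: bool, with_kudu: bool = False) -> dict:
    """Minimum node count per role for a Hadoop cluster.

    Rules from the table:
    - Master: 1 (non-HA), 2 (HA); 3 when Kudu is deployed (HA only).
    - Core:   >= 2 (non-HA), >= 3 (HA).
    - Task and Router: elastic, minimum 0.
    - Common: 0 (non-HA), >= 3 (HA).
    """
    if with_kudu and not ha:
        # Clusters that deploy Kudu support HA mode only.
        raise ValueError("Kudu deployments support HA mode only")
    return {
        "Master": 3 if with_kudu else (2 if ha else 1),
        "Core": 3 if ha else 2,
        "Task": 0,
        "Common": 3 if ha else 0,
        "Router": 0,
    }

# Example: minimum layout for an HA Hadoop cluster with Kudu
print(hadoop_min_nodes(ha=True, with_kudu=True))
```

Such a helper can be used, for instance, to sanity-check a requested node layout before submitting a cluster-creation request.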
Kafka Cluster
| Scenarios | Description | Node Deployment Description |
| --- | --- | --- |
| Default scenario | Provides a distributed, partitioned, multi-replica, multi-subscriber messaging system coordinated through ZooKeeper, mainly suited to asynchronous processing, message communication, and the ingestion and distribution of streaming data. | Core Node: serves as the broker node, primarily providing message storage; mainly deploys processes such as Broker. Quantity: ≥1 in non-HA mode, ≥2 in HA mode.<br>Common Node: provides data-sharing synchronization and high-availability fault tolerance for the Core nodes of an HA cluster. Quantity: 0 in non-HA mode, ≥3 in HA mode. |
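The Kafka cluster rules above lend themselves to a simple pre-flight check. This is an illustrative sketch, not an EMR API; the function name and message strings are assumptions, and the thresholds come directly from the table (Core ≥1 non-HA / ≥2 HA, Common 0 non-HA / ≥3 HA).

```python
def validate_kafka_cluster(ha: bool, core: int, common: int) -> list:
    """Return a list of rule violations for a proposed Kafka cluster layout."""
    problems = []
    min_core = 2 if ha else 1
    if core < min_core:
        problems.append(f"Core nodes must be >= {min_core}, got {core}")
    if ha:
        if common < 3:
            problems.append(f"HA mode requires >= 3 Common nodes, got {common}")
    elif common != 0:
        problems.append(f"non-HA mode deploys 0 Common nodes, got {common}")
    return problems  # empty list means the layout satisfies the table's rules
```

An empty return value indicates a valid layout; otherwise each entry names one violated rule.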
StarRocks Cluster
| Scenarios | Description | Node Deployment Description |
| --- | --- | --- |
| Default scenario | StarRocks uses full vectorization technology to deliver a high-speed, unified OLAP analysis database, suited to data analysis scenarios such as multi-dimensional analysis, real-time analysis, and high concurrency. | Master Node: serves as the Frontend module and also provides the Web UI; deploys processes such as FE Follower and Broker. Quantity: ≥1 in non-HA mode, ≥3 in HA mode.<br>Core Node: serves as the Backend module, primarily providing data storage; deploys processes such as BE and Broker, with a deployment quantity of ≥3.<br>Task Node: a pure compute node that stores no data; the data it computes comes from Core nodes and COS, so Task nodes are often used as elastic nodes that can be scaled out or in at any time. Mainly deploys Compute Node processes. The number of Task nodes can be changed at any time for elastic scaling, with a minimum of 0.<br>Router Node: deploys the Frontend module to provide high availability for reads and writes; can optionally deploy processes such as FE Observer and Broker. Router nodes can be scaled out but do not support scaling in. |
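As with the other cluster types, the StarRocks minimums can be encoded for quick reference. A minimal sketch under the rules stated in the table; the function name and dictionary keys are illustrative, not part of any EMR or StarRocks API.

```python
def starrocks_min_nodes(ha: bool) -> dict:
    """Minimum node count per role for a StarRocks cluster (per the table)."""
    return {
        "Master": 3 if ha else 1,  # Frontend (FE Follower); also hosts Web UI
        "Core": 3,                 # Backend (BE); always deployed with >= 3 nodes
        "Task": 0,                 # Compute Node processes; elastic, minimum 0
        "Router": 0,               # optional FE Observer; scale-out only, no scale-in
    }
```

Note that unlike Hadoop's Core nodes, the table states StarRocks Core nodes require ≥3 regardless of HA mode, and Router nodes can only grow.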