Business Evaluation

Last updated: 2023-12-21 16:00:14

Select a cluster type

EMR offers several cluster types so that you can choose the one that best fits your business needs:
Hadoop cluster: Built on open-source Hadoop and its ecosystem components. It offers five application scenarios (default scenario, ZooKeeper, HBase, Presto, and Kudu) and provides big data solutions for massive data storage, offline and real-time data analysis, streaming data computation, and machine learning.
Kafka cluster: A distributed, partitioned, multi-replica, multi-subscriber messaging system coordinated by ZooKeeper. It is mainly suited to asynchronous processing, message communication, and receiving and distributing streaming data.
StarRocks cluster: A fast, unified OLAP database built on full vectorization. It suits a range of data analysis scenarios, including multi-dimensional analysis, real-time analysis, and high-concurrency queries.

Select a billing model

EMR clusters support two billing models:
Annual and monthly subscription clusters: All nodes in the cluster are billed by annual or monthly subscription. This model suits long-running, stable compute clusters.
Pay-as-you-go clusters: All nodes in the cluster are billed on a pay-as-you-go basis. This model suits clusters that exist only for short periods or run periodically.
Note
EMR nodes do not support the no-charge shutdown mode. Choose the shutdown mode carefully when you shut down pay-as-you-go EMR nodes in the CVM console.

Select a machine type

EMR offers several cloud server types: EMR Standard, EMR Computational, EMR High IO, EMR Memory, and EMR Big Data. If you require Blackstone 2.0, contact us through Pre-sales Consultation.
Select the machine type based on your business needs and budget.
If your offline computing workloads have latency requirements, we recommend local disk or big data machine types.
If you use HBase as a real-time database, we recommend the EMR High IO type with local SSD disks for the best performance.
Local disk models cannot be deployed on Master or Common nodes. Select a non-local disk model for these nodes.
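The selection rules above can be read as a small decision procedure. The following Python sketch is only an illustration: the `Workload` class, its field names, and the returned strings are hypothetical and are not part of any EMR API. It also assumes that the node-type restriction takes priority over the other rules.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    node_type: str                            # "Master", "Core", "Task", "Common", or "Router"
    uses_hbase: bool = False                  # HBase used as a real-time database
    latency_sensitive_offline: bool = False   # offline jobs with latency requirements


def suggest_machine_type(w: Workload) -> str:
    """Return a suggestion that follows the rules of thumb listed above."""
    # Assumption: the Master/Common restriction is checked first, because
    # local disk models cannot be deployed on those nodes at all.
    if w.node_type in ("Master", "Common"):
        return "non-local disk model"
    if w.uses_hbase:
        return "EMR High IO with local SSD"
    if w.latency_sensitive_offline:
        return "local disk or EMR Big Data"
    return "choose by business needs and cost"
```

For example, `suggest_machine_type(Workload("Core", uses_hbase=True))` returns the High IO suggestion, while the same workload on a Master node falls back to a non-local disk model.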

Recommended Node Specifications

EMR defines five node types. Which of them you need depends on the cluster type:
| Cluster type | Scenario | Node type | Recommended specification |
| --- | --- | --- | --- |
| Hadoop | Default scenario | Master | Choose an instance specification with more memory, at least 8 GB. Using cloud disks for storage improves cluster stability. |
| Hadoop | Default scenario | Core | If most of your data is stored in COS, Core nodes function much like Task nodes, and at least 500 GB of storage is recommended. Core nodes do not support elastic scaling. If your architecture does not use COS, Core nodes handle both computation and storage for the cluster. EMR keeps three replicas of data by default, so account for the space of all three replicas when estimating data disk size. Big data models are recommended. |
| Hadoop | Default scenario | Task | If your architecture does not use COS, Task nodes are not needed. If most of your data is stored in COS, Task nodes can serve as elastic compute resources that you add as needed. If your cluster is billed by annual or monthly subscription, Task nodes should be billed on a pay-as-you-go basis. |
| Hadoop | Default scenario | Common | Used mainly as ZooKeeper nodes. A minimum of 2 cores and 4 GB of memory with a 100 GB cloud disk is sufficient. |
| Hadoop | Default scenario | Router | Used mainly to offload the Master node and to serve as job submission machines. Choose a model with more memory, preferably no lower than the Master specification. |
| Hadoop | ZooKeeper | Common | Used mainly as ZooKeeper nodes. A minimum of 2 cores and 4 GB of memory with a 100 GB cloud disk is sufficient. |
| Hadoop | HBase | Master | Choose an instance specification with more memory, at least 8 GB. Using cloud disks for storage improves cluster stability. |
| Hadoop | HBase | Core | If most of your data is stored in COS, Core nodes function much like Task nodes, and at least 500 GB of storage is recommended. Core nodes do not support elastic scaling. If your architecture does not use COS, Core nodes handle both computation and storage for the cluster. |
| Hadoop | HBase | Task | If your architecture does not use COS, Task nodes are not needed. If most of your data is stored in COS, Task nodes can serve as elastic compute resources that you add as needed. If your cluster is billed by annual or monthly subscription and you want Task nodes billed on a pay-as-you-go basis, set the number of Task nodes to 0 when you create the cluster, then add pay-as-you-go Task nodes as needed through the console or API. |
| Hadoop | HBase | Common | Used mainly as ZooKeeper nodes. A minimum of 2 cores and 4 GB of memory with a 100 GB cloud disk is sufficient. |
| Hadoop | HBase | Router | Used mainly to offload the Master node and to serve as job submission machines. Choose a model with more memory, preferably no lower than the Master specification. |
| Hadoop | Kudu | Master | Choose an instance specification with more memory, at least 8 GB. Using cloud disks for storage improves cluster stability. |
| Hadoop | Kudu | Core | If most of your data is stored in COS, Core nodes function much like Task nodes, and at least 500 GB of storage is recommended. Core nodes do not support elastic scaling. If your architecture does not use COS, Core nodes handle both computation and storage for the cluster. EMR keeps three replicas of data by default, so account for the space of all three replicas when estimating data disk size. Big data models are recommended. |
| Hadoop | Kudu | Task | If your architecture does not use COS, Task nodes are not needed. If most of your data is stored in COS, Task nodes can serve as elastic compute resources that you add as needed. If your cluster is billed by annual or monthly subscription and you want Task nodes billed on a pay-as-you-go basis, set the number of Task nodes to 0 when you create the cluster, then add pay-as-you-go Task nodes as needed through the console or API. |
| Hadoop | Kudu | Common | Used mainly as ZooKeeper nodes. A minimum of 2 cores and 4 GB of memory with a 100 GB cloud disk is sufficient. |
| Hadoop | Kudu | Router | Used mainly to offload the Master node and to serve as job submission machines. Choose a model with more memory, preferably no lower than the Master specification. |
| Hadoop | Presto | Master | Choose an instance specification with more memory, at least 8 GB. Using cloud disks for storage improves cluster stability. |
| Hadoop | Presto | Core | If most of your data is stored in COS, Core nodes function much like Task nodes, and at least 500 GB of storage is recommended. Core nodes do not support elastic scaling. If your architecture does not use COS, Core nodes handle both computation and storage for the cluster. EMR keeps three replicas of data by default, so account for the space of all three replicas when estimating data disk size. Big data models are recommended. |
| Hadoop | Presto | Task | If your architecture does not use COS, Task nodes are not needed. If most of your data is stored in COS, Task nodes can serve as elastic compute resources that you add as needed. If your cluster is billed by annual or monthly subscription and you want Task nodes billed on a pay-as-you-go basis, set the number of Task nodes to 0 when you create the cluster, then add pay-as-you-go Task nodes as needed through the console or API. |
| Hadoop | Presto | Common | Used mainly as ZooKeeper nodes. A minimum of 2 cores and 4 GB of memory with a 100 GB cloud disk is sufficient. |
| Hadoop | Presto | Router | Used mainly to offload the Master node and to serve as job submission machines. Choose a model with more memory, preferably no lower than the Master specification. |
| Kafka | Default scenario | Core | Choose models with more CPU and memory. Because data on local disks can be lost if a disk fails, cloud disks are recommended. |
| Kafka | Default scenario | Common | A minimum of 4 cores and 16 GB of memory is recommended. |
| StarRocks | Default scenario | Master | Choose an instance specification with more memory, at least 8 GB. All metadata on Master nodes is stored in memory. |
| StarRocks | Default scenario | Core | Choose an instance specification with more memory, at least 8 GB. For better I/O performance and stability, cloud SSD disks are recommended. |
| StarRocks | Default scenario | Router | Router nodes run the Frontend module to provide high availability for reads and writes. Choose a model with more memory, preferably no lower than the Master specification. |
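A planned node configuration can be checked against the recommended minimums with a few lines of code. The Python sketch below transcribes a handful of the figures from the recommendations above; the dictionary layout, the key names, and the `shortfalls` helper are illustrative assumptions and do not correspond to any EMR data format or API.

```python
# Minimums transcribed from the recommendations above. The layout and key
# names are hypothetical, chosen only for this illustration.
RECOMMENDED_MINIMUMS = {
    ("Hadoop", "Master"): {"memory_gb": 8},
    ("Hadoop", "Common"): {"cpu_cores": 2, "memory_gb": 4, "disk_gb": 100},
    ("Kafka", "Common"): {"cpu_cores": 4, "memory_gb": 16},
    ("StarRocks", "Master"): {"memory_gb": 8},
    ("StarRocks", "Core"): {"memory_gb": 8},
}


def shortfalls(cluster_type, node_type, spec):
    """List the fields of `spec` that fall below the recommended minimum.

    A field missing from `spec` is treated as 0. Combinations that are not
    in RECOMMENDED_MINIMUMS are not checked and yield an empty list.
    """
    minimums = RECOMMENDED_MINIMUMS.get((cluster_type, node_type), {})
    return [
        f"{field}: {spec.get(field, 0)} < {minimum}"
        for field, minimum in minimums.items()
        if spec.get(field, 0) < minimum
    ]
```

For instance, `shortfalls("Kafka", "Common", {"cpu_cores": 2, "memory_gb": 16})` reports only the CPU shortfall, because 16 GB of memory already meets the recommendation.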
Note
Node specification requirements vary by cluster type. By default, the system recommends configurations that meet the requirements of the selected cluster type. The recommended models are for reference only, and you can adjust the specifications to fit your business needs.
Core nodes do not support elastic scaling. If your architecture does not use COS, Core nodes handle both computation and storage for the cluster. EMR keeps three replicas of data by default, so account for the space of all three replicas when estimating data disk size. Big data models are recommended.
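As a rough illustration of the three-replica sizing rule, the Python sketch below estimates the data disk capacity needed per Core node. The function name, the even spread of data across nodes, and the 20% headroom default are assumptions made for this example, not EMR recommendations; only the default of three replicas comes from the text above.

```python
def data_disk_per_core_node_gb(logical_data_gb, core_nodes,
                               replication=3, headroom_percent=20):
    """Estimate the data disk capacity (GB) each Core node needs.

    logical_data_gb:  size of the data before replication
    core_nodes:       number of Core nodes sharing the data evenly (assumed)
    replication:      replica count; 3 matches the EMR default described above
    headroom_percent: extra free space to keep; 20 is an assumed value
    """
    if core_nodes <= 0:
        raise ValueError("core_nodes must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    total_scaled = logical_data_gb * replication * (100 + headroom_percent)
    # Ceiling division, so the estimate never rounds below the requirement.
    return -(-total_scaled // (100 * core_nodes))
```

Under these assumptions, 10,000 GB of logical data on 6 Core nodes works out to 10,000 × 3 × 1.2 / 6 = 6,000 GB of data disk per node.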

Network and Security

To keep the cluster's network secure, EMR places the cluster in a VPC and adds a security group policy to it. To make the WebUIs of Hadoop ecosystem components easy to reach, a public IP is enabled on one of the Master nodes and billed by traffic. Router nodes do not have a public IP enabled by default. If you need one, you can bind an elastic public IP in the CVM console.
Note
A public IP is enabled for the Master node by default when you create a cluster, but you can choose not to enable it.
Public network access on the Master node is mainly used for SSH login and for viewing component WebUIs.
Public network access on the Master node is billed by traffic, with a bandwidth cap of 5 Mbps. After the cluster is created, you can adjust the network settings in the console.