Software Configuration

Last updated: 2023-12-21 16:06:50

Feature Description

The software configuration feature allows you to customize the settings of components such as HDFS, YARN, and Hive during the creation of your cluster.

Custom Software Configuration

Software such as Hadoop and Hive contains a multitude of configuration parameters. The software configuration feature lets you configure component parameters independently when creating a new cluster. To do so, provide a JSON file that meets the requirements below. You can write this file yourself, or export the software configuration parameters of an existing cluster and use them to quickly create a new one. For details on exporting software configuration parameters, please refer to Export Software Configuration.
JSON File Example and Explanation:
[
  {
    "serviceName": "HDFS",
    "classification": "hdfs-site.xml",
    "serviceVersion": "2.8.4",
    "properties": {
      "dfs.blocksize": "67108864",
      "dfs.client.slow.io.warning.threshold.ms": "900000",
      "output.replace-datanode-on-failure": "false"
    }
  },
  {
    "serviceName": "YARN",
    "classification": "yarn-site.xml",
    "serviceVersion": "2.8.4",
    "properties": {
      "yarn.app.mapreduce.am.staging-dir": "/emr/hadoop-yarn/staging",
      "yarn.log-aggregation.retain-check-interval-seconds": "604800",
      "yarn.scheduler.minimum-allocation-vcores": "1"
    }
  },
  {
    "serviceName": "YARN",
    "classification": "capacity-scheduler.xml",
    "serviceVersion": "2.8.4",
    "properties": {
      "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n<configuration><property>\n <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>\n <value>0.8</value>\n</property>\n<property>\n <name>yarn.scheduler.capacity.maximum-applications</name>\n <value>1000</value>\n</property>\n<property>\n <name>yarn.scheduler.capacity.root.default.capacity</name>\n <value>100</value>\n</property>\n<property>\n <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>\n <value>100</value>\n</property>\n<property>\n <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>\n <value>1</value>\n</property>\n<property>\n <name>yarn.scheduler.capacity.root.queues</name>\n <value>default</value>\n</property>\n</configuration>"
    }
  }
]
Configuration Parameter Explanation:
'serviceName' is the component name and must be in uppercase.
'classification' is the configuration filename and must be given in full, including its suffix.
'serviceVersion' is the component version. It must match the component version corresponding to the EMR product version.
'properties' holds the parameters you want to configure independently.
To modify configuration parameters in capacity-scheduler.xml or fair-scheduler.xml, set the property key in 'properties' to 'content' and set its value to the entire content of the file.
To adjust the configuration of an existing cluster, use the Configuration Distribution feature.
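The rules above can also be applied programmatically. The following is a minimal sketch of assembling and serializing the software-configuration JSON in Python; the parameter values are taken from the example above and are illustrative only.

```python
import json

# Assemble one software-configuration entry per component/file pair.
config = [
    {
        "serviceName": "HDFS",               # component name, must be uppercase
        "classification": "hdfs-site.xml",   # full filename, including suffix
        "serviceVersion": "2.8.4",           # must match the EMR product version
        "properties": {
            "dfs.blocksize": "67108864",
            "dfs.client.slow.io.warning.threshold.ms": "900000",
        },
    },
]

# Serialize to the JSON text that is pasted into the software configuration box.
payload = json.dumps(config, indent=2)
print(payload)
```

Assembling the file in code makes it easy to reuse the same parameter set across clusters and to validate it before pasting it into the console.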

Accessing External Clusters

After configuring the access address information for the external cluster's HDFS, you can read the data from the external cluster.

Configuration at Purchase

EMR supports configuring access to an external cluster when creating a new cluster. To do this, enter a compliant JSON file in the software configuration section on the Purchase Page. The following example illustrates this under hypothetical conditions.
Assumed conditions: suppose the nameservice required to access the external cluster is HDFS8088, and it is accessed as follows:
<property>
  <name>dfs.ha.namenodes.HDFS8088</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.http-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4008</value>
</property>
<property>
  <name>dfs.namenode.https-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4009</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4007</value>
</property>
<property>
  <name>dfs.namenode.http-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4008</value>
</property>
<property>
  <name>dfs.namenode.https-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4009</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4007</value>
</property>
If you need to access an external cluster from a newly created cluster, navigate to the Purchase Page and open the advanced settings.


JSON File and Explanation: Using the assumed conditions as an example, enter the following JSON content in the software configuration box (the JSON content requirements are the same as for custom software configuration).
[
  {
    "serviceName": "HDFS",
    "classification": "hdfs-site.xml",
    "serviceVersion": "2.7.3",
    "properties": {
      "newNameServiceName": "newEmrCluster",
      "dfs.ha.namenodes.HDFS8088": "nn1,nn2",
      "dfs.namenode.http-address.HDFS8088.nn1": "172.21.16.11:4008",
      "dfs.namenode.https-address.HDFS8088.nn1": "172.21.16.11:4009",
      "dfs.namenode.rpc-address.HDFS8088.nn1": "172.21.16.11:4007",
      "dfs.namenode.http-address.HDFS8088.nn2": "172.21.16.40:4008",
      "dfs.namenode.https-address.HDFS8088.nn2": "172.21.16.40:4009",
      "dfs.namenode.rpc-address.HDFS8088.nn2": "172.21.16.40:4007"
    }
  }
]

Configuration Parameter Explanation:

The serviceName component name must be "HDFS".
The classification filename must be "hdfs-site.xml".
'serviceVersion' is the component version. It must match the component version corresponding to the EMR product version.
The content of 'properties' matches the assumed conditions.
newNameServiceName (optional) is the nameservice of the newly created cluster. If left blank, it is generated by the system; if provided, it may only contain letters, numbers, and hyphens.
Note
Only high-availability external clusters can be accessed. External clusters with Kerberos enabled are not supported.
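Because each NameNode of the external nameservice needs the same three address entries, the properties map above can be generated rather than typed out. Here is a minimal sketch; external_hdfs_properties is a hypothetical helper (not an EMR API), and the nameservice, NameNode IDs, hosts, and ports mirror the assumed conditions.

```python
import json

def external_hdfs_properties(nameservice, namenodes, rpc=4007, http=4008, https=4009):
    """Build hdfs-site.xml properties for an external HA nameservice.

    namenodes: mapping of NameNode ID (e.g. 'nn1') to host IP.
    """
    props = {f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes)}
    for nn_id, host in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = f"{host}:{rpc}"
        props[f"dfs.namenode.http-address.{nameservice}.{nn_id}"] = f"{host}:{http}"
        props[f"dfs.namenode.https-address.{nameservice}.{nn_id}"] = f"{host}:{https}"
    return props

config = [{
    "serviceName": "HDFS",
    "classification": "hdfs-site.xml",
    "serviceVersion": "2.7.3",
    "properties": {
        "newNameServiceName": "newEmrCluster",  # optional: letters, numbers, hyphens
        **external_hdfs_properties("HDFS8088",
                                   {"nn1": "172.21.16.11", "nn2": "172.21.16.40"}),
    },
}]
print(json.dumps(config, indent=2))
```

Generating the map this way keeps the nn1/nn2 address triples consistent and avoids copy-paste mistakes in the host:port values.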

Configuration After Purchase

After an EMR cluster has been created, it can access an external cluster through EMR's Configuration Distribution feature.
Assumed conditions: suppose the nameservice of this cluster is HDFS80238 (for a non-high-availability cluster, it is generally masterIp:rpcport, for example 172.21.0.11:4007), and the nameservice of the external cluster to be accessed is HDFS8088, accessed as follows:
<property>
  <name>dfs.ha.namenodes.HDFS8088</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.http-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4008</value>
</property>
<property>
  <name>dfs.namenode.https-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4009</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS8088.nn1</name>
  <value>172.21.16.11:4007</value>
</property>
<property>
  <name>dfs.namenode.http-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4008</value>
</property>
<property>
  <name>dfs.namenode.https-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4009</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS8088.nn2</name>
  <value>172.21.16.40:4007</value>
</property>

For the EMR cluster, this information can be viewed on the Configuration Distribution management page, or by logging in to a node and viewing the /usr/local/service/hadoop/etc/hadoop/hdfs-site.xml file.
1. Go to the Configuration Distribution page and select the hdfs-site.xml file of the HDFS component.
2. Modify the configuration item dfs.nameservices to HDFS80238,HDFS8088.
3. Add the following configuration items and values:
Configuration item | Value
dfs.ha.namenodes.HDFS8088 | nn1,nn2
dfs.namenode.http-address.HDFS8088.nn1 | 172.21.16.11:4008
dfs.namenode.https-address.HDFS8088.nn1 | 172.21.16.11:4009
dfs.namenode.rpc-address.HDFS8088.nn1 | 172.21.16.11:4007
dfs.namenode.http-address.HDFS8088.nn2 | 172.21.16.40:4008
dfs.namenode.https-address.HDFS8088.nn2 | 172.21.16.40:4009
dfs.namenode.rpc-address.HDFS8088.nn2 | 172.21.16.40:4007
dfs.client.failover.proxy.provider.HDFS8088 | org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.internal.nameservices | HDFS80238
Note
The dfs.internal.nameservices item must be added; otherwise, after the cluster is scaled out, the DataNodes may report exceptions and be marked as dead by the NameNode.
4. Distribute the configuration using the Configuration Distribution feature.
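After distributing the configuration, the resulting hdfs-site.xml can be sanity-checked against the steps above. The following is a minimal sketch; load_props is a hypothetical helper, and an inline sample stands in for the file at /usr/local/service/hadoop/etc/hadoop/hdfs-site.xml, which you would read on a cluster node.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the distributed hdfs-site.xml (abridged).
SAMPLE = """<configuration>
  <property><name>dfs.nameservices</name><value>HDFS80238,HDFS8088</value></property>
  <property><name>dfs.internal.nameservices</name><value>HDFS80238</value></property>
  <property><name>dfs.ha.namenodes.HDFS8088</name><value>nn1,nn2</value></property>
</configuration>"""

def load_props(xml_text):
    """Parse Hadoop-style <configuration> XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

props = load_props(SAMPLE)
# The external nameservice must be listed, and dfs.internal.nameservices must
# name only this cluster's own nameservice.
print("HDFS8088" in props["dfs.nameservices"].split(","))
print(props["dfs.internal.nameservices"] == "HDFS80238")
```

A check like this catches the missing dfs.internal.nameservices entry mentioned in the note before it can cause DataNodes to be marked dead after scale-out.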
For more detailed information and principles related to the configuration, please refer to the Community Documentation.