The software configuration feature lets you customize the settings of components such as HDFS, YARN, and Hive when you create a cluster.
Custom Software Configuration
Software such as Hadoop and Hive has a large number of configuration items. With the software configuration feature, you can set component parameters yourself when creating a new cluster. To do so, you provide a JSON file that meets the requirements below. You can write this file yourself, or export the software configuration parameters of an existing cluster and use them to quickly create a new one. For details on exporting software configuration parameters, see Export Software Configuration.
'serviceName' is the component name and must be in uppercase.
'classification' is the configuration file name and must be given in full, including the suffix.
'serviceVersion' is the component version. It must match the component version that corresponds to the EMR product version.
'properties' holds the parameters that you want to configure yourself.
To modify the configuration parameters in capacity-scheduler.xml or fair-scheduler.xml, set the property key in 'properties' to 'content' and the value to the entire content of the file.
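The field rules above can be checked before the JSON is submitted. The following is a minimal sketch, not an official tool: the helper make_entry is hypothetical, the top-level layout (a JSON array with one object per configuration file) is an assumption, and the version strings, the dfs.blocksize value, and the scheduler XML are placeholders.

```python
import json
import os

# Files whose whole content goes under a single 'content' key.
SCHEDULER_FILES = {"capacity-scheduler.xml", "fair-scheduler.xml"}


def make_entry(service_name, classification, service_version, properties):
    """Build one configuration entry (hypothetical helper) after checking the field rules."""
    if service_name != service_name.upper():
        raise ValueError("serviceName must be uppercase: %r" % service_name)
    if not os.path.splitext(classification)[1]:
        raise ValueError("classification must be the full file name, including the suffix")
    if classification in SCHEDULER_FILES and set(properties) != {"content"}:
        raise ValueError("%s expects a single 'content' key" % classification)
    return {
        "serviceName": service_name,
        "classification": classification,
        "serviceVersion": service_version,
        "properties": dict(properties),
    }


# Placeholder values: use the versions that match your EMR product version.
entries = [
    make_entry("HDFS", "hdfs-site.xml", "2.8.4", {"dfs.blocksize": "67108864"}),
    make_entry(
        "YARN",
        "fair-scheduler.xml",
        "2.8.4",
        {"content": '<?xml version="1.0"?>\n<allocations>\n</allocations>'},
    ),
]

# Assumed layout: a JSON array of entries.
config_json = json.dumps(entries, indent=2)
print(config_json)
```

The printed JSON is what would be entered as the software configuration. Rejecting a lowercase component name or a file name without its suffix locally is cheaper than finding the mistake after cluster creation has started.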
To adjust the configuration of an existing cluster, use the Configuration Distribution feature.
Accessing External Clusters
After configuring the access address information for the external cluster's HDFS, you can read the data from the external cluster.
Configuration at Purchase
EMR supports configuring access to external clusters when creating a new cluster. This can be done by entering a compliant JSON file in the software configuration section on the Purchase Page. The following example illustrates this under hypothetical conditions:
Assumed Conditions
Assume that the nameservice required to access the external cluster is HDFS8088, and its access method is as follows:
If you need to access an external cluster from a newly created cluster, navigate to the Purchase Page and open the advanced settings.
JSON File and Explanation:
Using the assumed conditions as an example, enter the JSON content in the input box (the JSON requirements are the same as for custom software configuration).
The 'classification' must be "hdfs-site.xml".
'serviceVersion' is the component version. It must match the component version that corresponds to the EMR product version.
The values in 'properties' must match the assumed conditions.
newNameServiceName (optional) is the nameservice of the newly created cluster. If left blank, it is generated by the system; if specified, it can contain only letters, digits, and hyphens.
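As an illustration of the 'properties' described above, the sketch below builds the standard Hadoop HA client properties for the external nameservice HDFS8088 and checks a candidate newNameServiceName. It rests on several assumptions: the IP addresses, the NameNode IDs nn1 and nn2, the HTTP port, and the version string are hypothetical, both helper functions are invented for this example, and the naming rule is read as letters, digits, and hyphens. Where newNameServiceName is placed in the JSON is not stated here, so the sketch only validates it.

```python
import json
import re

# Assumed reading of the naming rule: letters, digits, and hyphens only.
NAME_RE = re.compile(r"[A-Za-z0-9-]+")


def check_new_nameservice_name(name):
    """Return name unchanged if it is blank or well formed (hypothetical helper)."""
    if name and not NAME_RE.fullmatch(name):
        raise ValueError("invalid newNameServiceName: %r" % name)
    return name


def ha_nameservice_properties(nameservice, namenodes):
    """Standard Hadoop HA client properties for one remote nameservice.

    namenodes maps a NameNode ID to a (host, rpc_port, http_port) tuple.
    """
    props = {f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes)}
    for nn_id, (host, rpc_port, http_port) in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = f"{host}:{rpc_port}"
        props[f"dfs.namenode.http-address.{nameservice}.{nn_id}"] = f"{host}:{http_port}"
    props[f"dfs.client.failover.proxy.provider.{nameservice}"] = (
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    )
    return props


# Hypothetical addresses; take the real ones from the external cluster's hdfs-site.xml.
external = ha_nameservice_properties(
    "HDFS8088",
    {"nn1": ("10.0.0.1", 4007, 4008), "nn2": ("10.0.0.2", 4007, 4008)},
)
entry = {
    "serviceName": "HDFS",
    "classification": "hdfs-site.xml",
    "serviceVersion": "2.8.4",  # placeholder version
    "properties": external,
}
print(json.dumps([entry], indent=2))
print(check_new_nameservice_name("new-emr-cluster"))
```

Generating the per-NameNode keys from one mapping keeps the nameservice name and NameNode IDs consistent across all keys, which is where hand-written HA configurations most often go wrong.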
Note
Only high-availability (HA) external clusters can be accessed.
Only external clusters that do not have Kerberos enabled can be accessed.
Configuration After Purchase
After an EMR cluster has been created, you can configure access to an external cluster through the Configuration Distribution feature of EMR.
Assume the following conditions:
The nameservice of this cluster is HDFS80238 (for a non-HA cluster, it is generally masterIp:rpcport, for example, 172.21.0.11:4007).
The nameservice of the external cluster that needs to be accessed is HDFS8088, and its access method is:
If the external cluster is also an EMR cluster, this information can be viewed on its Configuration Distribution management page, or by logging in to a node and viewing the /usr/local/service/hadoop/etc/hadoop/hdfs-site.xml file.
The dfs.internal.nameservices property must also be added and set to this cluster's own nameservice. Otherwise, after the cluster is scaled out, DataNodes may report exceptions and be marked as dead by the NameNode.
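The nameservice-level keys for this scenario can be sketched as follows. This is an illustration under stated assumptions rather than the exact payload EMR expects: add_external_nameservice is a hypothetical helper, and the key names are the standard Hadoop ones (Hadoop spells the internal key dfs.internal.nameservices). The per-NameNode HA properties for HDFS8088 would be distributed alongside these two keys.

```python
def add_external_nameservice(existing, local_ns, external_ns):
    """Return a copy of hdfs-site properties with the external nameservice registered.

    Hypothetical helper: dfs.nameservices lists every nameservice clients can
    resolve, while dfs.internal.nameservices names only this cluster's own one.
    """
    updated = dict(existing)
    names = [n for n in updated.get("dfs.nameservices", "").split(",") if n]
    if local_ns not in names:
        names.insert(0, local_ns)
    if external_ns not in names:
        names.append(external_ns)
    updated["dfs.nameservices"] = ",".join(names)
    # DataNodes should register only with NameNodes of the local nameservice.
    updated["dfs.internal.nameservices"] = local_ns
    return updated


current = {"dfs.nameservices": "HDFS80238"}
result = add_external_nameservice(current, "HDFS80238", "HDFS8088")
for key in sorted(result):
    print(key, "=", result[key])
```

The function returns a new dictionary and is safe to call twice, so it can be re-run against the current cluster configuration without duplicating HDFS8088 in dfs.nameservices.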