Feature Overview
Cluster events comprise the event list and the event policy.
Event List: A record of key changes or abnormal events occurring within the cluster.
Event Policy: Lets you customize event monitoring trigger policies based on your business needs. Monitored events can be set as cluster inspection items.
Viewing the Event List
1. Log in to the Elastic MapReduce Console. In the cluster list, click on the corresponding cluster ID/Name to access the cluster details page.
2. In the cluster details page, select Cluster Monitoring > Cluster Events > Event List to directly view all operation events of the current cluster.

The severity levels are explained as follows:
Fatal: node or service exceptions that require manual intervention; without it, the service becomes unavailable. Such events may persist for some time.
Critical: warnings that have not yet made a service or node unavailable but may develop into fatal events if left unhandled.
Average: routine events occurring in the cluster that generally require no special handling.
3. Click a value in the Number of Triggers Today column to view the event's trigger records, along with the related metrics, logs, or snapshots for each record.
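When you consume these events programmatically, the three severity levels lend themselves to different handling routines. The sketch below is purely illustrative: only the severity names (Fatal/Critical/Average) come from this document; the event structure and handler are hypothetical.

```python
# Hypothetical triage helper for cluster events; only the severity names
# (Fatal/Critical/Average) come from this document.

def handle_event(event: dict) -> str:
    severity = event.get("severity", "Average")
    if severity == "Fatal":
        # Requires manual intervention; the service is unavailable until then.
        return f"PAGE on-call: {event['name']} (manual intervention required)"
    if severity == "Critical":
        # Warning tier: not yet causing unavailability, but may escalate.
        return f"WARN: {event['name']} (handle before it becomes fatal)"
    # Routine event; generally no special handling is needed.
    return f"LOG: {event['name']}"

print(handle_event({"name": "NameNode full GC", "severity": "Critical"}))
```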

Setting Event Policies
1. Log in to the EMR Console, and in the cluster list, click on the corresponding cluster ID/Name to access the cluster details page.
2. In the cluster details page, select Cluster Monitoring > Cluster Events > Event Policy to customize and set event monitoring trigger policies.
3. The event configuration list includes: Event Name, Event Discovery Policy, Severity (Fatal/Critical/Average), and Monitoring Switch. These settings can be modified and saved.

4. Event discovery policies fall into two categories: system-fixed policies, which users cannot modify, and user-configurable policies, whose thresholds vary with each customer's business requirements.

5. Event policies can be customized to enable or disable monitoring for each event. Only events with monitoring enabled can be selected as inspection items in cluster inspection. Some events are enabled by default, and a subset of those cannot be disabled. In the table below, Supported in the Can Be Disabled and Enabled by Default columns means yes, and Not required means no. The specific rules are as follows:
Category | Event Name | Event Description | Recommendations & Measures | Default Values | Severity | Can Be Disabled | Enabled by Default |
Node | Continuous CPU Utilization Exceeds Threshold | Machine CPU Utilization >= m, for a duration of t seconds (300<=t<=2592000) | Node expansion or upgrade | m=85, t=1800 | Critical | Supported | Supported |
| Average CPU IO Wait Exceeds Threshold | Average CPU iowait utilization >= m within t seconds (300<=t<=2592000) | Manually check | m=60, t=1800 | Critical | Supported | Supported |
| Continuous One-Minute CPU Load Exceeds Threshold | One-minute CPU load average >= m, for a duration of t seconds (300<=t<=2592000) | Node expansion or upgrade | m=8, t=1800 | Average | Supported | Not required |
| Continuous Five-Minute CPU Load Exceeds Threshold | Five-minute CPU load average >= m, for a duration of t seconds (300<=t<=2592000) | Node expansion or upgrade | m=8, t=1800 | Critical | Supported | Not required |
| The memory usage consistently goes over the limit. | Memory Usage >= m, Continuously for t seconds (300<=t<=2592000) | Node expansion or upgrade | m=85, t=1800 | Critical | Supported | Supported |
| The total number of system processes consistently goes over the limit. | Total System Processes >= m, Continuously for t seconds (300<=t<=2592000) | Manually check | m=10000, t=1800 | Critical | Supported | Supported |
| Node file handle usage consistently goes over the limit. | Node File Handle Usage >= m, Continuously for t seconds (300<=t<=2592000) | Manually check | m=85, t=1800 | Average | Supported | Not required |
| Node TCP connection count consistently exceeds the threshold. | Node TCP Connection Count >= m, Continuously for t seconds (300<=t<=2592000) | Inspect for potential connection leaks. | m=10000, t=1800 | Average | Supported | Not required |
| Node memory usage exceeds the limit | The total memory configured for all roles on the node exceeds the node's physical memory threshold. | Adjust the heap memory allocation of the node's processes. | 90% | Critical | Supported | Not required |
| Ping to the metadata database failed. | CDB heartbeat has not been reported on schedule. | - | - | - | - | - |
| The utilization of single disk capacity keeps exceeding the threshold. | The utilization of single disk capacity is greater than or equal to 'm', persisting for 't' seconds (300<=t<=2592000). | Node expansion or upgrade | m=0.85, t=1800 | Critical | Supported | Supported |
| The utilization rate of single disk IO devices consistently surpasses the threshold. | The utilization rate of single disk IO devices is greater than or equal to 'm', persisting for 't' seconds (300<=t<=2592000). | Node expansion or upgrade | m=0.85, t=1800 | Critical | Supported | Supported |
| The usage rate of single disk INODES consistently exceeds the threshold. | The usage rate of single disk INODES is greater than or equal to 'm', persisting for 't' seconds (300<=t<=2592000). | Node expansion or upgrade | m=0.85, t=1800 | Critical | Supported | Supported |
| The difference between the node's UTC time and NTP time exceeds the threshold. | The difference between the node's UTC time and NTP time exceeds m milliseconds. | 1. Ensure the NTP daemon is running. 2. Ensure that network communication with the NTP server is functioning properly. | m=30000 | Critical | Supported | Supported |
| Automatic compensation for faulty nodes. | Upon enabling the automatic compensation feature, should task nodes or router nodes malfunction, the system will automatically purchase replacements of the same model and configuration to compensate. | 1. Compensation replacement has been successfully executed, no further attention is required. 2. Compensation replacement has failed. Please proceed to the console to manually terminate and repurchase nodes for replacement. | - | Average | Supported | Supported |
| Node failure | There are faulty nodes within the cluster. | Please proceed to the console for resolution or submit a ticket to liaise with a specialist for assistance. | - | Critical | Not required | Supported |
| Node disk IO is abnormal. | Node disk IO is abnormal (detection is based on device IOPS and IO utilization, and applies to certain IO exception scenarios). | The issue may be caused by an IO hang or a faulty disk; check the node's disks manually. | - | Critical | Supported | Not required |
HDFS | The total number of HDFS files consistently surpasses the threshold. | The total number of files in the cluster is greater than or equal to m, for a duration of t seconds (where 300 <= t <= 2592000). | Increase the memory allocation for the NameNode. | m=50,000,000, t=1800 | Critical | Supported | Not required |
| The total number of HDFS blocks consistently exceeds the threshold. | The total number of blocks in the cluster is greater than or equal to m, for a duration of t seconds (where 300 <= t <= 2592000). | Increase the memory allocation for the NameNode or enlarge the block size. | m=50,000,000, t=1800 | Critical | Supported | Not required |
| The number of HDFS DataNodes marked as Dead persistently surpasses the threshold. | The number of DataNodes marked as Dead is greater than or equal to m, for a duration of t seconds (where 300 <= t <= 2592000). | Manually check | m=1, t=1800 | Average | Supported | Not required |
| The utilization rate of HDFS storage space persistently surpasses the threshold. | The utilization rate of HDFS storage space is greater than or equal to m, for a duration of t seconds (where 300 <= t <= 2592000). | Clean up files within HDFS or expand the cluster capacity. | m=85, t=1800 | Critical | Supported | Supported |
| A primary-secondary switch has occurred in the NameNode. | A primary-secondary switch has occurred in the NameNode. | Investigate the cause of the NameNode switch. | - | Critical | Supported | Supported |
| The delay in processing NameNode RPC requests consistently exceeds the threshold. | The delay in processing RPC requests is greater than or equal to m milliseconds, for a duration of t seconds (where 300 <= t <= 2592000). | Manually check | m=300, t=300 | Critical | Supported | Not required |
| The current number of connections to the NameNode consistently surpasses the threshold. | The current number of NameNode connections is >= m, with a duration of t seconds (300<=t<=2592000). | Manually check | m=2000, t=1800 | Average | Supported | Not required |
| A full GC event has occurred in the NameNode. | A full GC event has occurred in the NameNode. | Parameter optimization. | - | Critical | Supported | Supported |
| The JVM memory usage of the NameNode consistently exceeds the threshold. | The JVM memory usage of the NameNode consistently remains >= m, for a duration of t seconds (300<=t<=2592000). | Adjusting the heap memory size of the NameNode. | m=85, t=1800 | Critical | Supported | Supported |
| The delay in processing DataNode RPC requests consistently exceeds the threshold. | The delay in processing RPC requests is greater than or equal to m milliseconds, for a duration of t seconds (where 300 <= t <= 2592000). | Manually check | m=300, t=300 | Average | Supported | Not required |
| The current number of DataNode connections consistently exceeds the threshold. | The current number of DataNode connections remains >= m, for a duration of t seconds (300<=t<=2592000). | Manually check | m=2000, t=1800 | Average | Supported | Not required |
| DataNode experiences full GC | A full GC event has occurred in the DataNode. | Parameter optimization. | - | Average | Supported | Not required |
| The JVM memory usage of the DataNode consistently exceeds the threshold. | The JVM memory usage of the DataNode consistently remains >= m, for a duration of t seconds (300<=t<=2592000). | Adjusting the heap memory size of the DataNode. | m=85, t=1800 | Average | Supported | Supported |
| Both NameNode services in HDFS are in Standby status. | Both NameNode roles are concurrently in Standby status. | Manually check | - | Critical | Supported | Supported |
| The number of HDFS MissingBlocks goes over the limit consistently. | The number of MissingBlocks in the cluster is greater than or equal to 'm', persisting for 't' seconds (300 <= t <= 604800). | It is recommended to investigate the occurrence of data block corruption in HDFS, using the command 'hadoop fsck /' to check the distribution of HDFS files. | m=1, t=1800 | Critical | Supported | Supported |
| The HDFS NameNode enters Safe Mode. | The NameNode enters Safe Mode, persisting for 300 seconds. | It is recommended to investigate the occurrence of data block corruption in HDFS, using the command 'hadoop fsck /' to check the distribution of HDFS files. | - | Critical | Supported | Supported |
| The HDFS NameNode has not performed a Checkpoint for an extended period of time. | The NameNode has not performed a Checkpoint for more than m hours. | 1. Inspect the status of the SecondaryNameNode (Standby NameNode). 2. Inspect the parameters dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns in the HDFS configuration file hdfs-site.xml. 3. Examine the log information of the HDFS cluster. | m=24 | Average | Supported | Supported |
YARN | The current number of lost NodeManagers in the cluster consistently exceeds the threshold. | The current number of lost NodeManagers in the cluster is >= m, persisting for t seconds (300<=t<=2592000). | Examine the status of the NM process and verify the network connectivity. | m=1, t=1800 | Average | Supported | Not required |
| The number of Pending Containers consistently surpasses the threshold. | The number of pending Containers is >= m, persisting for t seconds (300<=t<=2592000). | Appropriately allocate available resources for YARN tasks. | m=90, t=1800 | Average | Supported | Not required |
| Cluster memory usage consistently goes over the limit. | Memory Usage >= m, Continuously for t seconds (300<=t<=2592000) | Cluster scale-out | m=85, t=1800 | Critical | Supported | Supported |
| The cluster's CPU usage consistently exceeds the threshold. | CPU usage rate is >= m, persisting for t seconds (300<=t<=2592000). | Cluster scale-out | m=85, t=1800 | Critical | Supported | Supported |
| The number of available CPU cores in each queue consistently falls below the threshold. | The number of available CPU cores in any queue is <= m, persisting for t seconds (300<=t<=2592000). | Allocate additional resources to the queue. | m=1, t=1800 | Average | Supported | Not required |
| The available memory in queues consistently goes below the limit. | The available memory in any queue is <= m, persisting for t seconds (300<=t<=2592000). | Allocate additional resources to the queue. | m=1024, t=1800 | Average | Supported | Not required |
| A primary-secondary switch has occurred in the ResourceManager. | A primary-secondary transition has occurred in the ResourceManager. | Inspect the RM process status and review the standby RM log to determine the cause of the primary-secondary switch. | - | Critical | Supported | Supported |
| The ResourceManager has undergone a full Garbage Collection (GC). | The ResourceManager has undergone a comprehensive Garbage Collection (GC). | Parameter optimization. | - | Critical | Supported | Supported |
| The JVM memory usage of the ResourceManager consistently exceeds the threshold. | The JVM memory usage of the RM has consistently remained greater than or equal to 'm' for a duration of 't' seconds (where 300 <= t <= 2592000). | Adjusting the heap memory size of the ResourceManager. | m=85, t=1800 | Critical | Supported | Supported |
| The NodeManager has undergone a full Garbage Collection (GC). | The NodeManager has undergone a full Garbage Collection (GC). | Parameter optimization. | - | Average | Supported | Not required |
| The available memory of the NodeManager consistently falls below the threshold. | The available memory of a single NodeManager has consistently remained less than or equal to 'm' for a duration of 't' seconds (where 300 <= t <= 2592000). | Modifying the heap memory size of the NodeManager. | m=1, t=1800 | Average | Supported | Not required |
| The JVM memory usage of the NodeManager consistently exceeds the threshold. | The JVM memory usage of the NodeManager has consistently been greater than or equal to 'm' for a duration of 't' seconds (where 300 <= t <= 2592000). | Modifying the heap memory size of the NodeManager. | m=85, t=1800 | Average | Supported | Not required |
| YARN ResourceManager has no Active status | No ResourceManager role is in Active status, persisting for t seconds. | Manually check | t=90 | Critical | Supported | Supported |
| The YARN Application job has failed to execute. | A YARN Application job has failed to execute. | Manually check | m=1, t=300 | Average | Supported | Not required |
HBase | The cluster persistently exceeds the threshold for the number of Regions in Transition (RIT). | The cluster is in a state where the number of RIT Regions is >= m, persisting for t seconds (300<=t<=2592000). | For HBase versions 2.0 and below, use hbase hbck -fixAssignments. | m=1, t=60 | Critical | Supported | Supported |
| The number of dead RegionServers in the cluster consistently exceeds the threshold. | The number of dead RegionServers in the cluster is >= m, persisting for t seconds (300<=t<=2592000). | Manually check | m=1, t=300 | Average | Supported | Supported |
| The average number of REGIONS per RS in the cluster consistently exceeds the threshold. | The average number of REGIONS per RegionServer in the cluster is >= m, persisting for t seconds (300<=t<=2592000). | Node expansion or upgrade | m=300, t=1800 | Average | Supported | Supported |
| HMaster undergoes a full Garbage Collection (GC). | HMaster has undergone a full Garbage Collection (GC). | Parameter optimization. | m=5, t=300 | Average | Supported | Supported |
| The JVM memory usage of the HMaster consistently exceeds the threshold. | The JVM memory usage of the HMaster is >= m, persisting for t seconds (300<=t<=2592000). | Adjusting the heap memory size of HMaster. | m=85, t=1800 | Critical | Supported | Supported |
| The current number of connections to HMaster consistently exceeds the threshold. | The current number of connections to HMaster is >= m, persisting for t seconds (300<=t<=2592000). | Manually check | m=1000, t=1800 | Average | Supported | Not required |
| RegionServer undergoes a full Garbage Collection (GC). | RegionServer undergoes a full Garbage Collection (GC). | Parameter optimization. | m=5, t=300 | Critical | Supported | Not required |
| The JVM memory usage of the RegionServer consistently exceeds the threshold. | The JVM memory usage of the RegionServer is >= m, persisting for t seconds (300<=t<=2592000). | Modifying the heap memory size of RegionServer. | m=85, t=1800 | Average | Supported | Not required |
| The current number of RPC connections to RegionServer consistently surpasses the threshold. | The current number of RPC connections to RegionServer is >= m, persisting for t seconds (300<=t<=2592000). | Manually check | m=1000, t=1800 | Average | Supported | Not required |
| The number of RegionServer Storefiles consistently surpasses the threshold. | The number of RegionServer Storefiles is >= m, persisting for t seconds (300<=t<=2592000). | It is recommended to execute a major compaction. | m=50000, t=1800 | Average | Supported | Not required |
| Both HMaster services in HBase are in Standby status. | Both HMaster roles are concurrently in Standby status. | Manually check | - | Critical | Supported | Supported |
| A primary-secondary switch has occurred in HMaster. | A primary-secondary switch has occurred in HMaster. | Investigate through the HMaster service logs. | - | Critical | Supported | Supported |
Hive | HiveServer2 has undergone a full Garbage Collection (GC). | HiveServer2 has undergone a full Garbage Collection (GC). | Parameter optimization. | m=5, t=300 | Critical | Supported | Supported |
| The JVM memory usage of HiveServer2 consistently exceeds the threshold. | The HiveServer2 JVM memory usage rate is >= m, persisting for t seconds (300<=t<=2592000). | Adjust the heap memory size of HiveServer2. | m=85, t=1800 | Critical | Supported | Supported |
| A full GC event has occurred in HiveMetaStore. | A full GC event has occurred in HiveMetaStore. | Parameter optimization. | m=5, t=300 | Average | Supported | Supported |
| A full GC event has occurred in HiveWebHcat. | A full GC event has occurred in HiveWebHcat. | Parameter optimization. | m=5, t=300 | Average | Supported | Supported |
ZooKeeper | The number of ZooKeeper connections consistently exceeds the threshold. | The number of ZooKeeper connections is >= m, persisting for t seconds (300<=t<=2592000). | Manually check | m=65535, t=1800 | Average | Supported | Not required |
| The number of ZNode nodes consistently surpasses the threshold. | The number of ZNode nodes is >= m, persisting for t seconds (300<=t<=2592000). | Manually check | m=2000, t=1800 | Average | Supported | Not required |
| A leader switch has occurred in ZooKeeper. | A leader switch has occurred in ZooKeeper. | Investigate through the ZooKeeper service logs. | - | Critical | Supported | Supported |
Impala | The ImpalaCatalog JVM memory usage consistently exceeds the threshold. | The ImpalaCatalog JVM memory usage rate is >=m, persisting for t seconds (300<=t<=604800). | Adjust the heap memory size of ImpalaCatalog. | m=0.85, t=1800 | Average | Supported | Not required |
| The ImpalaDaemon JVM memory usage consistently surpasses the threshold. | The ImpalaDaemon JVM memory usage rate is >=m, persisting for t seconds (300<=t<=604800). | Modify the heap memory size of ImpalaDaemon. | m=0.85, t=1800 | Average | Supported | Not required |
| The number of client connections to the Impala Beeswax API exceeds the threshold. | The number of client connections to the Impala Beeswax API is >= m, persisting for t seconds. | Adjust the fe_service_threads value in the impalad.flgs configuration via the console. | m=64, t=120 | Critical | Supported | Supported |
| Number of Impala HS2 client connections exceeds the limit | The number of Impala HS2 client connections is >= m, persisting for t seconds. | Adjust the fe_service_threads value in the impalad.flgs configuration via the console. | m=64, t=120 | Critical | Supported | Supported |
| The runtime of a Query surpasses the threshold. | The runtime of a Query is >= m seconds. | Manually check | - | Critical | Supported | Not required |
| The total number of failed Query executions exceeds the threshold. | The total number of failed Query executions is >= m, with a statistical time granularity of t seconds (300 <= t <= 604800). | Manually check | m=1, t=300 | Critical | Supported | Not required |
| The total number of submitted Queries exceeds the threshold. | The total number of submitted Queries is >= m, with a statistical time granularity of t seconds (300 <= t <= 604800). | Manually check | m=1, t=300 | Critical | Supported | Not required |
| The failure rate of executing Query exceeds the threshold. | The failure rate of Query execution is >= m, with a statistical time granularity of t seconds (300 <= t <= 604800). | Manually check | m=1, t=300 | Critical | Supported | Not required |
PrestoSQL | Number of failed nodes of PrestoSQL consistently goes over the limit | The number of current failed nodes in PrestoSQL is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Manually check | m=1, t=1800 | Critical | Supported | Supported |
| Number of queued tasks in the PrestoSQL resource group consistently goes over the limit | The number of queued tasks in the PrestoSQL resource group is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Parameter optimization. | m=5000, t=1800 | Critical | Supported | Supported |
| Number of failed queries per minute of PrestoSQL goes over the limit | The number of failed PrestoSQL queries is >= m. | Manually check | m=1, t=1800 | Critical | Supported | Not required |
| Full GC happened in PrestoSQLCoordinator | Full GC happened in PrestoSQLCoordinator | Parameter optimization. | - | Average | Supported | Not required |
| The JVM memory usage of PrestoSQLCoordinator consistently exceeds the threshold. | The JVM memory usage rate of PrestoSQLCoordinator is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of PrestoSQLCoordinator. | m=0.85, t=1800 | Critical | Supported | Supported |
| Full GC occurred in PrestoSQLWorker. | Full GC occurred in PrestoSQLWorker. | Parameter optimization. | - | Average | Supported | Not required |
| The JVM memory usage of PrestoSQLWorker consistently surpasses the threshold. | The JVM memory usage rate of PrestoSQLWorker is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of PrestoSQLWorker. | m=0.85, t=1800 | Critical | Supported | Not required |
Presto | Number of failed nodes of Presto consistently goes over the limit | The number of failed nodes in Presto is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Manually check | m=1, t=1800 | Critical | Supported | Supported |
| Number of queued tasks in the Presto resource group consistently goes over the limit | The number of queued tasks in the Presto resource group is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Parameter optimization. | m=5000, t=1800 | Critical | Supported | Supported |
| Number of failed queries per minute of Presto goes over the limit | Number of failed queries in Presto is >= m | Manually check | m=1, t=1800 | Critical | Supported | Not required |
| Full GC event occurred in PrestoCoordinator | Full GC event occurred in PrestoCoordinator | Parameter optimization. | - | Average | Supported | Not required |
| The JVM memory usage of PrestoCoordinator consistently exceeds the threshold. | The JVM memory usage rate of PrestoCoordinator is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of PrestoCoordinator. | m=0.85, t=1800 | Average | Supported | Supported |
| Full GC event occurred in PrestoWorker. | Full GC event occurred in PrestoWorker. | Parameter optimization. | - | Average | Supported | Not required |
| The JVM memory usage of PrestoWorker consistently exceeds the threshold. | The JVM memory usage rate of PrestoWorker is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of PrestoWorker. | m=0.85, t=1800 | Critical | Supported | Not required |
Alluxio | The total number of current Alluxio Workers consistently falls below the threshold. | The total number of current Alluxio Workers is <= m, persisting for a duration of t seconds (300 <= t <= 604800). | Manually check | m=1, t=1800 | Critical | Supported | Not required |
| Alluxio worker layer resource usage consistently goes over the limit. | The capacity usage rate of the current Alluxio Worker layer is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Parameter optimization. | m=0.85, t=1800 | Critical | Supported | Not required |
| A full GC event has occurred in AlluxioMaster. | A full GC event has occurred in AlluxioMaster. | Manually check | - | Average | Supported | Not required |
| The JVM memory usage rate in AlluxioMaster consistently exceeds the threshold. | The JVM memory usage rate in AlluxioMaster is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of the AlluxioMaster. | m=0.85, t=1800 | Critical | Supported | Supported |
| A full GC event has occurred in AlluxioWorker. | A full GC event has occurred in AlluxioWorker. | Manually check | - | Average | Supported | Not required |
| The JVM memory usage rate in AlluxioWorker consistently exceeds the threshold. | The JVM memory usage rate in AlluxioWorker is >= m, persisting for a duration of t seconds (300 <= t <= 604800). | Adjust the heap memory size of the AlluxioWorker. | m=0.85, t=1800 | Critical | Supported | Supported |
Kudu | The degree of imbalance of cluster replicas exceeds the limit. | The degree of imbalance of cluster replicas is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Implement balance among replicas using the rebalance command. | m=100, t=300 | Average | Supported | Supported |
| Number of hybrid clock errors exceeds the limit | The number of hybrid clock errors is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Ensure the NTP daemon is operational and network communication with the NTP server is functioning properly. | m=5000000, t=300 | Average | Supported | Supported |
| The number of tablets in operation exceeds the threshold. | The number of tablets in operation is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | An excessive number of tablets on a single node can impact performance. It is advisable to clean unnecessary tables and partitions, or consider appropriate expansion. | m=1000, t=300 | Average | Supported | Supported |
| The number of tablets in a failed state exceeds the threshold. | The number of tablets in a failed state is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Verify whether any disks are unavailable or data files are damaged. | m=1, t=300 | Average | Supported | Supported |
| Number of failed data directories exceeds the limit. | The number of data directories in a failed state is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Verify whether the paths configured in the fs_data_dirs parameter are available. | m=1, t=300 | Critical | Supported | Supported |
| Number of fully-occupied data directories exceeds the limit | The number of fully-occupied data directories is >= m, persisting for a duration of t seconds (120 <= t <= 3600). | Purge unnecessary data files, or appropriately expand capacity. | m=1, t=120 | Critical | Supported | Supported |
| Number of write requests rejected due to queue overload exceeds the limit. | The number of write requests rejected due to queue overload is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Inspect for the presence of write hotspots or a disproportionately low number of working threads. | m=10, t=300 | Average | Supported | Not required |
| The number of expired scanners exceeds the threshold. | The number of expired scanners is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Upon completion of data reading, remember to invoke the close method of the scanner. | m=100, t=300 | Average | Supported | Supported |
| Number of error logs exceeds the limit | The number of error logs is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Manually check | m=10, t=300 | Average | Supported | Supported |
| Number of timed-out RPC requests in the queue exceeds the threshold. | The number of RPC requests waiting in the queue that have exceeded the timeout threshold is >= m, persisting for a duration of t seconds (300 <= t <= 3600). | Inspect whether the system load is excessively high. | m=100, t=300 | Average | Supported | Supported |
Kerberos | The response time of Kerberos consistently exceeds the threshold. | The response time of Kerberos is >= m (measured in milliseconds), persisting for a duration of t seconds (300 <= t <= 604800). | Manually check | m=100,t=1800 | Critical | Supported | Supported |
Clusters | The execution of the auto-scaling policy has failed. | 1. The execution of the expansion rule has failed due to an insufficient number of elastic IPs bound to the cluster subnet. 2. The execution of the expansion rule has failed due to an insufficient stock of preset expansion resource specifications. 3. The execution of the expansion rule has failed due to insufficient account balance. 4. Internal error. Please check and try again. | 1. Switch to another subnet within the same VPC. 2. Consider switching to a resource specification with ample availability or submit a ticket to contact our internal development team. 3. Recharge your account balance to ensure sufficient funds are available. 4. Submit a ticket to get in touch with our internal development team. | - | Critical | Not required | Supported |
| Execution of the auto-scaling policy timed out | 1. The cluster is currently in a cooldown period, which temporarily prevents any scaling operations. 2. The current retry-on-expiration window is set too short, so the rule could not trigger any scaling operations within the retry period. 3. The cluster is in a state that does not allow scaling. | 1. Adjust the cooldown period of the rule. 2. It is recommended to extend the retry-on-expiration window. 3. Please retry later or submit a ticket to contact our internal development team. | - | Critical | Not required | Supported |
| The auto-scaling policy is not triggered. | 1. Without setting the resource specifications for scaling, the scaling rule cannot be triggered. 2. The elastic resources have reached the maximum node limit, preventing the triggering of scaling. 3. The elastic resources have reached the minimum node limit, preventing the triggering of downscaling. 4. The execution time range for time-based scaling has expired. 5. Without elastic resources in the cluster, the downscaling rule cannot be triggered. | 1. To add a scaling specification configuration, please set at least one elastic resource specification. 2. Elastic resources have exceeded the maximum node limit. If further expansion is required, consider adjusting the maximum node limit. 3. Elastic resources have reached the minimum node limit. If further contraction is required, consider adjusting the minimum node limit. 4. If you wish to continue using this rule for automatic scaling, please modify the effective time range of the rule. 5. Execute the downscaling rule after supplementing the elastic resources. | - | Average | Supported | Supported |
| Auto-scaling was partially successful. | 1. The resource inventory is less than the expansion quantity, so only part of the resources were supplemented. 2. The expansion quantity exceeds the quantity actually deliverable, so only part of the resources were supplemented. 3. The expansion of elastic resources reached the maximum node limit, so the expansion rule was only partially executed. 4. The reduction of elastic resources reached the minimum node limit, so the reduction rule was only partially executed. 5. The elastic IPs of the subnet bound to the cluster are insufficient, so resources could not be replenished. 6. The stock of the preset expansion resource specification is insufficient, so resources could not be replenished. 7. The account balance is insufficient, so resources could not be replenished. | 1. Manually provision sufficient resources to make up the shortfall. 2. Manually provision sufficient resources to make up the shortfall. 3. Elastic resources have reached the maximum node limit. If further expansion is required, consider raising the maximum node limit. 4. Elastic resources have reached the minimum node limit. If further contraction is required, consider lowering the minimum node limit. 5. Switch to another subnet within the same VPC. 6. You may switch to a resource specification with ample stock or submit a ticket to contact our internal development team. 7. Recharge your account balance to ensure sufficient funds are available. | - | Average | Supported | Supported |
| Anomaly detected in the JVM OLD region. | 1. The OLD region has been at 80% occupancy or higher continuously for 5 minutes or more. 2. The JVM memory usage has reached 90%. | Manually check | - | Critical | Supported | Supported |
| Service role health status timed out | The health status of a service role has been timing out for t seconds (180 <= t <= 604800). | The health status of the service role has kept timing out, checked minute by minute. Review the log information for the corresponding service role and take action based on the log details. | t=300 | Average | Supported | Not required |
| Service role status abnormal | The health status of a service role has been abnormal for t seconds (180 <= t <= 604800). | The health status of the service role has remained unavailable, checked minute by minute. Review the log information for the corresponding service role and take action based on the log details. | t=300 | Critical | Supported | Supported |
| Auto-scaling policy expired | Auto-scaling policy expired | Manually check | - | Average | Not required | Supported |
| Node role process restarted | Node role process restarted | Manually check | - | Average | Not required | Supported |
| Bootstrap script execution failed | Bootstrap script execution failed | Manually check | - | Average | Not required | Supported |
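Most threshold-based events above share the same trigger shape: a metric holds at or above m for t consecutive seconds. The sketch below illustrates that evaluation over periodic (timestamp, value) samples; it is an illustration of the semantics, not the actual EMR detection code.

```python
# Illustrative sustained-threshold check for the "metric >= m, persisting
# for t seconds" pattern used by most events in the table above.
# `samples` is a list of (unix_timestamp, value) pairs in ascending order.

def sustained_breach(samples, m, t):
    if not samples:
        return False
    window_start = samples[-1][0] - t
    window = [(ts, v) for ts, v in samples if ts >= window_start]
    # The data must actually cover the full t-second window to count as
    # "persisting"; a lone recent spike is not enough.
    if not window or window[0][0] > window_start:
        return False
    return all(v >= m for _, v in window)

# Example: CPU utilization held at 90% for 1800 s trips the default
# thresholds (m=85, t=1800) of "Continuous CPU Utilization Exceeds Threshold".
samples = [(ts, 90.0) for ts in range(0, 1801, 60)]
print(sustained_breach(samples, m=85, t=1800))  # True
```

For events that count occurrences within a window (for example, failed queries with m=1 and t=300), the same shape applies with a count in place of a gauge value.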