Cluster Event

Last updated: 2023-12-21 16:22:02

Feature Overview

Cluster events comprise the event list and event policies.
Event List: a record of key changes and abnormal events that occur in the cluster.
Event Policy: lets you customize event monitoring trigger policies to fit your business conditions. Monitored events can be set as cluster inspection items.

Viewing the Event List

1. Log in to the Elastic MapReduce Console. In the cluster list, click on the corresponding cluster ID/Name to access the cluster details page.
2. In the cluster details page, select Cluster Monitoring > Cluster Events > Event List to directly view all operation events of the current cluster.

The severity levels are defined as follows:
Fatal: node or service exceptions that require manual intervention, without which the service becomes unavailable. Such events may persist for some time.
Severe: warnings that have not yet made a service or node unavailable, but may escalate into fatal events if left unhandled.
General: routine events in the cluster that generally require no special handling.
3. Click a value in the Number of Triggers Today column to view the event's trigger records, along with related metrics, logs, and snapshots.


Setting Event Policies

1. Log in to the EMR Console, and in the cluster list, click on the corresponding cluster ID/Name to access the cluster details page.
2. In the cluster details page, select Cluster Monitoring > Cluster Events > Event Policy to customize and set event monitoring trigger policies.
3. The event configuration list includes the event name, event discovery policy, severity (Fatal/Severe/General), and monitoring switch, all of which can be modified and saved.

4. Event discovery policies fall into two categories: system-fixed policies, which cannot be modified, and business-dependent policies, which you can configure to match your own standards.
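Most configurable policies share the same shape: an event fires when a metric stays at or above a threshold m for a sustained window of t seconds. The sketch below illustrates that trigger rule with a hypothetical helper (it is not part of any EMR SDK; the sampling interval is an assumption):

```python
def threshold_breached(samples, m, t, interval=60):
    """Return True if the metric stayed >= m for at least t seconds.

    samples: metric values ordered oldest to newest, one per `interval` seconds.
    Illustrative sketch of the "m for t seconds" rule used by event policies;
    not an EMR API.
    """
    needed = t // interval  # consecutive samples required to cover t seconds
    run = 0
    for v in samples:
        run = run + 1 if v >= m else 0  # reset the streak when the metric dips below m
        if run >= needed:
            return True
    return False

# CPU utilization sampled once a minute; m=85, t=1800 (30 consecutive minutes)
cpu = [80] * 10 + [90] * 30
print(threshold_breached(cpu, m=85, t=1800))  # -> True
```

The same logic applies whether the rule compares upward (>= m, e.g. CPU utilization) or downward (<= m, e.g. available queue memory); only the comparison flips.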

5. Event policies let you enable or disable monitoring per event. Only events with monitoring enabled can be selected as inspection items in cluster inspections. Some events are enabled by default; others are always enabled and cannot be disabled. The specific rules are as follows:
Each event below is listed with its name, significance (discovery policy), recommended measures, default values, severity, whether monitoring can be disabled, and whether it is enabled by default, grouped by category.
Node

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| CPU utilization continuously exceeds the threshold | CPU utilization >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
| Average CPU iowait exceeds the threshold | Average CPU iowait utilization >= m within t seconds (300 <= t <= 2592000) | Check manually | m=60, t=1800 | Severe | Yes | Yes |
| 1-minute CPU load continuously exceeds the threshold | 1-minute CPU load average >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=8, t=1800 | General | Yes | No |
| 5-minute CPU load continuously exceeds the threshold | 5-minute CPU load average >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=8, t=1800 | Severe | Yes | No |
| Memory usage continuously exceeds the threshold | Memory usage >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
| Total number of system processes continuously exceeds the threshold | Total system processes >= m for t seconds (300 <= t <= 2592000) | Check manually | m=10000, t=1800 | Severe | Yes | Yes |
| Node file handle usage continuously exceeds the threshold | Node file handle usage >= m for t seconds (300 <= t <= 2592000) | Check manually | m=85, t=1800 | General | Yes | No |
| Node TCP connection count continuously exceeds the threshold | Node TCP connections >= m for t seconds (300 <= t <= 2592000) | Check for connection leaks | m=10000, t=1800 | General | Yes | No |
| Configured node memory exceeds the limit | The cumulative configured memory of all roles on the node exceeds the node's physical memory threshold | Adjust the heap memory allocated to the node's processes | 90% | Severe | Yes | No |
| Ping to the metadata database failed | The CDB heartbeat was not reported on schedule | - | - | - | - | - |
| Single-disk capacity utilization continuously exceeds the threshold | Capacity utilization of a single disk >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=0.85, t=1800 | Severe | Yes | Yes |
| Single-disk IO device utilization continuously exceeds the threshold | IO device utilization of a single disk >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=0.85, t=1800 | Severe | Yes | Yes |
| Single-disk inode usage continuously exceeds the threshold | Inode usage of a single disk >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=0.85, t=1800 | Severe | Yes | Yes |
| Offset between node UTC time and NTP time exceeds the threshold | The offset between the node's UTC time and NTP time exceeds the threshold (in milliseconds) | 1. Make sure the NTP daemon is running. 2. Make sure the node can reach the NTP server over the network. | offset=30000 | Severe | Yes | Yes |
| Automatic compensation for faulty nodes | When automatic compensation is enabled, if a task or router node fails, the system automatically purchases a node of the same model and configuration as a replacement | 1. If the replacement succeeded, no further action is required. 2. If it failed, go to the console to terminate the node and purchase a replacement manually. | - | General | Yes | Yes |
| Node failure | There are faulty nodes in the cluster | Resolve the issue in the console, or submit a ticket to get help from an engineer | - | Severe | No | Yes |
| Abnormal node disk IO | Node disk IO is abnormal (detected from device IOPS and IO utilization; covers some, but not all, IO anomalies) | Check for an IO hang or a faulty disk | - | Severe | Yes | No |
HDFS

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Total number of HDFS files continuously exceeds the threshold | Total files in the cluster >= m for t seconds (300 <= t <= 2592000) | Increase the NameNode's memory allocation | m=50,000,000, t=1800 | Severe | Yes | No |
| Total number of HDFS blocks continuously exceeds the threshold | Total blocks in the cluster >= m for t seconds (300 <= t <= 2592000) | Increase the NameNode's memory allocation or enlarge the block size | m=50,000,000, t=1800 | Severe | Yes | No |
| Number of dead DataNodes continuously exceeds the threshold | DataNodes marked as Dead >= m for t seconds (300 <= t <= 2592000) | Check manually | m=1, t=1800 | General | Yes | No |
| HDFS storage utilization continuously exceeds the threshold | HDFS storage utilization >= m for t seconds (300 <= t <= 2592000) | Clean up files in HDFS or scale out the cluster | m=85, t=1800 | Severe | Yes | Yes |
| NameNode active/standby switch occurred | An active/standby switch occurred on the NameNode | Investigate the cause of the NameNode switch | - | Severe | Yes | Yes |
| NameNode RPC request latency continuously exceeds the threshold | RPC request latency >= m milliseconds for t seconds (300 <= t <= 2592000) | Check manually | m=300, t=300 | Severe | Yes | No |
| Current number of NameNode connections continuously exceeds the threshold | Current NameNode connections >= m for t seconds (300 <= t <= 2592000) | Check manually | m=2000, t=1800 | General | Yes | No |
| NameNode full GC | A full GC occurred on the NameNode | Tune the parameters | - | Severe | Yes | Yes |
| NameNode JVM memory usage continuously exceeds the threshold | NameNode JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the NameNode's heap size | m=85, t=1800 | Severe | Yes | Yes |
| DataNode RPC request latency continuously exceeds the threshold | RPC request latency >= m milliseconds for t seconds (300 <= t <= 2592000) | Check manually | m=300, t=300 | General | Yes | No |
| Current number of DataNode connections continuously exceeds the threshold | Current DataNode connections >= m for t seconds (300 <= t <= 2592000) | Check manually | m=2000, t=1800 | General | Yes | No |
| DataNode full GC | A full GC occurred on the DataNode | Tune the parameters | - | General | Yes | No |
| DataNode JVM memory usage continuously exceeds the threshold | DataNode JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the DataNode's heap size | m=85, t=1800 | General | Yes | Yes |
| Both HDFS NameNodes are in Standby state | Both NameNode roles are in Standby state at the same time | Check manually | - | Severe | Yes | Yes |
| Number of HDFS MissingBlocks continuously exceeds the threshold | MissingBlocks in the cluster >= m for t seconds (300 <= t <= 604800) | Check HDFS for corrupted blocks; run `hadoop fsck /` to inspect the distribution of HDFS files | m=1, t=1800 | Severe | Yes | Yes |
| HDFS NameNode entered safe mode | The NameNode has been in safe mode for 300 seconds | Check HDFS for corrupted blocks; run `hadoop fsck /` to inspect the distribution of HDFS files | - | Severe | Yes | Yes |
| HDFS NameNode has not checkpointed for a long time | The NameNode has not performed a checkpoint for an extended period | 1. Check the status of the SecondaryNameNode (Standby NameNode). 2. Check dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns in hdfs-site.xml. 3. Review the HDFS cluster logs. | m=24 | General | Yes | Yes |
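For the MissingBlocks and safe-mode events, the recommended `hadoop fsck /` check prints a summary whose "Missing blocks" line can be parsed when automating the check. A small illustrative parser (the sample text is hypothetical; exact fsck output varies by Hadoop version):

```python
import re

def missing_blocks(fsck_summary: str) -> int:
    """Extract the 'Missing blocks' count from `hadoop fsck /` summary text.

    Illustrative only: capture the output of `hadoop fsck /` on the cluster
    and feed it here; the regex assumes the usual 'Missing blocks: N' line.
    """
    match = re.search(r"Missing blocks:\s*(\d+)", fsck_summary)
    return int(match.group(1)) if match else 0

# Hypothetical excerpt of an fsck summary
sample = """
 Total blocks (validated):      1024
 Minimally replicated blocks:   1024
 Missing blocks:                3
"""
print(missing_blocks(sample))  # -> 3
```

A nonzero count corroborates the MissingBlocks event and warrants locating the affected files before taking recovery action.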
YARN

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Number of lost NodeManagers continuously exceeds the threshold | Current lost NodeManagers in the cluster >= m for t seconds (300 <= t <= 2592000) | Check the NodeManager process status and network connectivity | m=1, t=1800 | General | Yes | No |
| Number of pending containers continuously exceeds the threshold | Pending containers >= m for t seconds (300 <= t <= 2592000) | Allocate adequate resources to YARN tasks | m=90, t=1800 | General | Yes | No |
| Cluster memory usage continuously exceeds the threshold | Memory usage >= m for t seconds (300 <= t <= 2592000) | Scale out the cluster | m=85, t=1800 | Severe | Yes | Yes |
| Cluster CPU usage continuously exceeds the threshold | CPU usage >= m for t seconds (300 <= t <= 2592000) | Scale out the cluster | m=85, t=1800 | Severe | Yes | Yes |
| Available CPU cores in a queue continuously fall below the threshold | Available CPU cores in any queue <= m for t seconds (300 <= t <= 2592000) | Allocate more resources to the queue | m=1, t=1800 | General | Yes | No |
| Available memory in a queue continuously falls below the threshold | Available memory in any queue <= m for t seconds (300 <= t <= 2592000) | Allocate more resources to the queue | m=1024, t=1800 | General | Yes | No |
| ResourceManager active/standby switch occurred | An active/standby switch occurred on the ResourceManager | Check the RM process status and review the standby RM logs to determine the cause of the switch | - | Severe | Yes | Yes |
| ResourceManager full GC | A full GC occurred on the ResourceManager | Tune the parameters | - | Severe | Yes | Yes |
| ResourceManager JVM memory usage continuously exceeds the threshold | RM JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the ResourceManager's heap size | m=85, t=1800 | Severe | Yes | Yes |
| NodeManager full GC | A full GC occurred on the NodeManager | Tune the parameters | - | General | Yes | No |
| Available NodeManager memory continuously falls below the threshold | Available memory of a single NodeManager <= m for t seconds (300 <= t <= 2592000) | Adjust the NodeManager's heap size | m=1, t=1800 | General | Yes | No |
| NodeManager JVM memory usage continuously exceeds the threshold | NodeManager JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the NodeManager's heap size | m=85, t=1800 | General | Yes | No |
| No active YARN ResourceManager | No ResourceManager role is in Active state | Check manually | t=90 | Severe | Yes | Yes |
| YARN application execution failed | A YARN application failed to execute | Check manually | m=1, t=300 | General | Yes | No |
HBase

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Number of regions in transition (RIT) continuously exceeds the threshold | Regions in transition in the cluster >= m for t seconds (300 <= t <= 2592000) | For HBase 2.0 and earlier, run `hbase hbck -fixAssignment` | m=1, t=60 | Severe | Yes | Yes |
| Number of dead RegionServers continuously exceeds the threshold | Dead RegionServers in the cluster >= m for t seconds (300 <= t <= 2592000) | Check manually | m=1, t=300 | General | Yes | Yes |
| Average number of regions per RegionServer continuously exceeds the threshold | Average regions per RegionServer >= m for t seconds (300 <= t <= 2592000) | Scale out or upgrade the node | m=300, t=1800 | General | Yes | Yes |
| HMaster full GC | A full GC occurred on the HMaster | Tune the parameters | m=5, t=300 | General | Yes | Yes |
| HMaster JVM memory usage continuously exceeds the threshold | HMaster JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the HMaster's heap size | m=85, t=1800 | Severe | Yes | Yes |
| Current number of HMaster connections continuously exceeds the threshold | Current HMaster connections >= m for t seconds (300 <= t <= 2592000) | Check manually | m=1000, t=1800 | General | Yes | No |
| RegionServer full GC | A full GC occurred on the RegionServer | Tune the parameters | m=5, t=300 | Severe | Yes | No |
| RegionServer JVM memory usage continuously exceeds the threshold | RegionServer JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust the RegionServer's heap size | m=85, t=1800 | General | Yes | No |
| Current number of RegionServer RPC connections continuously exceeds the threshold | Current RegionServer RPC connections >= m for t seconds (300 <= t <= 2592000) | Check manually | m=1000, t=1800 | General | Yes | No |
| Number of RegionServer storefiles continuously exceeds the threshold | RegionServer storefiles >= m for t seconds (300 <= t <= 2592000) | Run a major compaction | m=50000, t=1800 | General | Yes | No |
| Both HBase HMasters are in Standby state | Both HMaster roles are in Standby state at the same time | Check manually | - | Severe | Yes | Yes |
| HMaster active/standby switch occurred | An active/standby switch occurred on the HMaster | Investigate via the HMaster service logs | - | Severe | Yes | Yes |
Hive

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| HiveServer2 full GC | A full GC occurred on HiveServer2 | Tune the parameters | m=5, t=300 | Severe | Yes | Yes |
| HiveServer2 JVM memory usage continuously exceeds the threshold | HiveServer2 JVM memory usage >= m for t seconds (300 <= t <= 2592000) | Adjust HiveServer2's heap size | m=85, t=1800 | Severe | Yes | Yes |
| HiveMetaStore full GC | A full GC occurred on HiveMetaStore | Tune the parameters | m=5, t=300 | General | Yes | Yes |
| HiveWebHcat full GC | A full GC occurred on HiveWebHcat | Tune the parameters | m=5, t=300 | General | Yes | Yes |
ZooKeeper

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Number of ZooKeeper connections continuously exceeds the threshold | ZooKeeper connections >= m for t seconds (300 <= t <= 2592000) | Check manually | m=65535, t=1800 | General | Yes | No |
| Number of znodes continuously exceeds the threshold | Znodes >= m for t seconds (300 <= t <= 2592000) | Check manually | m=2000, t=1800 | General | Yes | No |
| ZooKeeper leader switch occurred | A leader switch occurred in ZooKeeper | Investigate via the ZooKeeper service logs | - | Severe | Yes | Yes |
Impala

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| ImpalaCatalog JVM memory usage continuously exceeds the threshold | ImpalaCatalog JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust ImpalaCatalog's heap size | m=0.85, t=1800 | General | Yes | No |
| ImpalaDaemon JVM memory usage continuously exceeds the threshold | ImpalaDaemon JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust ImpalaDaemon's heap size | m=0.85, t=1800 | General | Yes | No |
| Number of Impala Beeswax API client connections exceeds the threshold | Impala Beeswax API client connections >= m | Adjust fe_service_threads in the impalad.flgs configuration via the console | m=64, t=120 | Severe | Yes | Yes |
| Number of Impala HS2 client connections exceeds the threshold | Impala HS2 client connections >= m | Adjust fe_service_threads in the impalad.flgs configuration via the console | m=64, t=120 | Severe | Yes | Yes |
| Query runtime exceeds the threshold | A query has been running for >= m seconds | Check manually | - | Severe | Yes | No |
| Total number of failed queries exceeds the threshold | Failed query executions >= m within a statistical window of t seconds (300 <= t <= 604800) | Check manually | m=1, t=300 | Severe | Yes | No |
| Total number of submitted queries exceeds the threshold | Submitted queries >= m within a statistical window of t seconds (300 <= t <= 604800) | Check manually | m=1, t=300 | Severe | Yes | No |
| Query failure rate exceeds the threshold | Query execution failure rate >= m within a statistical window of t seconds (300 <= t <= 604800) | Check manually | m=1, t=300 | Severe | Yes | No |
PrestoSQL

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Number of failed PrestoSQL nodes continuously exceeds the threshold | Current failed PrestoSQL nodes >= m for t seconds (300 <= t <= 604800) | Check manually | m=1, t=1800 | Severe | Yes | Yes |
| Number of queued tasks in PrestoSQL resource groups continuously exceeds the threshold | Queued tasks in PrestoSQL resource groups >= m for t seconds (300 <= t <= 604800) | Tune the parameters | m=5000, t=1800 | Severe | Yes | Yes |
| Number of failed PrestoSQL queries per minute exceeds the threshold | Failed PrestoSQL queries >= m | Check manually | m=1, t=1800 | Severe | Yes | No |
| PrestoSQLCoordinator full GC | A full GC occurred on the PrestoSQLCoordinator | Tune the parameters | - | General | Yes | No |
| PrestoSQLCoordinator JVM memory usage continuously exceeds the threshold | PrestoSQLCoordinator JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the PrestoSQLCoordinator's heap size | m=0.85, t=1800 | Severe | Yes | Yes |
| PrestoSQLWorker full GC | A full GC occurred on a PrestoSQLWorker | Tune the parameters | - | General | Yes | No |
| PrestoSQLWorker JVM memory usage continuously exceeds the threshold | PrestoSQLWorker JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the PrestoSQLWorker's heap size | m=0.85, t=1800 | Severe | Yes | No |
Presto

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Number of failed Presto nodes continuously exceeds the threshold | Failed Presto nodes >= m for t seconds (300 <= t <= 604800) | Check manually | m=1, t=1800 | Severe | Yes | Yes |
| Number of queued tasks in Presto resource groups continuously exceeds the threshold | Queued tasks in Presto resource groups >= m for t seconds (300 <= t <= 604800) | Tune the parameters | m=5000, t=1800 | Severe | Yes | Yes |
| Number of failed Presto queries per minute exceeds the threshold | Failed Presto queries >= m | Check manually | m=1, t=1800 | Severe | Yes | No |
| PrestoCoordinator full GC | A full GC occurred on the PrestoCoordinator | Tune the parameters | - | General | Yes | No |
| PrestoCoordinator JVM memory usage continuously exceeds the threshold | PrestoCoordinator JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the PrestoCoordinator's heap size | m=0.85, t=1800 | General | Yes | Yes |
| PrestoWorker full GC | A full GC occurred on a PrestoWorker | Tune the parameters | - | General | Yes | No |
| PrestoWorker JVM memory usage continuously exceeds the threshold | PrestoWorker JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the PrestoWorker's heap size | m=0.85, t=1800 | Severe | Yes | No |
Alluxio

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Total number of Alluxio Workers continuously falls below the threshold | Current total Alluxio Workers <= m for t seconds (300 <= t <= 604800) | Check manually | m=1, t=1800 | Severe | Yes | No |
| Alluxio Worker tier capacity usage continuously exceeds the threshold | Capacity usage of the current Alluxio Worker tier >= m for t seconds (300 <= t <= 604800) | Tune the parameters | m=0.85, t=1800 | Severe | Yes | No |
| AlluxioMaster full GC | A full GC occurred on the AlluxioMaster | Check manually | - | General | Yes | No |
| AlluxioMaster JVM memory usage continuously exceeds the threshold | AlluxioMaster JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the AlluxioMaster's heap size | m=0.85, t=1800 | Severe | Yes | Yes |
| AlluxioWorker full GC | A full GC occurred on an AlluxioWorker | Check manually | - | General | Yes | No |
| AlluxioWorker JVM memory usage continuously exceeds the threshold | AlluxioWorker JVM memory usage >= m for t seconds (300 <= t <= 604800) | Adjust the AlluxioWorker's heap size | m=0.85, t=1800 | Severe | Yes | Yes |
Kudu

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Degree of cluster replica imbalance exceeds the threshold | Replica imbalance >= m for t seconds (300 <= t <= 3600) | Rebalance replicas with the rebalance command | m=100, t=300 | General | Yes | Yes |
| Number of hybrid clock errors exceeds the threshold | Hybrid clock errors >= m for t seconds (300 <= t <= 3600) | Make sure the NTP daemon is running and the node can reach the NTP server | m=5000000, t=300 | General | Yes | Yes |
| Number of running tablets exceeds the threshold | Running tablets >= m for t seconds (300 <= t <= 3600) | Too many tablets on a single node hurts performance; clean up unneeded tables and partitions, or scale out | m=1000, t=300 | General | Yes | Yes |
| Number of failed tablets exceeds the threshold | Tablets in a failed state >= m for t seconds (300 <= t <= 3600) | Check for unavailable disks or damaged data files | m=1, t=300 | General | Yes | Yes |
| Number of failed data directories exceeds the threshold | Data directories in a failed state >= m for t seconds (300 <= t <= 3600) | Verify that the paths configured in fs_data_dirs are available | m=1, t=300 | Severe | Yes | Yes |
| Number of full data directories exceeds the threshold | Full data directories >= m for t seconds (120 <= t <= 3600) | Delete unneeded data files, or scale out capacity | m=1, t=120 | Severe | Yes | Yes |
| Number of write requests rejected due to queue overload exceeds the threshold | Write requests rejected due to queue overload >= m for t seconds (300 <= t <= 3600) | Check for write hotspots or too few worker threads | m=10, t=300 | General | Yes | No |
| Number of expired scanners exceeds the threshold | Expired scanners >= m for t seconds (300 <= t <= 3600) | Call the scanner's close method when you finish reading data | m=100, t=300 | General | Yes | Yes |
| Number of error log entries exceeds the threshold | Error log entries >= m for t seconds (300 <= t <= 3600) | Check manually | m=10, t=300 | General | Yes | Yes |
| Number of timed-out RPC requests in the queue exceeds the threshold | RPC requests that timed out while waiting in the queue >= m for t seconds (300 <= t <= 3600) | Check whether the system load is too high | m=100, t=300 | General | Yes | Yes |
Kerberos

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Kerberos response time continuously exceeds the threshold | Kerberos response time >= m milliseconds for t seconds (300 <= t <= 604800) | Check manually | m=100, t=1800 | Severe | Yes | Yes |
Clusters

| Event Name | Event Significance | Recommendations & Measures | Default Value | Severity | Can Be Disabled | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- |
| Auto-scaling rule execution failed | 1. Scale-out failed because the subnet bound to the cluster does not have enough available IPs. 2. Scale-out failed because stock of the preset instance specification is insufficient. 3. Scale-out failed because the account balance is insufficient. 4. Internal error. | 1. Switch to another subnet in the same VPC. 2. Switch to a specification with ample stock, or submit a ticket to contact the development team. 3. Top up the account to ensure sufficient funds. 4. Submit a ticket to contact the development team. | - | Severe | No | Yes |
| Auto-scaling rule execution timed out | 1. The cluster is in a cooldown period, so scaling is temporarily blocked. 2. The retry-on-expiration window is too short, so the rule could not trigger scaling within it. 3. The cluster is in a state that does not allow scaling. | 1. Adjust the rule's cooldown period. 2. Extend the retry-on-expiration window. 3. Retry later, or submit a ticket to contact the development team. | - | Severe | No | Yes |
| Auto-scaling rule not triggered | 1. No scaling instance specification is configured, so the rule cannot trigger. 2. Elastic resources have reached the maximum node limit, so scale-out cannot trigger. 3. Elastic resources have reached the minimum node limit, so scale-in cannot trigger. 4. The effective time range of the time-based rule has expired. 5. The cluster has no elastic resources, so the scale-in rule cannot trigger. | 1. Configure at least one elastic instance specification. 2. Raise the maximum node limit if further scale-out is needed. 3. Lower the minimum node limit if further scale-in is needed. 4. Modify the rule's effective time range to keep using it for auto scaling. 5. Add elastic resources before running the scale-in rule. | - | General | Yes | Yes |
| Auto-scaling partially successful | 1. Resource stock is below the requested scale-out quantity, so only some resources were added. 2. The requested quantity exceeds the deliverable quantity, so only some resources were added. 3. Scale-out reached the maximum node limit, so the rule only partially succeeded. 4. Scale-in reached the minimum node limit, so the rule only partially succeeded. 5. The subnet bound to the cluster does not have enough available IPs, so resources could not be added. 6. Stock of the preset instance specification is insufficient, so resources could not be added. 7. The account balance is insufficient, so resources could not be added. | 1-2. Manually add nodes once sufficient stock is available. 3. Raise the maximum node limit if further scale-out is needed. 4. Lower the minimum node limit if further scale-in is needed. 5. Switch to another subnet in the same VPC. 6. Switch to a specification with ample stock, or submit a ticket to contact the development team. 7. Top up the account to ensure sufficient funds. | - | General | Yes | Yes |
| JVM old generation anomaly | An anomaly was detected in the JVM old generation | Check manually | 1. Old generation usage >= 80% for 5 minutes or more. 2. JVM memory usage reaches 90%. | Severe | Yes | Yes |
| Service role health status timed out | The health status of a service role has timed out for t seconds (180 <= t <= 604800), checked minute by minute | Review the logs of the corresponding service role and act on them | t=300 | General | Yes | No |
| Abnormal service role status | The health status of a service role has been unavailable for t seconds (180 <= t <= 604800), checked minute by minute | Review the logs of the corresponding service role and act on them | t=300 | Severe | Yes | Yes |
| Auto-scaling policy expired | The auto-scaling policy has expired | Check manually | - | General | No | Yes |
| Node role process restarted | A node role process was restarted | Check manually | - | General | No | Yes |
| Bootstrap script execution failed | A bootstrap script failed to execute | Check manually | - | General | No | Yes |