title | description | services | documentationcenter | author | manager | ms.service | ms.custom | ms.devlang | ms.topic | ms.tgt_pltfrm | ms.workload | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
One or more region servers dead | Microsoft Docs |
Diagnosing and fixing dead region servers on hbase cluster |
hdinsight |
gkanade |
ashitg |
hdinsight |
hdinsightactive |
na |
article |
na |
big-data |
10/04/2017 |
gkanade |
If you are running HBase cluster v3.4 you might have been hit by a potential bug caused by upgrade of jdk to version 1.7.0_151. The symptom we see is region server process starts occupying close to 200% CPU (to verify this run the top command; if there is a process occupying close to 200% CPU get its pid and confirm it is region server process by running ps -aux | grep ) and the region server is essentially rendered dead, causing alerts to fire on HBase Master process and cluster to not function at full capacity.
The mitigation/solution for the problem at a high level (details below) is to:
- Install jdk 1.8 on ALL nodes of the cluster as below:
Run the script action https://raw.githubusercontent.com/Azure/hbase-utils/master/scripts/upgradetojdk18allnodes.sh
Be sure to select the option to run on all nodes. Alternatively, you can log in to every individual node and run the command
"sudo add-apt-repository ppa:openjdk-r/ppa -y && sudo apt-get -y update && sudo apt-get install -y openjdk-8-jdk"
- Go to Ambari UI - https://<clusterdnsname>.azurehdinsight.net; go to HBase->Configs->Advanced->Advanced hbase-env configs and change the variable JAVA_HOME as below:
"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64."
Save the config change.
-
[Optional but recommended] Flush all tables on cluster. https://blogs.msdn.microsoft.com/azuredatalake/2016/09/19/hdinsight-hbase-how-to-improve-hbase-cluster-restart-time-by-flushing-tables/
-
From Ambari UI again, restart all HBase services that need restart.
-
Depending on the data on cluster, it might take a few minutes to upto an hour for the cluster to reach stable state. The way you confirm the cluster reaches stable state is by either checking HMaster UI (all region servers should be active) from Ambari (refresh) or from headnode run hbase shell and then run status command
To verify that your upgrade was successful check that the relevant HBase processes are started using the appropriate java version - for instance for regionserver check as
"ps -aux | grep regionserver, and verify the version like '''/usr/lib/jvm/java-8-openjdk-amd64/bin/java"