{"id":86,"date":"2015-06-08T01:01:24","date_gmt":"2015-06-08T01:01:24","guid":{"rendered":"http:\/\/clayb.net\/blog\/?p=86"},"modified":"2021-02-16T18:06:03","modified_gmt":"2021-02-16T18:06:03","slug":"finding-hbase-region-locations","status":"publish","type":"post","link":"https:\/\/clayb.net\/blog\/finding-hbase-region-locations\/","title":{"rendered":"Finding HBase Region Locations"},"content":{"rendered":"<h1>HBase Region Locality<\/h1>\n<p>HBase provides information on region locality via JMX per region server via the <tt><a href=\"http:\/\/hbase.apache.org\/book\/hbase_metrics.html\">hbase.regionserver.percentFilesLocal<\/a><\/tt>. However, there is a challenge when running a multi-tenant environment or doing performance analysis. This percent of files local is for the entire region server but of course each region server can serve regions for multiple tables. And further, each region can be made up of multiple store files each with their own location.<\/p>\n<p>If one is doing a performance evaluation for a table, these metrics are not sufficient!<\/p>\n<h1>How to See Each Region<\/h1>\n<p>To see a more detailed breakdown, we can use HDFS to tell us where a file&#8217;s blocks live. Further, we can point HDFS to the files making up a table by looking under the HBase&nbsp;<tt><a href=\"https:\/\/github.com\/apache\/hbase\/blob\/820f629423f21fbd1dcc7a383955443a2595fd5d\/hbase-common\/src\/main\/resources\/hbase-default.xml#L52-L64\">hbase.rootdir<\/a><\/tt> and build up a list of <tt><a href=\"https:\/\/hadoop.apache.org\/docs\/r2.2.0\/api\/org\/apache\/hadoop\/fs\/LocatedFileStatus.html\">LocatedFileStatus<\/a><\/tt> objects for each file. 
Nicely, <tt>LocatedFileStatus<\/tt> provides <tt><a href=\"https:\/\/hadoop.apache.org\/docs\/r2.2.0\/api\/org\/apache\/hadoop\/fs\/LocatedFileStatus.html#getBlockLocations%28%29\">getBlockLocations()<\/a><\/tt>, which returns the hosts serving each HDFS block.<\/p>\n<p>Lastly, all we need to do is correlate which region servers have local blocks for the regions they are serving; then we can come up with a per-table locality percentage.<\/p>\n<h1>Implementation<\/h1>\n<p>One can do nifty things in the HBase shell, as it is really a full JRuby shell. In particular, one can enter arbitrary Java to run, which works great for debugging &#8212; or for running performance tests. The following is the needed JRuby, which can be saved to a file and executed via <tt>hbase shell &lt;file name&gt;<\/tt> or simply copied and pasted into the shell.<\/p>\n<pre class=\"brush:ruby\">require 'set'\nrequire 'pathname'\ninclude Java\nimport org.apache.hadoop.hbase.HBaseConfiguration\nimport org.apache.hadoop.hbase.HColumnDescriptor\nimport org.apache.hadoop.hbase.HConstants\nimport org.apache.hadoop.hbase.HTableDescriptor\nimport org.apache.hadoop.hbase.client.HBaseAdmin\nimport org.apache.hadoop.hbase.client.HTable\nimport org.apache.hadoop.hbase.TableName\nimport org.apache.hadoop.io.Text\n \nimport org.apache.hadoop.fs.FileSystem\nimport org.apache.hadoop.fs.Path\nimport java.util.NoSuchElementException\nimport java.io.FileNotFoundException\n \n# Return a Hash of region UUIDs to hostnames with column family stubs\n#\n# tableName - table to return regions for\n#\n# Example\n# getRegionUUIDs \"TestTable\"\n# # =&gt; {\"3fe594363a2c13a3550f752db147194b\"=&gt;{\"host\" =&gt; \"r1n1.example.com\", \"cfs\" =&gt; {\"f1\" =&gt; {}, \"f2\" =&gt; {}},\n#       \"da19a80cc403daa9a8f82ac9a1253e9d\"=&gt;{\"host\" =&gt; \"r1n2.example.com\", \"cfs\" =&gt; {\"f1\" =&gt; {}, \"f2\" =&gt; {}}}}\n#\ndef getRegionUUIDs(tableName)\n  c = HBaseConfiguration.new()\n  tableNameObj = TableName.valueOf(tableName)\n  t = HTable.new(c, 
tableNameObj)\n  regions = t.getRegionsInRange(t.getStartKeys[0],\n                                t.getEndKeys[t.getEndKeys.size-1])\n  # get all column families -- XXX do all regions have to host all CF's?\n  cfs = HTable.new(c, tableNameObj).getTableDescriptor.getFamilies().map{ |cf| cf.getNameAsString() }\n \n  r_to_host = regions.map{|r| [r.getRegionInfo().getEncodedName(), Hash[\"host\" =&gt; r.getHostname(), \"cfs\" =&gt; Hash[cfs.map{|cf| [cf, Hash.new()] }]]] }\n \n  Hash[r_to_host]\nend\n \ndef findHDFSBlocks(regions, tableName)\n  # augment regions with HDFS block locations\n  augmented = regions.clone\n  c = HBaseConfiguration.new()\n  fs = FileSystem.newInstance(c)\n  hbase_rootdir = c.select{|r| r.getKey() == \"hbase.rootdir\"}.first.getValue\n  tableNameObj = TableName.valueOf(tableName)\n  nameSpace = tableNameObj.getNamespaceAsString\n  baseTableName = tableNameObj.getQualifierAsString\n  # use the default namespace if none was given\n  nameSpace = \"default\" if nameSpace == tableName\n \n  regions.each do |r, values|\n    values[\"cfs\"].keys().each do |cf|\n      rPath = Path.new(Pathname.new(hbase_rootdir).join(\"data\", nameSpace, baseTableName, r, cf).to_s)\n      begin\n        files = fs.listFiles(rPath, true)\n      rescue java.io.FileNotFoundException\n        next\n      end\n \n      begin\n        begin\n          fStatus = files.next()\n          hosts = fStatus.getBlockLocations().map { |block| Set.new(block.getHosts().to_a) }\n          augmented[r][\"cfs\"][cf][File.basename(fStatus.getPath().toString())] = hosts\n        rescue NativeException, java.util.NoSuchElementException\n          fStatus = false\n        end\n      end until fStatus == false\n    end\n  end\n  augmented\nend\n \ndef computeLocalityByBlock(regions)\n  non_local_blocks = []\n  regions.each do |r, values|\n    values[\"cfs\"].each do |cf, hFiles|\n      hFiles.each do |id, blocks|\n        blocks.each_index do |idx|\n          
non_local_blocks.push(Pathname.new(r).join(cf, id, idx.to_s).to_s) unless blocks[idx].include?(values[\"host\"])\n        end\n      end\n    end\n  end\n  non_local_blocks\nend\n \ndef totalBlocks(regions)\n  regions.map do |r, values|\n    values[\"cfs\"].map do |cf, hFiles|\n      hFiles.map do |id, blocks|\n        blocks.count\n      end\n    end\n  end.flatten().reduce(0, :+)\nend\n \ntables = list\ntables.each do |tableName|\n  puts tableName\n  begin\n    regions = getRegionUUIDs(tableName)\n    hdfs_blocks_by_region = findHDFSBlocks(regions, tableName)\n    non_local_blocks = computeLocalityByBlock(hdfs_blocks_by_region)\n    total_blocks = totalBlocks(hdfs_blocks_by_region)\n    puts non_local_blocks.length().to_f\/total_blocks if total_blocks &gt; 0 # e.g. if table not empty or disabled\n  rescue org.apache.hadoop.hbase.TableNotFoundException\n    true\n  end\nend<\/pre>\n<p>The output is, for each table, the table name followed on the next line by a float giving the fraction of that table&#8217;s blocks which are non-local (0.0-1.0). Should the table be offline, deleted (<a href=\"https:\/\/hbase.apache.org\/apidocs\/org\/apache\/hadoop\/hbase\/TableNotFoundException.html\">TableNotFoundException<\/a>), an HDFS block moved, etc., the exception will be swallowed. If a table cannot be calculated, no float appears in the output (the line is simply skipped); if HDFS data cannot be found, the locality computation assumes that block is non-local.<\/p>\n<h1>Post-Script<\/h1>\n<p>Some nice follow-on work to make this data into a useful metric might be to augment it with the size of the blocks (in records or bytes) and determine a locality percentage by size, not just by block count. 
Further, for folks using <a href=\"http:\/\/issues.apache.org\/jira\/browse\/HBASE-10070\">stand-by regions<\/a>, breaking out the locality of replicated blocks may be important as well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>HBase Region Locality HBase provides information on region locality via JMX per region server via the hbase.regionserver.percentFilesLocal. However, there is a challenge when running a multi-tenant environment or doing performance analysis. This percent of files local is for the entire region server but of course each region server can serve regions for multiple tables. And [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19,25],"tags":[],"_links":{"self":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts\/86"}],"collection":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/comments?post=86"}],"version-history":[{"count":0,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts\/86\/revisions"}],"wp:attachment":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/media?parent=86"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/categories?post=86"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/tags?post=86"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}