Finding HBase Region Locations

HBase Region Locality

HBase provides information on region locality via JMX on each region server through the hbase.regionserver.percentFilesLocal metric. However, there is a challenge when running a multi-tenant environment or doing performance analysis: this percentage of local files covers the entire region server, yet each region server can serve regions for multiple tables. Further, each region can be made up of multiple store files, each with its own block locations.
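For instance, here is a minimal sketch of reading that metric off a region server's JMX servlet (the host name is hypothetical, and the bean name and info port of 16030 assume HBase 1.x defaults; older releases use 60030):

require 'net/http'
require 'json'

# r1n1.example.com is a hypothetical region server host
uri = URI("http://r1n1.example.com:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server")
bean = JSON.parse(Net::HTTP.get(uri))["beans"].first
puts bean["percentFilesLocal"]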

If one is doing a performance evaluation for a table, these metrics are not sufficient!

How to See Each Region

To see a more detailed breakdown, we can ask HDFS where each file’s blocks live. Further, we can point HDFS at the files making up a table by looking under the HBase hbase.rootdir and building up a list of LocatedFileStatus objects, one per file. Conveniently, LocatedFileStatus provides getBlockLocations(), which returns the hosts serving each HDFS block.
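As a quick illustration, the following sketch is runnable from the HBase shell and prints the hosts serving each block of one table's store files (the /hbase/data/default/TestTable path is an assumption based on the default layout under hbase.rootdir):

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

fs = FileSystem.newInstance(HBaseConfiguration.new)
# recursively list the store files under a (hypothetical) table directory
files = fs.listFiles(Path.new("/hbase/data/default/TestTable"), true)
while files.hasNext
  f = files.next # a LocatedFileStatus
  f.getBlockLocations.each do |block|
    puts "#{f.getPath.getName} -> #{block.getHosts.to_a.join(', ')}"
  end
end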

Lastly, all we need to do is correlate which region servers hold local blocks for the regions they serve; the table's locality fraction is then 1 minus the number of non-local blocks divided by the total number of blocks.

Implementation

One can do nifty things in the HBase shell, as it is really a full JRuby shell. In particular, one can enter arbitrary Java to run, which works great for debugging or for running performance tests. The following is the needed JRuby; it can be saved to a file and executed via hbase shell <file name>, or simply copied and pasted into the shell.

require 'set'
require 'pathname'
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.io.Text
 
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.util.NoSuchElementException
import java.io.FileNotFoundException
 
# Return a Hash of region UUIDs to hostnames with column family stubs
#
# tableName - table to return regions for
#
# Example
# getRegionUUIDs "TestTable"
# # => {"3fe594363a2c13a3550f752db147194b"=>{"host" => "r1n1.example.com", "cfs" => {"f1" => {}, "f2" => {}},
#       "da19a80cc403daa9a8f82ac9a1253e9d"=>{"host" => "r1n2.example.com", "cfs" => {"f1" => {}, "f2" => {}}}}
#
def getRegionUUIDs(tableName)
  c = HBaseConfiguration.new()
  tableNameObj = TableName.valueOf(tableName)
  t = HTable.new(c, tableNameObj)
  regions = t.getRegionsInRange(t.getStartKeys[0],
                                t.getEndKeys[t.getEndKeys.size-1])
  # get all column families -- XXX do all regions have to host all CF's?
  cfs = t.getTableDescriptor.getFamilies().map{ |cf| cf.getNameAsString() }
 
  r_to_host = regions.map{|r| [r.getRegionInfo().getEncodedName(), Hash["host" => r.getHostname(), "cfs" => Hash[cfs.map{|cf| [cf, Hash.new()] }]]] }
 
  Hash[r_to_host]
end
 
def findHDFSBlocks(regions, tableName)
  # augment regions with HDFS block locations
  augmented = regions.clone
  c = HBaseConfiguration.new()
  fs = FileSystem.newInstance(c)
  hbase_rootdir = c.get("hbase.rootdir")
  tableNameObj = TableName.valueOf(tableName)
  nameSpace = tableNameObj.getNamespaceAsString
  baseTableName = tableNameObj.getQualifierAsString
  # use the default namespace if none is given
  nameSpace = "default" if nameSpace == tableName
 
  regions.each do |r, values|
    values["cfs"].keys().each do |cf|
      rPath = Path.new(Pathname.new(hbase_rootdir).join("data", nameSpace, baseTableName, r, cf).to_s)
      begin
        files = fs.listFiles(rPath, true)
      rescue java.io.FileNotFoundException
        next
      end
 
      begin
        begin
          fStatus = files.next()
          hosts = fStatus.getBlockLocations().map { |block| Set.new(block.getHosts().to_a) }
          augmented[r]["cfs"][cf][File.basename(fStatus.getPath().toString())] = hosts
        rescue NativeException, java.util.NoSuchElementException
          fStatus = false
        end
      end until fStatus == false
    end
  end
  augmented
end
 
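# Return identifiers of the form region/cf/file/block-index for every HDFS
# block not hosted on the region server serving its region
#
# regions - region Hash augmented with HDFS block locations
#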
def computeLocalityByBlock(regions)
  non_local_blocks = []
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each_index do |idx|
          non_local_blocks.push(Pathname.new(r).join(cf, id, idx.to_s).to_s) unless blocks[idx].include?(values["host"])
        end
      end
    end
  end
  non_local_blocks
end
 
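# Count the total number of HDFS blocks across all store files of all regions
#
# regions - region Hash augmented with HDFS block locations
#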
def totalBlocks(regions)
  regions.map do |r, values|
    values["cfs"].map do |cf, hFiles|
      hFiles.map do |id, blocks|
        blocks.count
      end
    end
  end.flatten().reduce(0, :+)
end
 
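# Driver: compute and print the locality fraction for every table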
tables = list
tables.each do |tableName|
  puts tableName
  begin
    regions = getRegionUUIDs(tableName)
    hdfs_blocks_by_region = findHDFSBlocks(regions, tableName)
    non_local_blocks = computeLocalityByBlock(hdfs_blocks_by_region)
    total_blocks = totalBlocks(hdfs_blocks_by_region)
    # report locality (1.0 = fully local); skip empty or disabled tables
    puts 1.0 - non_local_blocks.length.to_f/total_blocks if total_blocks > 0
  rescue org.apache.hadoop.hbase.TableNotFoundException
    true
  end
end

 

One will get output of the form: table name, newline, locality as a float (0.0-1.0). Should a table be offline or deleted (TableNotFoundException), the exception is swallowed and no float appears for that table (its line is simply skipped). Should HDFS data not be found for a column family (FileNotFoundException, e.g. a store file removed by a compaction mid-scan), those files are skipped and excluded from the computation.
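For example, the output might look like the following (table names and values purely illustrative):

TestTable
0.9876
OtherTable
1.0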

Post-Script

Some nice follow-on work to make this data into a more useful metric might be to augment it with the size of the blocks (in records or bytes) and compute a locality percentage weighted by size rather than by block count alone. Further, for folks using standby region replicas, breaking out the locality of replicated blocks may be important as well.
