HBase Region Locality
HBase provides information on region locality via JMX on each region server through the hbase.regionserver.percentFilesLocal metric. However, this presents a challenge when running a multi-tenant environment or doing performance analysis: the percentage of local files covers the entire region server, yet each region server can serve regions for multiple tables, and each region can be made up of multiple store files, each with its own block locations.
If one is doing a performance evaluation for a single table, this server-level metric is not sufficient!
How to See Each Region
To see a more detailed breakdown, we can use HDFS to tell us where a file's blocks live. Further, we can point HDFS at the files making up a table by looking under the HBase hbase.rootdir, building up a list of LocatedFileStatus objects, one per file. Nicely, LocatedFileStatus provides getBlockLocations(), which yields the hosts serving each HDFS block.
Lastly, all we need to do is correlate which region servers have local blocks for the regions they serve; from that we can compute a per-table locality percentage.
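As a minimal sketch of the HDFS side (the table path below is hypothetical; listFiles() and getBlockLocations() are standard Hadoop FileSystem APIs):

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

fs = FileSystem.newInstance(HBaseConfiguration.new())
# recursively list the store files under one (hypothetical) table directory
files = fs.listFiles(Path.new("/hbase/data/default/TestTable"), true)
while files.hasNext()
  fStatus = files.next() # a LocatedFileStatus
  fStatus.getBlockLocations().each do |block|
    puts "#{fStatus.getPath()} @#{block.getOffset()}: #{block.getHosts().to_a.join(',')}"
  end
end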
Implementation
One can do nifty things in the HBase shell, as it is really a full JRuby shell. In particular, one can enter arbitrary Java to run, which works great for debugging or for running performance tests.
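For instance, a trivial session driving Java classes straight from the shell prompt (the returned value here is hypothetical):

hbase(main):001:0> import org.apache.hadoop.hbase.HBaseConfiguration
hbase(main):002:0> HBaseConfiguration.new().get("hbase.rootdir")
=> "hdfs://namenode:8020/hbase"

The following is the full JRuby needed; it can be saved to a file and executed via hbase shell <file name> or simply copied and pasted into the shell.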
require 'set'
require 'pathname' # Pathname is used below but is not loaded by default
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.io.Text
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.util.NoSuchElementException
import java.io.FileNotFoundException
# Return a Hash of region UUIDs to hostnames with column family stubs
#
# tableName - table to return regions for
#
# Example
#   getRegionUUIDs "TestTable"
#   # => {"3fe594363a2c13a3550f752db147194b"=>{"host" => "r1n1.example.com", "cfs" => {"f1" => {}, "f2" => {}}},
#   #    "da19a80cc403daa9a8f82ac9a1253e9d"=>{"host" => "r1n2.example.com", "cfs" => {"f1" => {}, "f2" => {}}}}
#
def getRegionUUIDs(tableName)
  c = HBaseConfiguration.new()
  tableNameObj = TableName.valueOf(tableName)
  t = HTable.new(c, tableNameObj)
  regions = t.getRegionsInRange(t.getStartKeys[0],
                                t.getEndKeys[t.getEndKeys.size - 1])
  # get all column families -- XXX do all regions have to host all CFs?
  cfs = t.getTableDescriptor.getFamilies().map { |cf| cf.getNameAsString() }
  # map each region's encoded name to its hosting server plus empty per-CF stubs
  r_to_host = regions.map do |r|
    [r.getRegionInfo().getEncodedName(),
     Hash["host" => r.getHostname(), "cfs" => Hash[cfs.map { |cf| [cf, Hash.new()] }]]]
  end
  Hash[r_to_host]
end
def findHDFSBlocks(regions, tableName)
  # augment regions with HDFS block locations
  augmented = regions.clone
  c = HBaseConfiguration.new()
  fs = FileSystem.newInstance(c)
  hbase_rootdir = c.get("hbase.rootdir")
  tableNameObj = TableName.valueOf(tableName)
  nameSpace = tableNameObj.getNamespaceAsString
  baseTableName = tableNameObj.getQualifierAsString
  # use the default namespace if none is given
  nameSpace = "default" if nameSpace == tableName
  regions.each do |r, values|
    values["cfs"].keys().each do |cf|
      # store files for this region and column family live under
      # <hbase.rootdir>/data/<namespace>/<table>/<region>/<cf>
      rPath = Path.new(Pathname.new(hbase_rootdir).join("data", nameSpace, baseTableName, r, cf).to_s)
      begin
        files = fs.listFiles(rPath, true)
      rescue java.io.FileNotFoundException
        next
      end
      # record, per store file, the set of hosts serving each block
      while files.hasNext()
        fStatus = files.next()
        hosts = fStatus.getBlockLocations().map { |block| Set.new(block.getHosts().to_a) }
        augmented[r]["cfs"][cf][File.basename(fStatus.getPath().toString())] = hosts
      end
    end
  end
  augmented
end
def computeLocalityByBlock(regions)
  # collect an identifier for every block that is not hosted by the
  # region server serving its region
  non_local_blocks = []
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each_index do |idx|
          non_local_blocks.push(Pathname.new(r).join(cf, id, idx.to_s).to_s) unless blocks[idx].include?(values["host"])
        end
      end
    end
  end
  non_local_blocks
end
def totalBlocks(regions)
  regions.map do |r, values|
    values["cfs"].map do |cf, hFiles|
      hFiles.map do |id, blocks|
        blocks.count
      end
    end
  end.flatten().reduce(0, :+)
end
tables = list
tables.each do |tableName|
  puts tableName
  begin
    regions = getRegionUUIDs(tableName)
    hdfs_blocks_by_region = findHDFSBlocks(regions, tableName)
    non_local_blocks = computeLocalityByBlock(hdfs_blocks_by_region)
    total_blocks = totalBlocks(hdfs_blocks_by_region)
    puts non_local_blocks.length().to_f / total_blocks if total_blocks > 0 # e.g. if table is not empty or disabled
  rescue org.apache.hadoop.hbase.TableNotFoundException
    # table disappeared between list and inspection; skip it
    true
  end
end
One will get output of the form: table name, newline, then a float between 0.0 and 1.0 giving the fraction of non-local blocks (0.0 is fully local). Should a table be offline or deleted (TableNotFoundException), or files vanish from HDFS mid-scan, the exception is swallowed. When a table cannot be calculated, no float appears in the output (the line is simply skipped); when HDFS location data is missing for a block, the computation assumes that block to be non-local.
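For illustration, a run over a cluster with two tables might print something of the form (table names and values hypothetical):

TestTable
0.0625
usertable
0.0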
Post-Script
Some nice follow-on work to make this data into a more useful metric might be to augment it with the size of the blocks (in records or bytes) and compute a locality percentage by size rather than by block count alone. Further, for folks using standby region replicas, breaking out the locality of replicated blocks may be important as well; a rough sketch of the first idea follows.
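As a sketch of the byte-weighted variant (assuming findHDFSBlocks is altered to capture each block's length alongside its hosts via BlockLocation's getLength(); computeLocalityBySize is a hypothetical helper, not part of the script above):

# in findHDFSBlocks, store [hosts, length] pairs per block:
#   hosts = fStatus.getBlockLocations().map { |block|
#     [Set.new(block.getHosts().to_a), block.getLength()] }
def computeLocalityBySize(regions)
  # fraction of store-file bytes served locally, rather than fraction of blocks
  local_bytes = 0
  total_bytes = 0
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each do |hosts, length|
          total_bytes += length
          local_bytes += length if hosts.include?(values["host"])
        end
      end
    end
  end
  total_bytes > 0 ? local_bytes.to_f / total_bytes : 1.0
end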