HBase Region Locality
HBase provides information on region locality via JMX on each region server through the hbase.regionserver.percentFilesLocal metric. However, there is a challenge when running a multi-tenant environment or doing performance analysis: this percent of files local covers the entire region server, yet each region server can serve regions for multiple tables. Further, each region can be made up of multiple store files, each with its own block locations.
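For reference, the aggregate number can be read off a region server's JMX JSON servlet. The hostname below is illustrative, and the default region server info port is 60030 on older releases (16030 on newer ones):

curl -s 'http://rs1.example.com:60030/jmx' | grep percentFilesLocal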
If one is doing a performance evaluation for a table, these metrics are not sufficient!
How to See Each Region
To see a more detailed breakdown, we can ask HDFS where a file's blocks live. Further, we can point HDFS at the files making up a table by looking under the HBase hbase.rootdir and building up a list of LocatedFileStatus objects, one per file. Conveniently, LocatedFileStatus provides getBlockLocations(), which tells us the hosts serving each HDFS block.
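As a minimal sketch of the idea (assuming a table named TestTable in the default namespace and the standard data layout under hbase.rootdir), one could print the hosts for every block of every store file:

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

c = HBaseConfiguration.new()
fs = FileSystem.newInstance(c)
# table data lives under <hbase.rootdir>/data/<namespace>/<table>
files = fs.listFiles(Path.new("#{c.get('hbase.rootdir')}/data/default/TestTable"), true)
while files.hasNext()
  fStatus = files.next()
  fStatus.getBlockLocations().each do |block|
    puts "#{fStatus.getPath()} -> #{block.getHosts().to_a.join(',')}"
  end
end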
Lastly, all we need to do is correlate the hosts holding each block with the region server serving that block's region; from there we can compute a locality percentage for the table.
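Concretely, if a table's store files span N HDFS blocks and M of them have no replica on the region server hosting their region, the table's locality is 1 - M/N. With hypothetical counts:

blocks_total     = 400  # hypothetical
blocks_non_local = 15   # hypothetical
locality = 1.0 - blocks_non_local.to_f / blocks_total  # => 0.9625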
Implementation
One can do nifty things in the HBase shell, as it is really a full JRuby shell. In particular, one can enter arbitrary Java to run, which works great for debugging or for running performance tests. The following is the needed JRuby, which can be saved to a file and executed via hbase shell <file name>, or simply copied and pasted into the shell.
require 'set'
require 'pathname'
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.io.Text
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.util.NoSuchElementException
import java.io.FileNotFoundException

# Return a Hash of region UUIDs to hostnames with column family stubs
#
# tableName - table to return regions for
#
# Example
#   getRegionUUIDs "TestTable"
#   # => {"3fe594363a2c13a3550f752db147194b"=>{"host" => "r1n1.example.com", "cfs" => {"f1" => {}, "f2" => {}}},
#   #     "da19a80cc403daa9a8f82ac9a1253e9d"=>{"host" => "r1n2.example.com", "cfs" => {"f1" => {}, "f2" => {}}}}
def getRegionUUIDs(tableName)
  c = HBaseConfiguration.new()
  tableNameObj = TableName.valueOf(tableName)
  t = HTable.new(c, tableNameObj)
  regions = t.getRegionsInRange(t.getStartKeys[0],
                                t.getEndKeys[t.getEndKeys.size-1])
  # get all column families -- XXX do all regions have to host all CF's?
  cfs = HTable.new(c, tableNameObj).getTableDescriptor.getFamilies().map { |cf| cf.getNameAsString() }
  r_to_host = regions.map do |r|
    [r.getRegionInfo().getEncodedName(),
     Hash["host" => r.getHostname(), "cfs" => Hash[cfs.map { |cf| [cf, Hash.new()] }]]]
  end
  Hash[r_to_host]
end

def findHDFSBlocks(regions, tableName)
  # augment regions with HDFS block locations
  augmented = regions.clone
  c = HBaseConfiguration.new()
  fs = FileSystem.newInstance(c)
  hbase_rootdir = c.select { |r| r.getKey() == "hbase.rootdir" }.first.getValue
  tableNameObj = TableName.valueOf(tableName)
  nameSpace = tableNameObj.getNamespaceAsString
  baseTableName = tableNameObj.getQualifierAsString
  # use the default namespace if not given
  nameSpace = "default" if nameSpace == tableName
  regions.each do |r, values|
    values["cfs"].keys().each do |cf|
      rPath = Path.new(Pathname.new(hbase_rootdir).join("data", nameSpace, baseTableName, r, cf).to_s)
      begin
        files = fs.listFiles(rPath, true)
      rescue java.io.FileNotFoundException
        next
      end

      begin
        begin
          fStatus = files.next()
          hosts = fStatus.getBlockLocations().map { |block| Set.new(block.getHosts().to_a) }
          augmented[r]["cfs"][cf][File.basename(fStatus.getPath().toString())] = hosts
        rescue NativeException, java.util.NoSuchElementException
          fStatus = false
        end
      end until fStatus == false
    end
  end
  augmented
end

# return the list of blocks whose replicas do not include the serving region server
def computeLocalityByBlock(regions)
  non_local_blocks = []
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each_index do |idx|
          non_local_blocks.push(Pathname.new(r).join(cf, id, idx.to_s).to_s) unless blocks[idx].include?(values["host"])
        end
      end
    end
  end
  non_local_blocks
end

# count all blocks across all regions, column families, and store files
def totalBlocks(regions)
  regions.map do |r, values|
    values["cfs"].map do |cf, hFiles|
      hFiles.map do |id, blocks|
        blocks.count
      end
    end
  end.flatten().reduce(0, :+)
end

# iterate over every table in the cluster and print its non-local block fraction
tables = list
tables.each do |tableName|
  puts tableName
  begin
    regions = getRegionUUIDs(tableName)
    hdfs_blocks_by_region = findHDFSBlocks(regions, tableName)
    non_local_blocks = computeLocalityByBlock(hdfs_blocks_by_region)
    total_blocks = totalBlocks(hdfs_blocks_by_region)
    puts non_local_blocks.length().to_f/total_blocks if total_blocks > 0 # e.g. if table not empty or disabled
  rescue org.apache.hadoop.hbase.TableNotFoundException
    true
  end
end
One will get output of the form: table name, then on the next line a float giving the fraction of non-local blocks (0.0-1.0); subtract it from 1.0 for the locality percentage. Should the table be offline, deleted (TableNotFoundException), an HDFS block be moved, etc., the exception will be swallowed. If a table could not be calculated, no float appears in the output (the line is simply skipped); if HDFS data disappears mid-scan (e.g., a store file removed by compaction), the remaining files for that column family are skipped and excluded from the totals.
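For example, saving the script as table_locality.rb (a hypothetical file name; the table names and values below are likewise illustrative) and running it might look like:

hbase shell table_locality.rb
TestTable
0.0375
usertable
0.0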
Post-Script
Some nice follow-on work to make this data into a more useful metric might be to augment it with the size of the blocks (in records or bytes) and determine a locality percentage based on size, not only block counts. Further, for folks using stand-by regions, breaking out the locality of replicated blocks may be important as well.
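A byte-weighted variant could reuse the structures above. If findHDFSBlocks stored each block's length alongside its host set (BlockLocation provides getLength()), the computation might look like this hypothetical helper:

# Hypothetical helper, assuming findHDFSBlocks was changed to store
# [Set-of-hosts, byte-length] pairs per block, e.g.:
#   fStatus.getBlockLocations().map { |b| [Set.new(b.getHosts().to_a), b.getLength()] }
def localityBySize(regions)
  local_bytes = 0
  total_bytes = 0
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each do |hosts, length|
          total_bytes += length
          local_bytes += length if hosts.include?(values["host"])
        end
      end
    end
  end
  # guard against empty or disabled tables
  total_bytes > 0 ? local_bytes.to_f / total_bytes : 0.0
end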