Hadoop

Posts about Apache Hadoop and the related ecosystem

Debugging HBase Unit Tests

By Clay on November 13, 2016

This is likely an obvious process for those who use IDEs and develop in Maven daily, but for those who do operations or otherwise work on the JUnit tests in HBase only infrequently, here’s how I worked when submitting a patch for HBASE-16700. First, create your code: Here I was adding a MasterObserver […]

Posted in HBase, Java

Finding HBase Region Locations

By Clay on June 8, 2015

HBase Region Locality: HBase exposes region locality information per region server over JMX via the hbase.regionserver.percentFilesLocal metric. However, this poses a challenge when running a multi-tenant environment or doing performance analysis: the percentage of local files covers the entire region server, but each region server can serve regions for multiple tables. And […]

Posted in HBase, JRuby

Map/Reduce diff(1)

By Clay on June 8, 2015

This has sadly been a draft for years, so time to release it… diff(1): For those who use Unix, you have likely come across two files and wanted to see what was different between the two. Certainly, one can compare size (highly inaccurate), use a hash function (if a strong cryptographic hash, it will be […]

Posted in Hadoop, Pig

Configuration Files

By Clay on October 21, 2012

Many systems have requirements to store configuration parameters. In these systems, a number of choices can be made for how to store that data; sometimes this diversity is painful, however. Choices for storing configuration data are often: Firefox uses, and Apple often chooses to use, sqlite3 databases; Python programs often use ConfigParser to process initialization (.ini) […]

Posted in Hadoop, Python

Accessing Kerberized HDFS via Jython

By Clay on May 4, 2012

How to access HDFS on a Kerberos-secured Hadoop cluster: code and background!

Posted in Hadoop, Jython | 2 Responses