Hadoop

Posts about Apache Hadoop and the related ecosystem

Debugging HBase Unit Tests

This is likely an obvious process for those who use IDEs and develop in Maven daily, but for those who do operations or otherwise work on the JUnit tests in HBase only infrequently, here’s how I worked when submitting a patch for HBASE-16700. First, create your code: Here I was adding a MasterObserver […]

Finding HBase Region Locations

HBase Region Locality: HBase provides information on region locality per region server via JMX, through hbase.regionserver.percentFilesLocal. However, there is a challenge when running a multi-tenant environment or doing performance analysis. This percentage of local files covers the entire region server, but each region server can of course serve regions for multiple tables. And […]

Map/Reduce diff(1)

This has sadly been a draft for years, so time to release it… diff(1): If you use Unix, you have likely come across two files and wanted to see what was different between the two. Certainly, one can compare size (highly inaccurate), use a hash function (if a strong cryptographic hash, it will be […]

Configuration Files

Many systems need to store configuration parameters, and a number of choices can be made for how to store that data; sometimes this diversity is painful, however. Choices for storing configuration data are often: Firefox and, often, Apple use sqlite3 databases; Python programs often use ConfigParser to process initialization (.ini) […]

Accessing Kerberized HDFS via Jython

How to access HDFS on a Kerberos-secured Hadoop cluster: code and background!