Accessing Kerberized HDFS via Jython

Why Kerberos?

So, you want to do some testing of your shiny new Oozie workflow, or you want to write some simple data management task — nothing complex — but your cluster is Kerberized?

Certainly there are many reasons to use Kerberos on your cluster. A cluster with no permissions is dangerous in even the most relaxed development environment, and simple Unix authentication can suffice for some sharing of a Hadoop cluster; but to be reasonably sure people are not subverting your ACLs or staring at data they should not see, Kerberos is currently the answer.

However, Kerberos and strong authentication are relatively new and sweeping additions [1, 2] to Hadoop, and as such there are not many examples of how to ensure your client code plays nicely.

The below applies if you are writing code that accesses Hadoop services but does not run under Map-Reduce. For example, my use case accessed HDFS data and manipulated Oozie jobs.

The ugly error

If your code is not Kerberos aware and you run it on a kerberized cluster, you will likely get back an error akin to the following — even if you have recently run kinit(1) and have all your tickets in a row:

 File "/tmp/workflow_test/utilities/data_catalog/hdfs.py", line 326, in _new_hdfs_instance fs = FileSystem.get(uri, conf, user) [...] org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: Authentication is required

No ride on the Hadoop without a ticket

This might lead one to check that one's Kerberos tickets are in order. To obtain or refresh your tickets, one would usually run kinit(1), providing a password, to ensure one's TGT (ticket-granting ticket) is not expired. Further, to see that all is well, one can use the klist(1) command.

Where to start

Still, with all tickets in a row, this error is not particularly helpful. You know (and likely want) your cluster to require authentication, but how do you provide your authentication? One might think org.apache.hadoop.security is likely a good start, and one would be right! However, it is a big package with 21 classes in it; there are some likely starting points within:

The UserGroupInformation Class

So, with a crumb to go on, we know the UserGroupInformation class can get us going, but there are some funky secrets and a few wrong paths too.

First we need a Hadoop Configuration object

Many calls in the Hadoop API require an org.apache.hadoop.conf.Configuration object. However, someone who has not had to program against the Hadoop API may take it for granted that all of the /etc/hadoop/conf/{core,hdfs,mapred}-site.xml files are automatically loaded into their Hadoop environment.

To store this information, Hadoop uses a Configuration object. One can make a blank configuration quite easily:

    from org.apache.hadoop.conf import Configuration as HadoopConfiguration

    conf = HadoopConfiguration()

However, it would be nice if that Configuration object had any of our values set in it. (So far, it does not.) Unless one is running their code in an Oozie workflow, which is the one special case I have found, I have not found any environment where this Configuration object will be configured for you; it will simply have blank and default values, nothing from your *-site.xml files.
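To see the blank state for yourself, here is a minimal sketch (assuming a CDH3-era Hadoop, where the default filesystem property is fs.default.name):

    from org.apache.hadoop.conf import Configuration as HadoopConfiguration

    conf = HadoopConfiguration()
    # with no *-site.xml resources loaded, only the built-in defaults are present
    print conf.get("fs.default.name")    # prints file:///, not your NameNode's URI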

To get the Configuration object configured, one has to instantiate a blank Configuration object and load in the various desired XML files; then, if you want to use Kerberos, add in one unusual property!

    import os
    from java.net import URL as JURL    # addResource() takes a java.net.URL
    from org.apache.hadoop.conf import Configuration as HadoopConfiguration

    conf = HadoopConfiguration()

    # default to /etc/hadoop/conf unless $HADOOP_CONF_DIR is set in the environment
    hadoop_conf = os.getenv("HADOOP_CONF_DIR", "/etc/hadoop/conf")

    # add in the desired XML files
    conf.addResource(JURL('file://%s/core-site.xml' % hadoop_conf))
    conf.addResource(JURL('file://%s/hdfs-site.xml' % hadoop_conf))
    conf.addResource(JURL('file://%s/mapred-site.xml' % hadoop_conf))

    # and add in a special directive to use Kerberos
    conf.set("hadoop.security.authentication", "kerberos")

The last directive, setting hadoop.security.authentication to kerberos, causes the UserGroupInformation class, when instantiated, to know that its org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod should be set to the KERBEROS constant; once set, this does not seem changeable on the fly.
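As a quick sanity check, once this Configuration has been handed to UserGroupInformation via setConfiguration() (covered in the next section), the security layer should report itself enabled; a minimal sketch:

    from org.apache.hadoop.security import UserGroupInformation

    # hand the Kerberos-enabled Configuration to the security layer
    UserGroupInformation.setConfiguration(conf)
    print UserGroupInformation.isSecurityEnabled()    # True with the kerberos directive set, False without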

It makes sense that, when running in Oozie, one already has these configuration values and does not need to parse the /etc/hadoop/conf/{core,hdfs,mapred}-site.xml files: when running under Oozie, one is running under Hadoop Map-Reduce and so should be configured already.

Now to finally talk to the Kerberos libraries

With the Configuration object set up, it is time to finally talk to the Kerberos libraries. Now is a good time to point out that, if one wishes to see where the Kerberos exchange initially happens (and what actually happens), a useful debugging option can be passed in when starting jython(1) (or java(1)): -Dsun.security.krb5.debug=true.
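For example, assuming the standard jython launcher, which forwards -J-prefixed options to the underlying JVM (my_hdfs_script.py here is a hypothetical script name):

    jython -J-Dsun.security.krb5.debug=true my_hdfs_script.py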

To tell the system we want to use the Configuration object just set up, we have to pass it to UserGroupInformation.setConfiguration(). After that, either the getLoginUser() or the getCurrentUser() static method will get you a UserGroupInformation instance properly authenticated using your system TGT from your last run of kinit(1).
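Putting that together, a minimal sketch of the login step, reusing the conf object built above:

    from org.apache.hadoop.security import UserGroupInformation

    # tell the security layer to use our Kerberos-enabled Configuration
    UserGroupInformation.setConfiguration(conf)

    # picks up the TGT from your system ticket cache (i.e. from your last kinit)
    ugi = UserGroupInformation.getLoginUser()
    print ugi    # the instance's toString() gives your Kerberos principal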

Now the interesting part: once this UserGroupInformation instance is created, whatever magic it provides seems to happen entirely behind the scenes. There is little further need for the object, unless you need your Kerberos principal string (which the instance's toString() method will give you).
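For example, with the login done, ordinary FileSystem calls now work against the secured cluster; a sketch, where hdfs://namenode.example.com:8020 is a placeholder for your cluster's NameNode URI:

    from java.net import URI
    from org.apache.hadoop.fs import FileSystem, Path

    # namenode.example.com is a placeholder; substitute your NameNode's URI
    fs = FileSystem.get(URI('hdfs://namenode.example.com:8020'), conf)
    for status in fs.listStatus(Path('/user')):
        print status.getPath()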

Pitfalls

In grand Indiana Jones style, there are still some ways to have all the answers and still not get a working Kerberos-authenticated session. One, which highlights the behind-the-scenes magic of what the UserGroupInformation class does: if some earlier API call has already configured a JAAS subject instance for you, then when trying to load your Configuration object with setConfiguration(), one gets the error:

    >>> org.apache.hadoop.security.UserGroupInformation.setConfiguration(conf)
    12/05/04 03:31:23 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.

This makes sense from the perspective that the JAAS subject can contain more than one identity. (For example, what if you have to authenticate to two clusters with differing security domains?) However, this leads me to believe I may have found a way to get Kerberos authentication working without doing it entirely correctly: unless I am only ever doing Kerberos authentication, I do not know how I could load another identity into that JAAS subject. (I have read through much of the HADOOP-6299 code patch, and it seems UserGroupInformation does all the JAAS work for us.)

It seems the classes providing authentication have matured quite a bit from Hadoop 0.20.2 (pre-Kerberos), through CDH3u0 (where my code runs), to Hadoop 1.0.2. Still, it is not always explicit, once one has made the choice to use Kerberos, exactly how things throughout the system might change.

I would like to credit Cloudera with their great documentation on setting up a Kerberized cluster!
