{"id":61,"date":"2012-05-04T04:01:01","date_gmt":"2012-05-04T04:01:01","guid":{"rendered":"http:\/\/clayb.net\/blog\/?p=61"},"modified":"2012-05-04T19:32:41","modified_gmt":"2012-05-04T19:32:41","slug":"accessing-kerberized-hdfs-via-jython","status":"publish","type":"post","link":"https:\/\/clayb.net\/blog\/accessing-kerberized-hdfs-via-jython\/","title":{"rendered":"Accessing Kerberized HDFS via Jython"},"content":{"rendered":"<h1>Why Kerberos?<\/h1>\n<p>So, you want to do some testing of your shiny new Oozie workflow, or you want to write some simple data management task &#8212; nothing complex &#8212; but your cluster is Kerberized?<\/p>\n<p>Certainly there are many reasons to use Kerberos on your cluster. A cluster with no permissions is dangerous in even the most relaxed development environment, while simple Unix authentication can suffice for some sharing of a Hadoop cluster &#8212; but to be reasonably sure people are not subverting your ACLs or staring at data they should not be, Kerberos is currently the answer.<br \/>\n<!--more--><br \/>\nHowever, Kerberos and strong authentication are relatively new and sweeping additions<sup><a href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-4343\">1<\/a>,<a href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-1701\">2<\/a><\/sup> to Hadoop. As such, there are not a ton of examples on how to ensure your client code plays nicely.<\/p>\n<p>What follows applies if you are writing code that accesses Hadoop services but does not run under Map-Reduce. 
For example, my use case accessed HDFS data and manipulated Oozie jobs.<\/p>\n<h2>The ugly error<\/h2>\n<p>If your code is not Kerberos-aware and you run it on a kerberized cluster, you will likely get back an error akin to the following &#8212; even if you have recently run <tt>kinit(1)<\/tt> and have all your tickets in a row:<\/p>\n<pre><code>  File \"\/tmp\/workflow_test\/utilities\/data_catalog\/hdfs.py\", line 326, in _new_hdfs_instance\n    fs = FileSystem.get(uri, conf, user)\n[...]\norg.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: Authentication is required<\/code><\/pre>\n<h2>No ride on the Hadoop without a ticket<\/h2>\n<p>This might lead one to check that one&#8217;s Kerberos tickets are in order. To verify your tickets, one would usually run the command <tt><a href=\"http:\/\/web.mit.edu\/kerberos\/krb5-1.5\/krb5-1.5.4\/doc\/krb5-user\/Obtaining-Tickets-with-kinit.html\">kinit(1)<\/a><\/tt> to provide a password and ensure one&#8217;s TGT (<a href=\"http:\/\/web.mit.edu\/kerberos\/www\/krb5-1.2\/krb5-1.2.6\/doc\/user-guide.html#SEC2\">ticket granting ticket<\/a>) is not expired. Further, to see that all is well, one can use the <tt><a href=\"http:\/\/web.mit.edu\/kerberos\/krb5-1.5\/krb5-1.5.4\/doc\/krb5-user\/Viewing-Your-Tickets-with-klist.html\">klist(1)<\/a><\/tt> command.<\/p>\n<h1>Where to start<\/h1>\n<p>Still, with all tickets in a row, this error is not particularly helpful. You know (and likely want) your cluster to require authentication, but how do you provide your authentication? One might think <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/package-summary.html\">org.apache.hadoop.security<\/a> is likely a good start &#8212; and they would be right! 
However, it&#8217;s a big package with 21 classes in it; there are some likely starting points within:<\/p>\n<ul>\n<li><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/SecurityUtil.html\">org.apache.hadoop.security.SecurityUtil<\/a> &#8212; this one had me for a bit, but as a user, I will not be using a <a href=\"http:\/\/web.mit.edu\/kerberos\/krb5-1.5\/krb5-1.5.4\/doc\/krb5-install\/The-Keytab-File.html\">keytab<\/a>, as I am simply granted a TGT upon login; were I an automated service, I would be using a keytab.<\/li>\n<li><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.HadoopLoginModule.html\">org.apache.hadoop.security.UserGroupInformation.HadoopLoginModule<\/a> &#8212; this had me next; I tried to understand the JAAS (Java Authentication and Authorization Service) <a href=\"http:\/\/docs.oracle.com\/javase\/6\/docs\/api\/javax\/security\/auth\/spi\/LoginModule.html\">LoginModule<\/a> notes and looked up a <a href=\"http:\/\/docs.oracle.com\/javase\/1.4.2\/docs\/guide\/security\/jaas\/tutorials\/GeneralAcnAndAzn.html\">tutorial<\/a> and a <a href=\"http:\/\/docs.oracle.com\/javase\/1.4.2\/docs\/guide\/security\/jaas\/JAASLMDevGuide.html\">developer&#8217;s guide<\/a>; there is even a reasonable <a href=\"http:\/\/www.jaasbook.com\/\">book<\/a> on the subject (chapters 2 and 3 are applicable here). 
However, I have not yet gotten Kerberos working through JAAS in the way JAAS intends.<\/li>\n<li><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html\">org.apache.hadoop.security.UserGroupInformation<\/a> &#8212; then, I found a <em>hadoop-common-user<\/em> mail thread <a href=\"http:\/\/mail-archives.apache.org\/mod_mbox\/hadoop-common-user\/201101.mbox\/%3CAANLkTimYv3UQkAJHzC9GM2eyx9Ztbn-JTDwPOtSfmoDn@mail.gmail.com%3E\">&#8220;Accessing Hadoop using Kerberos&#8221;<\/a> from someone trying to do the same thing early last year. Alas, it offered little more than a reference to the <tt>UserGroupInformation<\/tt> class; still, there was hope!<\/li>\n<\/ul>\n<h1>The <tt>UserGroupInformation<\/tt> Class<\/h1>\n<p>So, with a crumb to go on, we know the <tt>UserGroupInformation<\/tt> class can get us going, but there are some funky secrets and a few wrong paths too.<\/p>\n<h2>First we need a Hadoop <tt>Configuration<\/tt> object<\/h2>\n<p>Many calls in the Hadoop API require an <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/conf\/Configuration.html\">org.apache.hadoop.conf.Configuration<\/a> object. However, one who does not have to program against the Hadoop API may take it for granted that all of the <tt>\/etc\/hadoop\/conf\/{core,hdfs,mapred}-site.xml<\/tt> files simply get automatically loaded into their Hadoop environment for them.<\/p>\n<p>To store this information, Hadoop uses a <tt>Configuration<\/tt> object. One can make a blank configuration quite easily:<\/p>\n<pre><code>from org.apache.hadoop.conf import Configuration as HadoopConfiguration\nconf = HadoopConfiguration()<\/code><\/pre>\n<p>However, it would be nice if that <tt>Configuration<\/tt> object had any of our values set in it. (So far, it does not, except in one special case that I have found.) 
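<\/p>\n<p>One can see just how blank it is from a <tt>jython(1)<\/tt> prompt; this is only a sketch, and the property queried below is merely an illustrative guess at what a stock install hands back:<\/p>\n<pre><code>from org.apache.hadoop.conf import Configuration as HadoopConfiguration\nconf = HadoopConfiguration()\n# a blank Configuration only knows the compiled-in defaults;\n# nothing from your *-site.xml files is present yet\nprint conf.get(\"fs.default.name\")   # typically the local-filesystem default\n<\/code><\/pre>\n<p>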
Aside from code running in an Oozie workflow, I have not found any environment where this <tt>Configuration<\/tt> object will be configured for you &#8212; it will simply have blank and default values &#8212; nothing from your <tt>*-site.xml<\/tt> files.<\/p>\n<p>To get the <tt>Configuration<\/tt> object configured, one has to instantiate a blank <tt>Configuration<\/tt> object, load in the various desired XML files and add in an unusual property &#8212; that is, if you want to use Kerberos!<\/p>\n<pre><code>from org.apache.hadoop.conf import Configuration as HadoopConfiguration\nimport os\n\nconf = HadoopConfiguration()\n# default to \/etc\/hadoop\/conf unless $HADOOP_CONF_DIR is set in the environment\nhadoop_conf = os.getenv(\"HADOOP_CONF_DIR\", \"\/etc\/hadoop\/conf\")\n# add in desired XML files\nconf.addResource(hdfs.JURL('file:\/\/%s\/core-site.xml' % hadoop_conf))\nconf.addResource(hdfs.JURL('file:\/\/%s\/hdfs-site.xml' % hadoop_conf))\nconf.addResource(hdfs.JURL('file:\/\/%s\/mapred-site.xml' % hadoop_conf))\n# and add in a special directive to use Kerberos\nconf.set(\"hadoop.security.authentication\", \"kerberos\")<\/code><\/pre>\n<p>The last directive, setting <tt>hadoop.security.authentication<\/tt> to <tt>kerberos<\/tt>, causes the <tt>UserGroupInformation<\/tt> class &#8212; when instantiated &#8212; to know that its <tt><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.AuthenticationMethod.html\">org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod<\/a><\/tt> should be set to the <tt>Kerberos<\/tt> constant; this setting does not seem changeable on-the-fly.<\/p>\n<p><small>It makes sense that, when running in Oozie, one should have their configuration values and not need to parse <tt>\/etc\/hadoop\/conf\/{core,hdfs,mapred}-site.xml<\/tt> files. 
After all, when running under Oozie, one is running under Hadoop Map-Reduce and so should already be configured.<\/small><\/p>\n<h2>Now to finally talk to the Kerberos libraries<\/h2>\n<p>With the <tt>Configuration<\/tt> object set up, it is time to finally talk to the Kerberos libraries. Now is a good time to point out that, if one wishes to see where the Kerberos exchange initially happens &#8212; and what actually happens &#8212; a useful debugging option can be passed in when starting <tt>jython(1)<\/tt> (or <tt>java(1)<\/tt>): <tt>-Dsun.security.krb5.debug=true<\/tt>.<\/p>\n<p>To tell the system we want to use the <tt>Configuration<\/tt> object just set up, we have to use the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html#setConfiguration%28org.apache.hadoop.conf.Configuration%29\">setConfiguration<\/a> static method on the <tt>UserGroupInformation<\/tt> class. If you are running with the Kerberos debug flag, this will result in a line about <tt>\/etc\/krb5.conf<\/tt>, yet still we have not gotten our Kerberos credentials. Getting our Kerberos credentials does not happen until instantiating a <tt>UserGroupInformation<\/tt> object.<\/p>\n<p>We will not see any grand firework display to know that we are properly authenticated (though the debug output is pretty voluminous). Still, the authentication routines are kept quiet &#8212; to the point that there is no constructor for the <tt>UserGroupInformation<\/tt> class &#8212; one has to call a static method to get a class instance. 
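<\/p>\n<p>Putting those pieces together, here is a minimal <tt>jython(1)<\/tt> sketch of the login step. It assumes the <tt>conf<\/tt> object built above, and the printed principal is only an illustrative example:<\/p>\n<pre><code>from org.apache.hadoop.security import UserGroupInformation\n\n# point the security machinery at our Kerberos-aware Configuration\nUserGroupInformation.setConfiguration(conf)\n# no public constructor; a static method hands back an instance\n# authenticated from the TGT left behind by kinit(1)\nugi = UserGroupInformation.getLoginUser()\nprint ugi.toString()   # the Kerberos principal, e.g. someuser@EXAMPLE.COM<\/code><\/pre>\n<p>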
Either the <tt><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html#getLoginUser%28%29\">getLoginUser()<\/a><\/tt> or <tt><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html#getCurrentUser%28%29\">getCurrentUser()<\/a><\/tt> methods will work here to get you a <tt>UserGroupInformation<\/tt> instance properly authenticated using your system TGT from your last run of <tt>kinit(1)<\/tt>.<\/p>\n<p>The interesting part is that, once this <tt>UserGroupInformation<\/tt> instance is created, whatever magic it provides seems to happen entirely behind the scenes. There is little further need for this object, unless you need your <a href=\"http:\/\/web.mit.edu\/kerberos\/krb5-1.5\/krb5-1.5.4\/doc\/krb5-user\/What-is-a-Kerberos-Principal_003f.html\">Kerberos principal<\/a> string (you can use the instance&#8217;s <tt><a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html#toString%28%29\">toString()<\/a><\/tt> method to get it).<\/p>\n<h1>Pitfalls<\/h1>\n<p>In grand <a href=\"http:\/\/www.imdb.com\/media\/rm2224131328\/tt0082971\">Indiana Jones<\/a> style, there are still some ways to have all the answers and yet not get a working Kerberos-authenticated session. One, which highlights the behind-the-scenes magic of the <tt>UserGroupInformation<\/tt> class: if some API call has already configured a <a href=\"http:\/\/docs.oracle.com\/javase\/1.4.2\/docs\/api\/javax\/security\/auth\/Subject.html\">JAAS <tt>subject<\/tt><\/a> instance for you, then when trying to load your <tt>Configuration<\/tt> object with <tt>setConfiguration()<\/tt>, one gets the error:<\/p>\n<pre><code>&gt;&gt;&gt; org.apache.hadoop.security.UserGroupInformation.setConfiguration(conf)\n12\/05\/04 03:31:23 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
<\/code><\/pre>\n<p>This makes sense from the perspective that the JAAS <tt>subject<\/tt> can contain more than one identity. (For example, what if you have to authenticate to two clusters with differing security domains?) However, this leads me to believe I may have gotten Kerberos authentication working without doing it entirely correctly: unless Kerberos is the only authentication I am doing, I do not know how I could load another identity into that JAAS <tt>subject<\/tt>. (I have read through much of the <a href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-6299\">HADOOP-6299<\/a> code <a href=\"https:\/\/issues.apache.org\/jira\/secure\/attachment\/12434362\/HADOOP-6299-Y20.patch\">patch<\/a> and it seems <tt>UserGroupInformation<\/tt> does all the JAAS work for us.)<\/p>\n<p>It seems the classes providing authentication have matured quite a bit from <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.20.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html\">Hadoop 0.20.2<\/a> (pre-Kerberos), through <a href=\"http:\/\/archive.cloudera.com\/cdh\/3\/hadoop-0.20.2-cdh3u0\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html\">CDH3u0<\/a> (<strong>where my code runs<\/strong>), to <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r1.0.2\/api\/org\/apache\/hadoop\/security\/UserGroupInformation.html\">Hadoop 1.0.2<\/a>. 
However, it is still not always explicit, once one has made the choice to use Kerberos, exactly how things throughout the system might change.<\/p>\n<p>I would like to credit Cloudera for their great <a href=\"https:\/\/ccp.cloudera.com\/display\/CDHDOC\/CDH3+Security+Guide\">documentation<\/a> on setting up a Kerberized cluster!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to access HDFS on a Kerberos-secured Hadoop cluster &#8212; code and background!<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14,15],"tags":[],"_links":{"self":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts\/61"}],"collection":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/comments?post=61"}],"version-history":[{"count":0,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/posts\/61\/revisions"}],"wp:attachment":[{"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/media?parent=61"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/categories?post=61"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/clayb.net\/blog\/wp-json\/wp\/v2\/tags?post=61"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}