Generating a Topographic Map

By Clay on February 16, 2021

My life has evolved over the last decade with home ownership and land ownership, and in the last year my first construction project has come online: I am looking to build my workshop, the first permanent building on my land.

To start the project, my partner and I have cleared a quarter acre of timber to reuse in the project. We now have a lovely stack of 50 logs and have sold quite a quantity of firewood. This left us with about 100 stumps sticking out of the ground on a site that is “reasonably” flat for the land. The stumps were dug out with our backhoe and will be reused, hopefully to nourish one of our fields.

Now we have a site with little vegetation above or in the ground; we have about one to two feet of broken sandstone and then bedrock. The only problem is that it is “reasonably” (or, more specifically, not at all) level.

One issue at this stage of construction is making sure the view and the drainage of our site are correct. To achieve that I am building the site up rather than effectively digging a hole to level it. However, a key question for us and our 13 yard dump truck is: how much material will bring us to level, and at what height?

For that, I have determined it is time to measure! And from those measurements, we can create a topographic map to send to a geotechnical engineer and ask the computer to tell us how much fill material we need. A builder’s level, matplotlib and numpy to the rescue!

Builder’s Level

For those of you who are unfamiliar with a builder’s level, it is a precision distance- and height-measuring instrument which uses the fact that light travels in straight lines. One levels a builder’s level on a tripod with a bubble level and then sights on a grade rod. A grade rod is a flat bar with demarcations; in my case the rod has a mark every tenth of an inch and can extend up to 20′ tall. One also uses a bubble level when holding the grade rod to ensure the rod is perfectly plumb. Using these three devices, one can measure the height of the ground quite precisely and the distance from the builder’s level quite well too. A nice example of the usage can be seen demonstrated by fellow Coloradoan Jim Anderson on Fine Homebuilding.
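As a rough sketch of how the readings turn into numbers: the distance comes from the interval on the rod between the upper and lower stadia lines, multiplied by the instrument's stadia ratio, and the depth comes from comparing the center-line reading with the reading taken at the datum. The ratio of 100 below is a commonly used value but is an assumption about the instrument; the function names and sample readings are made up for illustration.

```python
def stadia_distance(upper_reading, lower_reading, ratio=100.0):
    """Distance to the rod from the two stadia-line readings.

    ratio=100 is an assumed stadia constant; check the instrument's manual.
    """
    return (upper_reading - lower_reading) * ratio


def depth_from_datum(rod_reading, datum_reading):
    """Positive when the ground is lower than the datum (fill needed)."""
    return rod_reading - datum_reading


# Hypothetical readings, all in feet
print(stadia_distance(5.5, 5.0))    # 50.0
print(depth_from_datum(6.25, 4.0))  # 2.25
```

A larger reading on the rod means the ground at that spot sits further below the level line of sight, which is why the difference is taken in that order.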

My partner and I spent an afternoon surveying four lines of the site and came up with a list of measurements. This resulted in a file of x, y coordinates and height at the view of the builder’s level. These 42 measurements give us a reasonable idea, but I wish I had more.

To read in the data we recorded, I chose to make a dictionary whose keys are the x-axis values, the x-axis being our North to South line of measurement. Each dictionary value is then a dictionary whose keys are the y-axis (East-West) distance from the level, determined via the builder’s level’s stadia lines (lines above and below the measurement line), and whose values are the depth (how much fill is needed at that point to level) relative to the eye-level height at our datum. This was all read in via a CSV file:

import csv
from collections import defaultdict

# read in data: heights[x][y] = depth below datum
heights = defaultdict(dict)
with open("Workshop/Grading/from_datum") as f:
    h = csv.DictReader(f, fieldnames=["x", "y", "z"])
    for l in h:
        heights[int(l['x'])].update({int(l['y']):float(l['z'])})
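To make the nested layout concrete, here is a small self-contained sketch that builds the same kind of structure from an in-memory CSV and then flattens it back into (x, y, z) points. The three sample rows are hypothetical and are not the real survey data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows (x, y, z); not the real survey data
sample = io.StringIO("0,0,1.5\n0,10,2.0\n20,0,0.5\n")

heights = defaultdict(dict)
for row in csv.DictReader(sample, fieldnames=["x", "y", "z"]):
    # outer key: North-South position, inner key: East-West distance
    heights[int(row["x"])][int(row["y"])] = float(row["z"])

# flatten back to one (x, y, z) tuple per observation, sorted by x then y
points = [(x, y, z)
          for x in sorted(heights)
          for y, z in sorted(heights[x].items())]

print(dict(heights))
print(points)
```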

Still, how to visualize the points?

Verify the Data

First things first: verify our measurement data. At first this simply meant working out how to do a crude three-dimensional scatter plot to ensure that I did not put down bad data in my notes. I looked to use Python and the matplotlib library for this. One thing I did not anticipate was the variety of data layouts that matplotlib might use!

Scatter and Polygon Plots

One way to create a three-dimensional scatter plot in matplotlib is to lay out your data in three arrays. All arrays are indexed the same for each data point: one array for the x-dimension, one for the y-dimension and one for the z-dimension. This is quite easy and straightforward!

# create data arrays
x_array = list()
y_array = list()
z_array = list()
x_keys = list(heights.keys())
x_keys.sort()
for (x, d) in ((x, heights[x]) for x in x_keys):
     for (y, z) in d.items():
         x_array.append(x)
         y_array.append(y)
         z_array.append(z)

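# build a scatter plot of observations
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
fig = plt.figure()
# on matplotlib 3.6+ use fig.add_subplot(projection='3d') instead
ax = fig.gca(projection='3d')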
ax.set_title("Builders-Level Observations of "
             "Terrain Surface", fontsize=16)
ax.set_xlabel('North-South distance (ft)')
ax.set_ylabel('East-West distance (ft)')
ax.set_zlabel('Depth from Datum') 
ax.scatter([-1*v for v in x_array], y_array, z_array)
fig.show()

Next comes a three-dimensional polygon plot to show the profile of our data. This allows me to see if the data represents the curves, divots and bumps in the land as I know it. However, as opposed to the scatter plot, here we need a two-dimensional vertex array. It was a bit confusing to work out how to lay out the data from the add_collection3d API docs.

In the end, the vertices are specified as an array of (in my case) y-dimension and z-dimension tuples. From there one then provides an array of x-dimensions along which those vertices are rendered. A fiddly addition needed as well is a termination point for the polygon, so one doesn’t have a funky slope from the first to last point which would draw the eye away from the important data of what we recorded; similarly, since we have no other points with which to visualize the singular y-axis recordings, we have to skip all of those.

Filled Polygons Showing X-Axes Observations

Note that this code only shows how the x-axis polygons were created; the full code download shows the process for creating the polygon which goes down y=0 as well.

# polygon vertices used for filled plot graph
import matplotlib.colors as mcolors
from matplotlib.collections import PolyCollection
verts=[]
x_keys = list(heights.keys())
x_keys.sort()
for (x, d) in ((x, heights[x]) for x in x_keys): 
  line = []

  for (y, z) in d.items():
    # need y, z lines
    line.append((y,z))

  # only plot filled lines if we have
  # two or more observations
  if len(line) > 1:
    # set the last value of every filled plot line to 0
    line.append((y+1,0))
    verts.append(line)

colors = [mcolors.to_rgba('r', alpha=0.6),
          mcolors.to_rgba('g', alpha=0.6),
          mcolors.to_rgba('b', alpha=0.6)]

# only plot x-lines which have more than one observation
# so north is "farther away" than south flip the x-values
zs = [-1*a for a in x_keys if len(heights[a])>1]
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.set_title("Profile of Terrain Surface along "
             "Observation Lines", fontsize=16)
ax.set_xlabel('North-South distance (ft)')
ax.set_ylabel('East-West distance (ft)')
ax.set_zlabel('Depth from Datum')
poly = PolyCollection(verts, facecolors=colors)
ax.add_collection3d(poly, zs=zs, zdir='x')
# so north is "farther away" than south flip the x-values
ax.set_xlim3d(-1*max(x_keys),min(x_keys))
ax.set_ylim3d(min(y_array), max(y_array))
ax.set_zlim3d(min(z_array), max(z_array))
fig.show()

Putting the Verification Together

Seeing the graphs combined tells us the surface “shape” and allows us to see each observation made. We can leverage matplotlib’s ability to plot multiple plots on the same graph to achieve our final verification output:

Combined Image of Scatter Plot Points, X-Axes Polygons and Y-Axis Polygon

Thanks to the matplotlib polygon demo code for helping me figure this graph type out.

Generating a Topographic Map

Now, to convey the total surface to a geotechnical engineer, I wanted to provide a topographic map. While the points are useful and one can figure out how much total fill is needed, there are surface irregularities hiding among the data which are not readily apparent.

Drawing Smooth Curves

One challenge in generating a topographic map is that one needs isopleths, which are continuous lines and not simply discrete points. Here we can leverage interpolation to connect our discrete points, generating smooth curves. I used an interpolation function from matplotlib’s triangular grid module and picked an interpolation class which visually fit my understanding of our lightly sampled land. Had I needed to implement such an interpolation method myself, I would have fallen back onto various spline drawing techniques; I am not sure of the internals of matplotlib’s. More can be found in their documentation on the CubicTriInterpolator class.

The challenge in using the interpolation class was that we again needed a different layout of our observed data. This time we needed a regular grid on which to lay our data. This was quite easy to generate, and I picked a grid running up to my maximum observation along each axis with a spacing of one unit (equal to one foot) on each axis. This was achieved via the numpy methods:

import numpy as np

x_grid = np.arange(0, max(x_array), 1)
y_grid = np.arange(0, max(y_array), 1)
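One detail worth keeping in mind with this kind of grid: np.arange excludes its stop value, so a grid built from 0 to the maximum ends one step short of the largest observation. If the last row should be part of the grid, the stop can be pushed out by one step. The maximum below is a made-up number used only to show the difference.

```python
import numpy as np

max_x = 5  # hypothetical largest observed x value, in feet

grid_exclusive = np.arange(0, max_x, 1)      # stops before max_x
grid_inclusive = np.arange(0, max_x + 1, 1)  # includes max_x

print(grid_exclusive)
print(grid_inclusive)
```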

I could again provide the x, y and z arrays to the interpolation functions, and this gave me back a large two-dimensional numpy array of interpolated values: a line along the x-axis for each y-value on my grid.

# interpolate the data between our observation lines
import matplotlib.tri as tri
triang = tri.Triangulation(x_array, y_array)
# Cubic looks more like the understood lay of the land
# interpolator = tri.LinearTriInterpolator(triang, z_array)
interpolator = tri.CubicTriInterpolator(triang, z_array,
                                        kind='min_E',
                                        trifinder=None,
                                        dz=None)
Xi, Yi = np.meshgrid(x_grid, y_grid)
zi = interpolator(Xi, Yi)

# plot contour surface
fig = plt.figure()
fig.suptitle(f"Contour Surface of {cu_yrds_fmt_interp}cu. yrds.\nFill(Cut) Needed for Level", fontsize=16)
plt.xlabel('North-South distance (ft)')
plt.ylabel('East-West distance (ft)')
levels = 15
cs = plt.contourf(-1*x_grid, y_grid, zi, levels)
cbar = fig.colorbar(cs, format='%1.1f ft')
plt.show()

# plot contour plot (2D topomap)
fig = plt.figure()
fig.suptitle(f"Contour Plot of {cu_yrds_fmt_interp}cu. yrds.\nFill(Cut) Needed for Level", fontsize=16)
levels = 10
cs = plt.contour(-1*x_grid, y_grid, zi, levels, linewidths=0.5)
plt.xlabel('North-South distance (ft)')
plt.ylabel('East-West distance (ft)')
plt.clabel(cs, cs.levels[:levels], fmt='%1.1f ft', inline=True, fontsize=10)
plt.show()

Of Numerical Integration and Estimating Fill Quantity

One key result I hoped to arrive at was the expected quantity of fill I would need to bring my building pad to level. To do this one can use a technique called numerical integration. A simple technique is the trapezoidal rule, which numpy thankfully implements. Here, we take the values for each line along the x-axis to determine how much area there is under that curve in two dimensions, and multiply it by the distance to the next line of observations to get a three-dimensional value: how many cubic yards of fill I need to acquire.

# use the observed values making each slice as thick
# as the space to the next observation slice
# hold tuple of (x_val, integral)
x_integrals=[]
for k in x_keys:
  y_keys = list(heights[k].keys())
  y_keys.sort()
  # zero measures that are cut slopes (negative fills)
  non_neg_values = [max(heights[k][v],0) for v in y_keys]
  # np.trapz takes the function values first, then the sample positions
  one_d_integration = np.trapz(non_neg_values, y_keys)
  # only record integrations and x values that are non-zero
  if one_d_integration > 0:
      x_integrals.append((k, one_d_integration))

# find distance between x-rows
x_rows = [x_val for (x_val,integral) in x_integrals]
row_widths = list(np.diff(x_rows))
# get distance to last observation row
row_widths.append(max(x_keys) - max(x_rows))

# number of cubic yards of fill needed (27 cubic feet per cubic yard)
cubic_yards = sum(np.array(row_widths) * [integral for (x_val,integral) in x_integrals])/27
# with commas
cu_yrds_fmt = "{:,}".format(int(cubic_yards.round()))
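The trapezoidal rule itself is short enough to write out by hand, which can help when sanity-checking the result. The following is a minimal sketch in plain Python, taking the depths first and their positions second (the same order np.trapz expects). The profile, the slice thickness and the helper name are hypothetical; only the 13 yard truck size comes from the text above.

```python
import math


def trapezoid(values, positions):
    """Area under the piecewise-linear curve through (position, value) pairs."""
    total = 0.0
    for i in range(1, len(values)):
        width = positions[i] - positions[i - 1]
        total += width * (values[i] + values[i - 1]) / 2.0
    return total


# Hypothetical observation line: fill depth (ft) at East-West distances (ft)
positions = [0, 10, 20]
depths = [1.0, 2.0, 1.0]

area_sq_ft = trapezoid(depths, positions)   # 10*1.5 + 10*1.5 = 30.0
slice_thickness_ft = 18                     # made-up distance to the next line
cubic_yards = area_sq_ft * slice_thickness_ft / 27  # 27 cubic feet per cubic yard
loads = math.ceil(cubic_yards / 13)         # 13 yard dump truck

print(area_sq_ft, cubic_yards, loads)  # 30.0 20.0 2
```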

A very neat thing is that we can also use a similar process to integrate over our interpolated data, which is smoother than our few points of observation. Here the difference is about two of my tandem-axle dump trucks full out of at least 152 loads.

If you would like to see the whole file and all the data and code necessary to make this example, feel free to take a look at the repository of code and data here. It can be run with Python 3, matplotlib 3.1.1 and any modern numpy.

Posted in Uncategorized | Tagged python, surveying, topomap

Debugging HBase Unit Tests

By Clay on November 13, 2016

This is likely an obvious process for those who use IDEs and develop in Maven daily, but for those who do operations or otherwise need to work on the JUnit tests in HBase only infrequently, here’s how I worked when submitting a patch for HBASE-16700.

First, create your code:

Here I was adding a MasterObserver coprocessor to HBase, so I could work relatively easily writing my code as it was simply one class. I was able to do the following — very crude — workflow:

- Add the following to my HBase master’s hbase-site.xml:
  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
- export CLASSPATH=$(hbase classpath)
- vi <my code>.java
- javac <my code>.java
- Copy my class file into my HBase master’s lib directory
- Restart my HBase master

Next, create a test:

This is the part that is novel to operators: you simply need to create a file under the relevant directory for the feature you are committing, but in traditional Java fashion it will be under src/test while your feature will go under src/main. HBase has some guidelines on writing a test. Similarly, a useful class for writing HBase-server tests which need a minicluster is HBaseTestingUtility. Remember to write positive and negative tests (prove that your code does what you expect and handles unexpected operations gracefully).

Testing your test

To test your test you can ask Maven to run a build and test just your test class via the following: mvn -X test '-Dtest=org.apache.hadoop.hbase.security.access.TestCoprocessorWhitelistMasterObserver'.

Now, the -X is not necessary; it runs Maven in debug mode, which is useful here: to see the log output of your test you will want to run it standalone, and Maven in debug mode will give you the proper incantation with the classes it built. You will see a line akin to the following while your test is forked off: Forking command line: /bin/sh -c cd hbase/hbase-server && /usr/lib/jvm/jdk1.8.0_101/jre/bin/java -enableassertions -Dhbase.build.id=2016-11-13T22:53:41Z -Xmx2800m -Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -jar hbase/hbase-server/target/surefire/surefirebooter5454815236698078750.jar hbase/hbase-server/target/surefire/surefire4890497615179486565tmp hbase/hbase-server/target/surefire/surefire_09143864480388952525tmp Running org.apache.hadoop.hbase.security.access.TestCoprocessorWhitelistMasterObserver

This line is useful as you can copy-and-paste it to run your test manually. A particularly useful feature is seeing log output. But also with this line you can attach a debugger too!

Attaching a debugger

To attach a debugger, one needs to launch Java with some options for it to wait until the debugger attaches. I was using the particular incantation: export JAVA_DEBUG='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000'. I would simply add $JAVA_DEBUG just after the java executable in the forked command line. Here we ask the process to listen on port 8000 for the debugger to connect (and as one could guess, suspend=n will not wait for a debugger to connect).
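If one does this often, the splice can be scripted. The following is only a sketch: the helper name is hypothetical, the sample command is a shortened stand-in for the real forked line, and it assumes a plain java invocation rather than the full /bin/sh -c wrapper that Maven prints.

```python
import shlex

JAVA_DEBUG = ("-Xdebug -Xrunjdwp:transport=dt_socket,"
              "server=y,suspend=y,address=8000")


def add_debug_flags(command, debug_flags=JAVA_DEBUG):
    """Insert the debug flags right after the java executable in a command."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token == "java" or token.endswith("/java"):
            # splice the flags in directly after the executable
            tokens[i + 1:i + 1] = shlex.split(debug_flags)
            return " ".join(shlex.quote(t) for t in tokens)
    raise ValueError("no java executable found in command")


# Shortened, hypothetical stand-in for the surefire command line
cmd = ("/usr/lib/jvm/jdk1.8.0_101/jre/bin/java -enableassertions "
       "-jar surefirebooter.jar")
print(add_debug_flags(cmd))
```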

Java ships a command-line debugger (jdb) but it has no command line history or class tab completion which is a pain. I used Andrew Pimlott’s rlwrap-jdb to provide these features. I could spin up a debugger with: CLASSPATH=hbase/hbase-server/target/test-classes/org/apache/hadoop/hbase/security/access/:hbase/hbase-server/target/ ./list-java-breakpoints 2>/dev/null > breakpoints_file && ./rlwrap-jdb --breakpoints-file breakpoints_file jdb -attach 8000.

Running a debugger on an already running process

As a side-note, getting familiar with the debugger is quite useful, as one can use this on production systems to inspect an already running Java daemon. From the JPDA Connection and Invocation documentation one can track down a number of Java debugger connector processes. The useful one for an already running process is the SA PID Attaching Connector run via jdb -connect sun.jvm.hotspot.jdi.SAPIDAttachingConnector:pid=<pid>.

Similarly, today I often take jmap -dump:format=b,file=<filename> dumps of misbehaving Java processes for later analysis with jhat but figure in the future I should perhaps investigate using sun.jvm.hotspot.jdi.SACoreAttachingConnector on core files of the misbehaving process to get a different view of the world.

Posted in HBase, Java

Finding HBase Region Locations

By Clay on June 8, 2015

HBase Region Locality

HBase provides information on region locality via JMX per region server via the hbase.regionserver.percentFilesLocal metric. However, there is a challenge when running a multi-tenant environment or doing performance analysis. This percentage of local files is for the entire region server, but of course each region server can serve regions for multiple tables. And further, each region can be made up of multiple store files, each with their own location.

If one is doing a performance evaluation for a table, these metrics are not sufficient!

How to See Each Region

To see a more detailed breakdown, we can use HDFS to tell us where a file’s blocks live. Further, we can point HDFS to the files making up a table by looking under the HBase hbase.rootdir and build up a list of LocatedFileStatus objects for each file. Nicely, LocatedFileStatus provides getBlockLocations() which can provide the serving hosts for each HDFS block.

Lastly, all we need to do is correlate which region servers have local blocks for regions they are serving; now we can come up with a table locality percentage.
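The correlation step can be sketched independently of HBase. Below, a nested dictionary stands in for the information gathered from HBase and HDFS: each region has a serving host, and each store file has a list of host sets, one per HDFS block. The region names, file names and hosts are hypothetical, and the function is an illustration of the idea rather than the code used in this post.

```python
def locality(regions):
    """Fraction of HDFS blocks that have a replica on the serving host."""
    total = 0
    local = 0
    for region in regions.values():
        host = region["host"]
        for store_files in region["cfs"].values():
            for block_hosts in store_files.values():
                # block_hosts is a list with one set of hosts per block
                for hosts in block_hosts:
                    total += 1
                    if host in hosts:
                        local += 1
    return local / total if total else None


# Hypothetical cluster state
regions = {
    "region-a": {
        "host": "r1n1.example.com",
        "cfs": {"f1": {"hfile-1": [
            {"r1n1.example.com", "r1n2.example.com"},   # local block
            {"r1n2.example.com", "r1n3.example.com"},   # non-local block
        ]}},
    },
    "region-b": {
        "host": "r1n2.example.com",
        "cfs": {"f1": {"hfile-2": [
            {"r1n2.example.com", "r1n3.example.com"},   # local block
        ]}},
    },
}

print(locality(regions))  # two of three blocks are local
```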

Implementation

One can do nifty things in the HBase shell as it is really a full JRuby shell. Particularly, one can enter arbitrary Java to run, which works great for debugging or running performance tests. The following is the needed JRuby, which can be saved to a file and executed via hbase shell <file name> or simply copied and pasted into the shell.

require 'set'
require 'pathname'
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.io.Text
 
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.util.NoSuchElementException
import java.io.FileNotFoundException
 
# Return a Hash of region UUIDs to hostnames with column family stubs
#
# tableName - table to return regions for
#
# Example
# getRegionUUIDs "TestTable"
# # => {"3fe594363a2c13a3550f752db147194b"=>{"host" => "r1n1.example.com", "cfs" => {"f1" => {}, "f2" => {}},
#       "da19a80cc403daa9a8f82ac9a1253e9d"=>{"host" => "r1n2.example.com", "cfs" => {"f1" => {}, "f2" => {}}}}
#
def getRegionUUIDs(tableName)
  c = HBaseConfiguration.new()
  tableNameObj = TableName.valueOf(tableName)
  t = HTable.new(c, tableNameObj)
  regions = t.getRegionsInRange(t.getStartKeys[0],
                                t.getEndKeys[t.getEndKeys.size-1])
  # get all column families -- XXX do all regions have to host all CF's?
  cfs = HTable.new(c, tableNameObj).getTableDescriptor.getFamilies().map{ |cf| cf.getNameAsString() }
 
  r_to_host = regions.map{|r| [r.getRegionInfo().getEncodedName(), Hash["host" => r.getHostname(), "cfs" => Hash[cfs.map{|cf| [cf, Hash.new()] }]]] }
 
  Hash[r_to_host]
end
 
def findHDFSBlocks(regions, tableName)
  # augment regions with HDFS block locations
  augmented = regions.clone
  c = HBaseConfiguration.new()
  fs = FileSystem.newInstance(c)
  hbase_rootdir = c.select{|r| r.getKey() == "hbase.rootdir"}.first.getValue
  tableNameObj = TableName.valueOf(tableName)
  nameSpace = tableNameObj.getNamespaceAsString
  baseTableName = tableNameObj.getQualifierAsString
  # use the default namespace if none given
  nameSpace = "default" if nameSpace == tableName
 
  regions.each do |r, values|
    values["cfs"].keys().each do |cf|
      rPath = Path.new(Pathname.new(hbase_rootdir).join("data", nameSpace, baseTableName, r, cf).to_s)
      begin
        files = fs.listFiles(rPath, true)
      rescue java.io.FileNotFoundException
        next
      end
 
      begin
        begin
          fStatus = files.next()
          hosts = fStatus.getBlockLocations().map { |block| Set.new(block.getHosts().to_a) }
          augmented[r]["cfs"][cf][File.basename(fStatus.getPath().toString())] = hosts
        rescue NativeException, java.util.NoSuchElementException
          fStatus = false
        end
      end until fStatus == false
    end
  end
  augmented
end
 
def computeLocalityByBlock(regions)
  non_local_blocks = []
  regions.each do |r, values|
    values["cfs"].each do |cf, hFiles|
      hFiles.each do |id, blocks|
        blocks.each_index do |idx|
          non_local_blocks.push(Pathname.new(r).join(cf, id, idx.to_s).to_s) unless blocks[idx].include?(values["host"])
        end
      end
    end
  end
  non_local_blocks
end
 
def totalBlocks(regions)
  regions.map do |r, values|
    values["cfs"].map do |cf, hFiles|
      hFiles.map do |id, blocks|
        blocks.count
      end
    end
  end.flatten().reduce(0, :+)
end
 
tables = list
tables.each do |tableName|
  puts tableName
  begin
    regions = getRegionUUIDs(tableName)
    hdfs_blocks_by_region = findHDFSBlocks(regions, tableName)
    non_local_blocks = computeLocalityByBlock(hdfs_blocks_by_region)
    total_blocks = totalBlocks(hdfs_blocks_by_region)
    puts non_local_blocks.length().to_f/total_blocks if total_blocks > 0 # e.g. if table not empty or disabled
  rescue org.apache.hadoop.hbase.TableNotFoundException
    true
  end
end

One will get output of the form table-name, newline, then a float between 0.0 and 1.0. Note that, as written, the float is the fraction of blocks which are non-local (non_local_blocks divided by total_blocks), so the locality is one minus that value. Should the table be offline, deleted (TableNotFoundException), an HDFS block moved, etc. the exception will be swallowed. In the case of a table not being calculated, no float will return in the output (line simply skipped); in the case of HDFS data not being found, the locality computation will assume that block to be non-local.

Post-Script

Some nice follow-on work to make this data into a useful metric might be to augment it with the size of the blocks (in records or bytes) and determine a locality percentage based on size, not only block count. Further, for folks using stand-by regions, breaking out locality of replicated blocks may be important as well.

Posted in HBase, JRuby

Map/Reduce diff(1)

By Clay on June 8, 2015

This has sadly been a draft for years, so time to release it…

`diff`(1)

For those who use Unix, you have likely come across two files and wanted to see what was different between the two. Certainly, one can compare size (highly inaccurate), use a hash function (if a strong cryptographic hash, it will be accurate — but very information free) or one can use the obvious choice, diff(1). One usually gets output like

$ cat << EOF > one
foo
blah
baz raz
has
EOF
$ cat << EOF > two
blah
yar raz
has
EOF
$ diff one two
1d0
< foo
3c2
< baz raz
---
> yar raz

Here we see that the left file (file one) has an extra entry on line one and line three differs between the two files. Further, we can see that the algorithm matched lines, as blah was matched between the files despite the leading foo in file one.

Map/Reduce

Map/Reduce gained visibility after Google’s initial publication and certainly now that Hadoop has gained significant adoption. For my work, I mostly use Apache Pig which is a high-level language which compiles down to a map/reduce plan and runs on Hadoop Map/Reduce, Apache Tez and Apache Spark.

There are UDF approaches (such as the Pig built-in DIFF). The built-in DIFF does have one flaw for this work, in that it only accepts two bags (an unordered collection of tuples) and, as each set of data would be a bag, each file must fit into a container’s memory, which is not efficient for differencing two large files.

For implementing code to generate a difference, I settled on two easy ways to operate. One was a UNION based approach, the other was a JOIN based approach. Both get the data from each file into one Pig data-structure (a relation); however, the approaches differ dramatically in the row size of the relation.

Despite data size differences the run time performance a number of years ago was roughly parallel using Hadoop Map/Reduce. I found on 2012 hardware it took 10 minutes to difference over 200GB (1,055,687,930 rows) using LZO compressed input with 18 nodes. Further, each approach only takes one Map/Reduce cycle.

Also, one has to decide the quality of diff one would like; options range from line-numbers enumerating the records (lines) before the join if one were beginning a context-diff implementation to something as simple as should a match be reported or the count of matches (if a line is duplicated in a single source).

Simply, unlike the Unix diff(1) tool, order is not important; effectively the JOIN approach performs sort -u <foo.txt> | diff while UNION performs sort <foo> | diff.
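The difference between the two semantics can be shown in a few lines outside of Pig. This sketch is an analogy rather than the Pig implementation: the JOIN-style comparison treats each file as a set of lines, while the UNION-style comparison treats each file as a multiset and so notices differing duplicate counts. The function names are hypothetical, and the duplicated "has" line is added to the earlier example files purely to show the difference.

```python
from collections import Counter


def join_style_diff(first, second):
    """Set semantics: duplicates collapse, like sort -u before diffing."""
    a, b = set(first), set(second)
    return sorted(a - b), sorted(b - a)


def union_style_diff(first, second):
    """Multiset semantics: surplus copies of a line are reported."""
    a, b = Counter(first), Counter(second)
    return sorted((a - b).elements()), sorted((b - a).elements())


one = ["foo", "blah", "baz raz", "has", "has"]  # "has" duplicated for the demo
two = ["blah", "yar raz", "has"]

print(join_style_diff(one, two))   # (['baz raz', 'foo'], ['yar raz'])
print(union_style_diff(one, two))  # (['baz raz', 'foo', 'has'], ['yar raz'])
```

In both cases the original line order is ignored, which is the point of the comparison with sort above.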

Implementation

UNION

The UNION operator in Pig is like the SQL UNION operator. For differencing, one only needs to augment each file’s data with the data’s source, group and then count sources to find matches. While more lines of code than a JOIN approach, one can easily add in more metadata to each line (such as if the line is duplicated in each file but of a different quantity of replication).

Code

SET job.name 'Diff(1)'

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

JOIN

One can perform a difference via an outer join as well. Here one has a more compact expression to achieve the desired results: a FULL OUTER join, followed by a split on the NULL side, keeps only the records (lines) which appear in one file but not the other; then one can return the results to report the asymmetry. The JOIN approach does collapse duplicates (so, if one file has more duplicates than the other, this approach will not output the duplicate).

Code

SET job.name 'Diff(1) Via Join'

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

Reference

The original Stack-Overflow question I kept seeing and finally wrote an answer for.
A nice set-theoretic write-up about DIFF in Pig: Jacob Perkins on “Re: comparing two files using pig”

Posted in Hadoop, Pig

CloudStack

By Clay on October 22, 2012

Compute Infrastructure-as-a-Service

Today’s software development world is hosted on massive computing machines — lots of memory, lots of disk space, lots of CPU power. However, software development and testing is still often done at small scale; developers use vi, run unit tests running in python and run build scripts written for ant and mvn. How can one best use these massive machines for their development at small scale and still run tests on them at large scale, when necessary?

Unless you’ve been living in a cave the last few years, virtualization has been firmly burned in your mind by IT marketing material. In particular, taking those massive machines and cutting them up into many smaller virtual machines is the solution converged on by much of the industry. I agree! And, here are my notes on how I moved my group into this era.

CloudStack

Enter Citrix, Cloud.COM and now the Apache Foundation; CloudStack is an incubating Apache project. CloudStack is a very slick application which effectively implements Amazon’s EC2 UI, API and features — including a nice web front-end for starting and managing your VMs, storage and networks. However, as with all new software and certainly a piece of software which is as complex as a data center in a box, there are bugs and lots of knobs to turn for configuration.

Setup

I took a very conservative approach to working with CloudStack. I only needed many VMs running on the same network, with little isolation and merely workable storage space and reliability. I do not need very high performance or high reliability; I only needed to slice up a few machines in the same physical datacenter. Further, I am currently using CloudStack 3.0.2 on CentOS 6.3 with the KVM hypervisor; as CloudStack development moves VERY quickly, I expect their upcoming 4.0 release will be very different and, as the OS vendors do not stand still, I’m sure a different RedHat-based distro or even CentOS version would be quite different.

I followed the CloudStack Quick Install guide and set up a Basic Zone. As I did want my VMs reachable from the outside (CloudStack that is) world, I needed to ensure I selected a network offering supporting Security Groups (DefaultSharedNetworkOfferingWithSGService).

Further, I reused another machine I had handy with a terabyte of disk space as my NFS server but did enable local storage for user VMs to stretch all the disks I could get at. I used one IP network for my management and guest networks. (I do hope to get the machines running bonded 1GigE soon for their physical connections though.) Simply to reduce IP usage, were I to do it again, I would have used a second, non-routable network (RFC1918 address space) and set up the management server to act as a NAT box to my broader network.

Configuration

CloudStack is much more centralized than some other infrastructure-as-a-service cloud offerings. One only needs to understand a few roles and daemons to understand the major touch points to CloudStack:

Management Server
- This runs the Tomcat server which hosts the UI and does most of the coordination activities amongst the various CloudStack components.
- /etc/init.d/cloud-management
- /var/log/cloud/management/management-server.log
- /var/log/cloud/management/catalina.out
Usage-Server
- This is the usage server which collects metrics from CloudStack for external analysis (e.g. billing)
- /etc/init.d/cloud-usage
- /var/log/cloud/usage/usage.log
Agent
- This runs on the CloudStack machines which host guest VMs.
- /etc/init.d/cloud-agent
- /var/log/cloud/agent/agent.log

This centralization of components makes configuration and debugging an easier process but still managing all the how-to documents for a system as big as CloudStack became a bit daunting; below are my most used how-to’s and pitfalls which I ran across.

Agent Reboots

One very confusing issue arose when I initially set up my compute hardware as CloudStack agents. They would immediately reboot, and keep rebooting! (This was not behavior I was expecting.) This taught me to check the logs early and check the logs often, as I found (in /var/log/cloud/agent/agent.log):

2012-10-09 16:18:50,466{GMT} WARN  [resource.computing.KVMHAMonitor] (Thread-27:) write heartbeat failed: Failed to create /mnt/031d9475-063d-30b5-b910-7ee710ff81b0/KVMHA//hb-172.20.7.136; reboot the host

Luckily, others had been here before. The fix was nicely documented and ever so easy: sed -i 's/reboot/#reboot/g' /usr/lib64/cloud/agent/scripts/vm/hypervisor/kvm/kvmheartbeat.sh. It also showed me an invaluable setting to enable outputting the DEBUG messages from the CloudStack agent: sed -i 's/INFO/DEBUG/g' /etc/cloud/agent/log4j-cloud.xml
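Since checking the logs early and often paid off, here is a minimal Python sketch for pulling the WARN and ERROR entries out of an agent or management log. The regular expression is only modelled on the log lines quoted in this post (the optional `{GMT}` suffix and the field layout are assumptions from those samples), and `parse_line` and `find_problems` are names I made up for illustration.

```python
import re

# Pattern modelled on the log lines quoted in this post; the optional
# "{GMT}" suffix and the field order are assumptions from those samples.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})(?:\{\w+\})?\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def find_problems(lines, levels=("WARN", "ERROR")):
    """Yield parsed entries whose level is one of `levels`."""
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in levels:
            yield entry
```

To use it, one would open /var/log/cloud/agent/agent.log and pass the file object to `find_problems`; lines which do not match the pattern (such as Java stack traces) are simply skipped.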

Agent dies at start with: `Unable to start agent: Unable to find the guid`

Next, I had issues with starting the agent. The wizard or I would run cloud-setup-agent and then checking /etc/init.d/cloud-agent status would show the agent dead. This too was an easy fix which again someone else had documented. One simply needs to add the following to their /etc/cgconfig.conf and restart their cgconfig service:

group virt {
  cpu {
    cpu.shares = 9216;
  }
}

Set your hostname

While the Quick Install Guide says to ensure your hostname is set (e.g. checked via the hostname --fqdn command), also ensure that you have /etc/hosts and /etc/sysconfig/network set with your fully-qualified hostname. One error you may see can be found in the ever helpful CloudStack forum.
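As a quick sanity check before running the agent setup, a small Python sketch like the following can report what the machine believes its fully-qualified name to be. The helper names are my own, and the "at least two labels, not localhost" rule is just an assumption about what a usable name looks like; it is not something CloudStack itself checks this way as far as I know.

```python
import socket


def looks_fully_qualified(name):
    """True if `name` has at least two non-empty labels and is not a localhost alias."""
    labels = name.rstrip(".").split(".")
    if len(labels) < 2 or not all(labels):
        return False
    return labels[0] != "localhost"


def check_local_hostname():
    """Return (fqdn, ok) for this machine, using the resolver's view of the hostname."""
    fqdn = socket.getfqdn()
    return fqdn, looks_fully_qualified(fqdn)
```

Calling `check_local_hostname()` on each host gives a result comparable to hostname --fqdn; a False in the second position suggests /etc/hosts or /etc/sysconfig/network still needs attention.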

Automatic VM Password Generation

To add password generation and reset support to your own templates, you can follow the instructions for CloudStack 4.0; I have tested the Linux script, at least. (There is also the ability to use ssh key-pairs, like Amazon EC2 does. I have not yet tried that, but it is well documented, if not supported by the UI.)

LDAP

Setting up LDAP for CloudStack is quite easy but it requires doing some setting outside the UI, and with the API, as documented in the instructions (or original). (There are some notes on using port 8096, as the documentation does.) There is also one bug CS-14680 which has to be worked around, as the LDAP authentication does not use MD5 hashing like the built-in MySQL authentication does.

Due to CS-14680, if you need to allow authentication against both LDAP and the built-in MySQL, then a bit of HTML changes are necessary too. The changes are documented in CS-16325.

Lastly, as one needs to set up the accounts for CloudStack to use from LDAP, there is a Ruby script which can synchronize your LDAP server to CloudStack. But remember, if setting up accounts from an LDAP server which might control sensitive services (e.g. Active Directory, in my case), you will likely want to use SSL on your Management Server, so that passwords are encrypted.

Usage

Cleanly Restart a Host

If you need to restart one of your VM hosting machines, there is a bit more forethought required than one would normally have for a Linux box. The steps are:

Mark it in maintenance
Then, restart
Mark it as available

If a machine is not properly shutdown:

Get it back online by toggling the maintenance state of the host
Look at zone’s system VMs — they might be in wedged starting state and need to be unstuck
1. May need to enable/disable zone
2. Restart the management server
3. Ensure the VMs are not running on the host they claim (using virsh) and set them to stopped in MySQL
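The maintenance steps above can also be driven through the API rather than the UI. Below is a minimal Python sketch of the request-signing scheme as I understand it from the CloudStack developer documentation (URL-encode the values, sort the fields, lowercase the string, HMAC-SHA1 it with the secret key, then base64 and URL-encode the result). The function name `signed_query` is hypothetical, and the command names `prepareHostForMaintenance` and `cancelHostMaintenance` should be checked against the API reference for your release before relying on them.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_query(params, api_key, secret_key):
    """Build a signed CloudStack-style query string (a sketch, not verified on 3.0.2)."""
    # Add the fields every signed request carries.
    fields = dict(params, apikey=api_key, response="json")
    # URL-encode each value and sort the pairs by field name.
    query = "&".join(
        "{}={}".format(key, quote(str(value), safe=""))
        for key, value in sorted(fields.items(), key=lambda kv: kv[0].lower())
    )
    # Sign the lowercased query string with HMAC-SHA1 and base64-encode it.
    digest = hmac.new(
        secret_key.encode("utf-8"), query.lower().encode("utf-8"), hashlib.sha1
    ).digest()
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return "{}&signature={}".format(query, signature)


# Hypothetical usage: the host id and keys below are placeholders.
enter = signed_query(
    {"command": "prepareHostForMaintenance", "id": "host-uuid"}, "KEY", "SECRET"
)
leave = signed_query(
    {"command": "cancelHostMaintenance", "id": "host-uuid"}, "KEY", "SECRET"
)
```

The resulting strings would be appended to the management server's API URL; the exact URL depends on your installation, so I have left it out.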

Storage Migration

One very cool feature of CloudStack is that you can migrate your VMs (live!) and you can migrate the storage they are running upon too (storage migration). This is especially useful if using local storage and needing to move a VM off for host maintenance; but beware, there is a good performance optimization which is left to be made to lessen the load on secondary storage when doing a storage migration.

Local Storage

Using local storage is of huge help if your infrastructure does not have much shared storage. However, if you are like me, it is easiest to create templates which have relatively small root disks (say 20GB) but for many needs, you will then need to attach the bulk of the storage as an extra volume. While there is a check box to have a system’s root disk be local, there is no equivalent for a disk offering (for making said extra volumes).

I tried to implement local storage disk offerings by using storage tags. I set the tag “LOCAL” on the local primary storage pools and made a disk offering requiring the volumes to be made only on pools with the tag “LOCAL”, but this failed. I could create the volume (but that only makes a database record in CloudStack and does not actually pick out storage); when I attached the storage to the VM (and CloudStack would actually create the volume), it failed. I got:

2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Checking if storage pool is suitable, name: cloud0.domain ,poolId: 211
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Is localStorageAllocationNeeded? false
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Is storage pool shared? false
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) StoragePool is not of correct type, skipping this pool
2012-10-18 22:27:45,385 DEBUG [storage.allocator.FirstFitStoragePoolAllocator] (Job-Executor-95:job-524) FirstFitStoragePoolAllocator returning 0 suitable storage pools
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitable pools found for volume: Vol[142|vm=106|ROOT] under cluster: 6
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitable pools found
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitablestoragePools found under this Cluster: 6
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) Could not find suitable Deployment Destination for this VM under any clusters, returning.

This was annoying: the volume I explicitly wanted local, CloudStack was trying to make shared and was ruling out the local pools! Thankfully, someone who was trying to run their cloud without any shared storage had hit upon a solution in CS-11840. (Despite the original filer claiming this failed, UPDATE disk_offering SET use_local_storage = 1 WHERE display_text LIKE "%LOCAL%"; worked for me on 3.0.2.) This did not solve the whole problem immediately, however, as I was trying with local VMs and local storage and getting the following error:

2012-10-18 16:11:51,537 DEBUG [cloud.async.AsyncJobManagerImpl] (http-6443-exec-4:null) submit async job-440, details: AsyncJobVO {id:440, userId: 3, accountId: 3, sessionKey: null, instanceType: Volume, instanceId: 104, cmd: com.cloud.api.commands.AttachVolumeCmd, cmdOriginator: null, cmdInfo: {"response":"json","id": "f0089a1b-32f4-4e49-89fa-0dfe0935b4b4","sessionkey":"r2D/wVCutA/UwOkAXxtfzSDjU7o\u003d","ctxUserId":"3","virtualMachineId":"fe907255-063a-4e72-95a3-43abe53f1867 ","_":"1350591111196","projectid":"6c8ef680-752f-47d4-a0a1-fe9d68197a18","ctxAccountId":"3","ctxStartEventId":"4420"}, cmdVersion: 0, callbackType: 0, callbackAddress: null, status: 0, processStatus: 0, resultCode: 0, result: null, initMsid: 964251491601, completeMsid: null, lastUpdated: null, lastPolled: null, created: null}
2012-10-18 16:11:51,538 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-11:job-440) Executing com.cloud.api.commands.AttachVolumeCmd for job-440
2012-10-18 16:11:51,721 INFO  [cloud.api.ApiDispatcher] (Job-Executor-11:job-440) Please specify a volume that has been created on a shared storage pool.

But once I realized that I do not care if the small root disk is shared, as long as the massive storage volume is local, I had success with shared storage VMs having local volumes.
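Before running an UPDATE like the one above against a live database, it can be reassuring to try the statement on a throwaway copy. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL; the two-column `disk_offering` table is a cut-down assumption based only on the column names in the statement above (the real table has many more columns), and the function names are my own.

```python
import sqlite3


def mark_local_offerings(conn, pattern="%LOCAL%"):
    """Flag disk offerings whose display text matches `pattern` as local storage."""
    cursor = conn.execute(
        "UPDATE disk_offering SET use_local_storage = 1 WHERE display_text LIKE ?",
        (pattern,),
    )
    conn.commit()
    return cursor.rowcount  # number of rows the UPDATE changed


def demo():
    """Run the update against an in-memory table with made-up offerings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE disk_offering ("
        "id INTEGER PRIMARY KEY, display_text TEXT, "
        "use_local_storage INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO disk_offering (display_text) VALUES (?)",
        [("Small shared",), ("Big LOCAL volume",)],
    )
    changed = mark_local_offerings(conn)
    rows = conn.execute(
        "SELECT display_text, use_local_storage FROM disk_offering ORDER BY id"
    ).fetchall()
    conn.close()
    return changed, rows
```

Only the offering with LOCAL in its display text is flagged, which is why naming the offerings consistently matters for this trick.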

Deleting a Zone

If you need to delete a zone, you will want to make sure to follow the correct steps to delete the zone or it is possible to get the zone wedged in an un-deletable state (e.g. CS-14297: [Can] not delete primary storage without going into the database). The steps and correct order are outlined in CS-15991.

Troubleshooting

Much troubleshooting is done via investigating the MySQL database which underlies CloudStack; sometimes things get wedged enough that they require changes. The database schema is very easy to understand. While one needs to be restrained in modification (as referential integrity can be compromised, causing confusion, if a row is removed or an incorrect ID is entered), there are constraints running around to try and prevent errant state.

Storage Issues

Storage issues can be some of the most insidious issues one will encounter with CloudStack. Errors can be bizarre! Issues I ran across with templates alone:

Templates are listed in the UI under “Templates” but not visible when I go to create a VM
- Usually you can see that the template is still downloading or had an error when clicked on in the UI under templates.
- These issues can often be resolved by verifying the Secondary Storage VM (SSVM) is working okay. First, make sure your SSVM even started by going to Infrastructure->Zones->System VMs and ensuring your SSVM is “Running”. Luckily, there is a nice write-up on how to check the SSVM for its other common sicknesses.
- Deleting a template fails, or a template is wedged with “Failed post download script: Checksum failed, not proceeding with install” (fixed in CS-14555). This happened to me on the built-in CentOS 5.6 VM template, which I simply wanted to remove, but since it was wedged there was no UI option to remove it. As such, I followed the (now slightly outdated) steps in a forum thread.
Uploading a template/ISO fails
- Got “Connection Refused” as a template status when trying to upload; this is by design, as templates can only be uploaded from accepted sites. I simply had to change the secstorage.allowed.internal.sites configuration variable to allow the host.
- Got “Please specify a valid qcow2”: uploading a template may fail due to the file name not ending in .qcow2.
- Trying to upload an ISO kept failing for me reporting java.lang.IllegalStateException: java.lang.IllegalStateException: unsupported protocol: 'ftp' with an http:// URL, so I filed a bug CLOUDSTACK-370 which awesomely got two responses before I could even git clone the CloudStack source and reproduce the issue.

One painful issue I encountered was when my CloudStack hosts acquired hostnames in DNS, a few days after the cloud was set up. The NFS server providing my primary storage had the cloud machines’ IP addresses in its /etc/exports ACL to ensure they could mount and write with no root squashing. But when the hosts entered DNS, the server started rejecting their mount requests and write updates too! (I was using a wild-card for the hosts’ IP addresses for access control, which may explain why the hosts were rejected after acquiring hostnames; this NFS guide warns: “Do not use wildcards in IP addresses, as they are intermittent in IP addresses.”) This led to the quizzical error, when trying to start a new VM:

2012-10-18 22:18:06,516 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-93:job-522) Cannot allocate this pool 207 for storage since its allocated percentage: Infinity has crossed the allocated pool.storage.allocated.capacity.disablethreshold: 0.85, skipping this pool

Luckily, again I was not the first to encounter this weird Infinity issue; indeed, like the previous poster, my MySQL database had 0 for the allocated and available bytes for my primary storage, and after fixing the /etc/exports all was happy.

Adding Hosts

For a few quick tests, I had removed one of my agent nodes and came across this strange issue when trying to add the host back to the original zone:

libvir: Storage error : Storage pool not found: no pool with matching uuid
2012-10-15 03:52:00,287{GMT} WARN  [utils.nio.Task] (Agent-Handler-1:) Caught the following exception but pushing on
java.lang.NullPointerException
        at com.cloud.agent.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:504)
        at com.cloud.agent.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:57)
        at com.cloud.agent.resource.computing.LibvirtComputingResource.initialize(LibvirtComputingResource.java:2978)
        at com.cloud.agent.Agent.sendStartup(Agent.java:316)
        at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:846)
        at com.cloud.utils.nio.Task.run(Task.java:79)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

This had me puzzled for a bit, as I had not needed to use virsh(1) much before this adventure. It seems the old storage pools were still present, which was preventing their being re-created. I believe the error was akin to:

[clayb@cloud_machine ~]$ sudo virsh pool-create /tmp/test_pool.xml
error: Failed to create pool from /tmp/test_pool.xml
error: operation failed: Storage source conflict with pool: '031d9475-063d-30b5-b910-7ee710ff81b0'

Eventually, I did the following to good success:

Checked if virsh reported any storage pools in existence (since this host was not successfully added, it should not have had any) — virsh pool-list
Ensured all pools were destroyed with virsh pool-destroy <pool>
Cleaned-up any residual files in /etc/libvirt/storage/
Cleaned-up my residual files in my machine’s local storage volume /var/lib/libvirt/images
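The first of the steps above is easy to script. This is a minimal Python sketch which lists the storage pools libvirt still knows about, so a stale pool can be spotted before re-adding a host. It assumes the usual three-column table (Name, State, Autostart) that virsh pool-list prints; `parse_pool_list` and `list_pools` are hypothetical helper names of my own.

```python
import subprocess


def parse_pool_list(output):
    """Turn `virsh pool-list` text into a list of (name, state) tuples."""
    pools = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator and the column header.
        if len(fields) < 2 or fields[0] == "Name":
            continue
        pools.append((fields[0], fields[1]))
    return pools


def list_pools():
    """Ask libvirt for every pool, active or not (needs virsh and suitable rights)."""
    result = subprocess.run(
        ["virsh", "pool-list", "--all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pool_list(result.stdout)
```

On a host which was never successfully added, `list_pools()` returning anything at all is a hint that leftovers need cleaning up.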

Passwords

CloudStack encrypts passwords from what I have seen in the MySQL database and configuration files. Indeed this is a change for the 3.0 release. To encrypt passwords like CloudStack one can do the following.

Admin Password Reset

When needing to reset the administrator password for CloudStack, one must resort to modifying the MySQL database, but the procedure is quite painless.

System VM passwords

There is a useful setting if you want to ensure the system VMs are only accessible via SSH key, called system.vm.random.password which should be good. I have verified the /etc/shadow file has a different hash for root between system VM instances, but I did have a problem on my first setting of this. I got the following log message, after seeing the management server was in a wonky state (no MySQL logins worked):

[cbaenziger1@cloud_machine ~]$ grep 'Error while decrypting:' /var/log/cloud/management/management-server.log
2012-10-15 06:27:28,120 DEBUG [utils.crypt.DBEncryptionUtil] (main:null) Error while decrypting: VG3fYbhx

Your failed decrypting string will likely vary; mine did! I verified and tried resolving the issue by doing the following:

mysql> USE cloud;
mysql> SELECT name,value FROM configuration WHERE value LIKE "%VG3fYbhx%";
| system.vm.password | VG3fYbhx |
mysql> UPDATE configuration SET value = "false" WHERE name = "system.vm.random.password";
Query OK, 1 row affected (0.07 sec)
Rows matched: 1  Changed: 1  Warnings: 0

But, I still had issues starting the Management Server:

2012-10-15 06:38:28,977 DEBUG [utils.crypt.DBEncryptionUtil] (main:null) Error while decrypting: VG3fYbhx
2012-10-15 06:38:28,978 ERROR [utils.component.ComponentLocator] (main:null) Unable to load configuration for management-server from components.xml net.sf.cglib.core.CodeGenerationException: org.jasypt.exceptions.EncryptionOperationNotPossibleException-->null

Realizing that the value in system.vm.password did not look like an encrypted password, I looked in the database for another encrypted string I could use and ended up copying the value from secstorage.copy.password. Then, I could start the Management Server; and have since re-enabled system.vm.random.password but I do not see the value in system.vm.password changing.

Default Passwords

I have also seen one security disclosure on CloudStack. And while CloudStack seems solid, as with my Hadoop cluster (which as of CDH3U5 does not have such holistic security), I will certainly keep my cloud infrastructure off the hostile Internet.

Make sure to change the default for the admin user too!

Posted in Uncategorized | Leave a response

Configuration Files

By Clay on October 21, 2012

Many systems have requirements to store configuration parameters. In these systems, a number of choices can be made for how to store that data; sometimes this diversity is painful, however. Choices for storing configuration data are often:

Firefox uses and Apple often chooses to use sqlite3 databases¹
Python programs often use ConfigParser to process initialization (.ini) files
Apache Ant, amongst many other applications, consumes XML configurations
Java programs often use Properties files — in XML or traditional form
JavaScript Object Notation (JSON) is used by programs for configuration; a number of my group’s programs use this, for example
Domain Specific Languages (DSLs) are sometimes used. For example, the Puppet configuration management system has its own DSL written in Ruby

This diversity of configuration formats sometimes sees cross pollination, however. Sometimes an application only reads one format but another application only outputs another. Sometimes one has a toolset which works with only one format, and many an application grown organically can find itself using many formats.

Annoyingly, not all formats support the same set of features either. For example, SQLite3 and XML can be multidimensional: SQLite3 supports multiple tables of N rows by M columns in a single SQLite3 file, while XML supports a hierarchical tree structure of tags with multiple leaves, using attributes on tags. JSON is comparable to XML, offering rich structure for organizing one’s data. The initialization file implementation in Python is only a two-level hierarchy; Java Properties files are flat but often use Java dot-notation to make namespaces which can represent an arbitrarily deep hierarchy. Domain specific languages can be as rich or simple as desired, but no common structure or properties are inherent in such a configuration format.

This asymmetry can make conversion across formats difficult in general but one should always be able to go from a less rich to a more rich structure. And when possible, it is nice to have some tools to go between them.
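As an illustration of going from a less rich to a more rich structure, here is a minimal Python sketch which expands flat, dot-notation property names into a nested dictionary (which could then be written out as JSON, for example). The function name `nest_properties` is my own, and treating every dot as a namespace separator is an assumption; a properties file is free to use dots with no such meaning.

```python
def nest_properties(flat):
    """Expand {'a.b.c': 'v'} style keys into nested dictionaries."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        # Walk (or create) one dictionary level per namespace component.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError("%r is both a value and a namespace" % part)
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError("%r is both a value and a namespace" % key)
        node[leaf] = value
    return nested
```

The ValueError cases show where the conversion stops being free: a flat file may happily define both a and a.b, which has no direct equivalent in the nested form.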

Java Properties Files

Using with Python

One can find a recipe to read and write Java Properties files from Python. This re-implementation of the java.util.Properties class provides a convenient interface for working with properties files:

>>> import properties
>>> p=properties.Properties()
>>> with open("my.properties") as f:
...     p.read(f)
>>> p.getPropertyDict()['some_property_I_want']
'this_is_not_the_property_value_you_want!'
>>> p.setProperty('some_property_I_want', 'with_the_value_I_want!')
>>> # re-open in write mode so store() can write the updated properties back
>>> with open("my.properties", "w") as f:
...     p.store(f)

Properties in XML

One can write an XML version of a Java properties file within Java by simply calling the storeToXML() method on a Properties() object.
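Going the other way, the XML which storeToXML() writes is simple enough to read from Python with only the standard library. This is a minimal sketch, assuming the usual layout of a properties root element holding entry elements with a key attribute; the function name `read_xml_properties` is my own.

```python
import xml.etree.ElementTree as ET


def read_xml_properties(xml_bytes):
    """Return a dict from the <entry key="...">value</entry> elements."""
    root = ET.fromstring(xml_bytes)
    # An empty <entry/> has text None; treat that as an empty string.
    return {entry.get("key"): entry.text or "" for entry in root.iter("entry")}


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>example only</comment>
<entry key="queueName">default</entry>
<entry key="frequency">1440</entry>
</properties>
"""

properties = read_xml_properties(SAMPLE)
```

The optional comment element is ignored, and all values come back as strings, just as they would from java.util.Properties.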

Oozie’s XML outputs

I use a lot of Hadoop programs which store their outputs in various XML forms, but one which always drives me nuts is Apache Oozie. Oozie will dump out a workflow job configuration in XML, but not as a standard Java XML properties file. Oozie takes in the workflow properties as a non-XML Java properties file, but it will not accept the XML it produces. However, via the joys of XSL Transformations (XSLT), we can write a simple script which can convert between the two!

An example (Oozie) Properties file in XML:

<configuration>
  <property>
    <name>date</name>
    <value>2011-12-01T00:00Z</value>
  </property>
  <property>
    <name>endTime</name>
    <value>2011-12-01T23:59Z</value>
  </property>
  <property>
    <name>frequency</name>
    <value>1440</value>
  </property>
  <property>
    <name>group.name</name>
    <value>users</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <property>
    <name>oozie.coord.application.path</name>
    <value>/export/my_workflow/coordinator.xml</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://namenode.example.com:9000/user/john_doe/my_workflow/workflow.xml</value>
  </property>
  <property>
    <name>queueName</name>
    <value>default</value>
  </property>
  <property>
    <name>startTime</name>
    <value>2011-12-01T00:00Z</value>
  </property>
  <property>
    <name>user.name</name>
    <value>john_doe</value>
  </property>
</configuration>

General XSLT transformation from XML to Java properties file

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" version="1.0" omit-xml-declaration="yes"/>
  <xsl:template match="/*">
    <xsl:for-each select="property">
      <xsl:value-of select="name"/><xsl:text>=</xsl:text><xsl:value-of select="value"/><xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Resulting Java properties file

date=2011-12-01T00:00Z
endTime=2011-12-01T23:59Z
frequency=1440
group.name=users
jobTracker=jobtracker.example.com:9001
nameNode=hdfs://namenode.example.com:9000
oozie.coord.application.path=/export/my_workflow/coordinator.xml
oozie.wf.application.path=hdfs://namenode.example.com:9000/user/john_doe/my_workflow/workflow.xml
queueName=default
startTime=2011-12-01T00:00Z
user.name=john_doe

For those who are not very programming language literate, on Linux one can use the simple libxslt tool xsltproc(1) to run this conversion. For example, to take in my_config.xml in the XML format above and produce the same configuration in Java properties format, one would run: xsltproc to_property.xslt my_config.xml > my_config.properties
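If xsltproc is not to hand, the same conversion is only a few lines of Python. This is a sketch which mirrors the stylesheet above using the standard library's ElementTree; like the XSLT, it does no escaping of characters which are special in properties files, and the function name `configuration_to_properties` is my own.

```python
import xml.etree.ElementTree as ET


def configuration_to_properties(xml_text):
    """Convert <configuration><property><name/><value/></property>... to name=value lines."""
    root = ET.fromstring(xml_text)
    lines = []
    # Only direct <property> children of the root are read, as in the XSLT.
    for prop in root.findall("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value", default="")
        lines.append("{}={}\n".format(name, value))
    return "".join(lines)
```

Reading my_config.xml, passing its contents through this function and writing the result to my_config.properties gives the same file as the xsltproc command.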

JSON

JSON provides a rich language for expression similar to XML. JSON is often used for data interchange, and is now common in AJAX web requests.

Using with Python

Python has a very feature-rich JSON module which takes the JSON objects and arrays, and all their pairs and members, and represents them as native Python list() and dict() objects. Further, the JSON module can provide very rich encoding and decoding functionality, as evidenced in the module’s PyDoc, particularly when using hooks for encoding and decoding.
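To make the hooks concrete, here is a minimal sketch which round-trips a configuration dictionary containing a datetime, a type JSON has no native form for. The `__datetime__` marker key and the two hook function names are conventions I invented for this example; only `json.dumps(default=...)` and `json.loads(object_hook=...)` come from the module itself.

```python
import json
from datetime import datetime


def encode_extra(obj):
    """`default` hook: called for objects json cannot serialize natively."""
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError("cannot serialize %r" % type(obj))


def decode_extra(mapping):
    """`object_hook`: called with every decoded JSON object (as a dict)."""
    if "__datetime__" in mapping:
        return datetime.fromisoformat(mapping["__datetime__"])
    return mapping


config = {
    "queueName": "default",
    "startTime": datetime(2011, 12, 1, 0, 0),
    "hosts": ["namenode.example.com", "jobtracker.example.com"],
}

text = json.dumps(config, default=encode_extra, sort_keys=True)
restored = json.loads(text, object_hook=decode_extra)
```

After the round trip, `restored` compares equal to `config`, with the JSON array back as a Python list and the marked object back as a datetime.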

Posted in Hadoop, Python | Leave a response

Alias does not work in KSH functions

By Clay on August 14, 2012

Alias does not work as expected in KSH functions

One will sometimes read that a function is recommended over an alias, but it is not always obvious why. Certainly, one can do more in a more syntactically elegant way in a function than an alias; but why else?

Today, I ran into an aggravating situation. I had a command which is set up by sourcing a script with some alias commands in it, but the script was failing to resolve the alias. I got alias_command: command not found instead of proper resolution. Even more aggravating, running the type built-in on the alias showed that it existed and was set as expected, yet calling it still failed.

The problem seems to be that the alias is unavailable when the function is called. I do not quite understand why, but in the O’Reilly book Classic Shell Scripting: Hidden Commands that Unlock the Power of Unix, Figure 7-1 shows that alias resolution and function lookup happen at very different points in the parsing stack. It seems, though, that the eval loop for calling a function should resolve the alias?

See below for a simple test-case to present the issues; and show some work-arounds.

Simple alias definition in a function

This fails! This is the initial example of the failed behavior.

#!/bin/ksh

# aliases do not seem to work in the function in which they are defined
function alias_in_function_does_not_work {
    print "\n\nalias in a function does not work:"
    alias bar='ls'
    type bar
    bar
}

alias_in_function_does_not_work

alias in a function does not work:
bar is an alias for ls
/tmp/t.sh[11]: alias_in_function_does_not_work[8]: bar: not found [No such file or directory]

Simple alias definition in a function with eval

This works! The extra eval in the following code block causes the shell to properly parse the alias.

#!/bin/ksh

# aliases work in functions if preceded with eval
function alias_with_eval_works {
    print "\n\nalias in a function works with eval:"
    alias foobar='ls'
    type foobar
    eval foobar
}

alias_with_eval_works

alias in a function works with eval:
foobar is an alias for ls
file

Functions can replace aliases successfully

This works! A function can replace an alias and perform (often) the same behavior. However, more thought is needed if you want to use alias substitution in clever ways.

#!/bin/ksh

# a function being called by a function is okay
function baz {
    print "functions work:"
    ls
}

function function_works {
    print "\n\nfunctions calling functions work:"
    baz
}

function_works

functions calling functions work:
functions work:
file

Where the alias is defined matters

I do not recommend this! Here we show the code is indeed linearly parsed. A function can use an alias defined earlier in the code. (But this gets awfully convoluted quickly!)

#!/bin/ksh

# this function is defined before the alias exists,
# so the alias will not resolve inside it
function pre_existing_alias_does_not_work {
    print "\n\nalias already defined does not yet work:"
    type foo
    foo
}

# example showing aliases do not resolve in functions
print "\n\nbare alias works:"
alias foo='ls'
type foo
foo

pre_existing_alias_does_not_work

# this function is defined after the alias exists,
# so the alias resolves inside it
function pre_existing_alias_now_works {
    print "\n\nalias already defined now works:"
    type foo
    foo
}

pre_existing_alias_now_works

bare alias works:
foo is an alias for ls
file

alias already defined does not yet work:
foo is an alias for ls
/tmp/t.sh[16]: pre_existing_alias_does_not_work[7]: foo: not found [No such file or directory]

alias already defined now works:
foo is an alias for ls
file

Defining aliases in functions fails for other functions

This fails! One cannot create an alias in a function and then use it in another function, but one can use it in the main-line code.

#!/bin/ksh

function setup_alias {
    print "\n\nalias setup..."
    alias foo='ls'
}

# a pre-existing alias can not be called if
# the alias was defined in a function in the script
function use_alias {
    print "\n\nalias already defined does not work:"
    type foo
    foo
}

setup_alias
use_alias
print "\n\nbare alias works:"
type foo
foo

alias setup...

alias already defined does not work:
foo is an alias for ls
/tmp/t.sh[18]: use_alias[14]: foo: not found [No such file or directory]

bare alias works:
foo is an alias for ls
file

In summary…

If you write shell scripts with functions, alias resolution really matters, but it may not be obvious how or why an alias is getting resolved. Certainly, if you have answers or resources to better explain this, please leave a comment.

Posted in ksh | 2 Responses

Accessing Kerberized HDFS via Jython

By Clay on May 4, 2012

Why Kerberos?

So, you want to do some testing of your shiny new Oozie workflow, or you want to write some simple data management task — nothing complex — but your cluster is Kerberized?

Certainly there are many reasons to use Kerberos on your cluster. A cluster with no permissions is dangerous in even the most relaxed development environment, while simple Unix authentication can suffice for some sharing of a Hadoop cluster; but to be reasonably sure people are not subverting your ACLs or staring at data they should not be, Kerberos is currently the answer.

Posted in Hadoop, Jython | 2 Responses

When do people work?

By Clay on September 26, 2010

Ever wonder when people are actually working?

It can be hard answering, “when are people at work?” On a distributed team, with many co-workers and the typical corporate dotted-line type relationships, it is even harder! Inevitably, schedule shifts and desired schedules go un-communicated. A few years ago, this occurred for folks I worked with.

Posted in analysis, boxplots, e-mail, Gnu R, statistics | Leave a response

How I work with IPS repos from the slim_source gate

By Clay on August 5, 2010

Oh, how I knew System V packages…

Back in the bad old days before ON and slim_source had moved to building only IPS packages, one could pkgadd -d <location> SUNW<package> and easily drop their test code on a machine. Now with the move to IPS packages getting the test code to a machine can be much easier but set up is a bit more complicated. There is a tool to do this automatically for ON called onu (see it in action here). However, for slim_source it is pretty easy to do manually — once you know what you need to do.


Posted in Solaris Install | Leave a response

Technical and Personal Ramblings
