Ever wonder when people are actually working?
Answering “when are people at work?” can be hard. On a distributed team, with many co-workers and the typical corporate dotted-line relationships, it is even harder! Inevitably, schedule shifts and preferred schedules go uncommunicated. A few years ago, this happened with folks I worked with.
So what does this mean?
I decided to graph people’s e-mail times, as below:
What catalogs when we actually work?
My goal was to passively monitor working hours. I had assumed actively asking would be biased, or would again be forgotten when schedules shift (e.g., the school year starts for someone’s kids, or summer invites a morning bicycle commute). Further, actively asking people to “punch in” and “punch out” would be a pain and pretty foreign to engineers. So, that begs the question: what is a good proxy for working hours?
For many teams, IRC or instant-messenger log-in and log-off times can work. But of course, some folks stay logged in 24 hours a day, 7 days a week, or drop off often due to flaky network connections. Code-commit and bug-filing times can work if everyone on the team commits code and files bugs regularly; if these events are relatively rare, the proxy is less valuable. For my purposes, I settled on e-mail times, as e-mail is very popular in the communities I work in.
Now, unless you happen to be a pack rat, it can be difficult to muster a large corpus of e-mail data. Luckily, if you work on an open source project with an external mailing list, the data is retained in its archives, usually a GNU Mailman list. Mailman has various niceties: one can use the files retained on the server or simply trawl the Pipermail web interface for data on who posted to the list and when. Then it is a simple matter to produce an XML or CSV (comma-separated value) file of who posted when, and use your favorite graphing package to view the data.
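As a concrete starting point, here is a minimal sketch of pulling down Pipermail’s gzipped monthly text archives. The base URL, list name, and months are hypothetical; Pipermail conventionally serves each month as a .txt.gz mbox file alongside the HTML pages.

```python
# Sketch: fetch Pipermail's gzipped monthly mbox archives for a list.
# The base URL, list name, and months below are hypothetical.
import gzip
import urllib.request

BASE = "https://mail.example.org/pipermail/some-list"  # hypothetical
MONTHS = ["2008-January", "2008-February", "2008-March"]

def fetch_month(month):
    """Download and decompress one month's mbox-format archive."""
    url = f"{BASE}/{month}.txt.gz"
    with urllib.request.urlopen(url) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")

# Keep each month's raw mbox text around for parsing later.
archives = {month: fetch_month(month) for month in MONTHS}
```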
I chose to use Python with LXML to parse the OpenSolaris Mailman archives for the list I was interested in and to produce an XML representation of the data (see the script here). From that, I easily constructed a CSV file which I could load into GNU R for slicing in interesting ways.
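My LXML script isn’t reproduced here, but for illustration, a standard-library-only sketch of the same extraction: parse a downloaded mbox-format archive (like the one fetched above) and emit sender/timestamp rows as CSV. The file names are hypothetical.

```python
# Sketch: turn a downloaded mbox archive into (sender, timestamp) rows
# and write them to CSV. File names here are hypothetical; my original
# script parsed the HTML archive pages with LXML instead.
import csv
import mailbox
from email.utils import parsedate_to_datetime

def archive_to_rows(path):
    """Yield (sender, ISO timestamp) for each message in an mbox file."""
    for msg in mailbox.mbox(path):
        date = msg.get("Date")
        sender = msg.get("From", "unknown")
        if not date:
            continue
        try:
            yield sender, parsedate_to_datetime(date).isoformat()
        except (TypeError, ValueError):
            continue  # skip unparseable Date headers

with open("posts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["sender", "timestamp"])
    writer.writerows(archive_to_rows("2008-January.txt"))
```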
Great, data’s fun and all — but now what?
In my case, I was mostly interested in co-workers around North America, to compare our relative working hours and averages. (I was looking at a pretty short period of time, so I was unconcerned with seasonality and other variance.)
Especially in engineering, working hours can range from 40-50 hours a week to “crunch time” sprints of 60-70+. So how can one normalize the data to see trends, such that a 4am “crunch time” e-mail does not throw off an average otherwise centered on 9am-5pm? Though not quite the most robust technique, I settled on a series of box plots, one per co-worker, since medians and quartiles resist outliers far better than means. However, this came with pitfalls when comparing people working in very distant timezones.
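My plots came from a GNU R script, but the idea translates directly; here is a rough equivalent in Python with matplotlib, over made-up send times expressed as minutes since midnight.

```python
# Sketch: one box plot per person of e-mail send times, with times as
# minutes since local midnight. The data is made up; my actual plots
# came from a GNU R script, but matplotlib shows the same idea.
import matplotlib.pyplot as plt

times = {
    "alice": [520, 545, 610, 700, 905, 1010, 1130],  # ~8:40am-6:50pm
    "bob":   [240, 610, 640, 700, 720, 800, 850],    # one 4am outlier
}

fig, ax = plt.subplots()
ax.boxplot(list(times.values()))
ax.set_xticklabels(times.keys())  # boxes sit at ticks 1..n
ax.set_ylabel("minutes since midnight")
ax.set_title("E-mail send times per person")
fig.savefig("email-times.png")

# Note bob's median stays at 700 (11:40am) despite the 4am message;
# a plain mean would be dragged toward the outlier.
```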
Using a simple time format of 0-2359 for 12:00am to 11:59pm, I viewed everyone’s e-mail posts in my local timezone. Unfortunately, this is not very robust: someone in Europe will span from around 8pm to 8am my time, which renders as a box plot with a median of noon my time, when in reality their day is the inverse of what is shown, since it wraps around midnight. (Notice that the folks in California in my example image have a lot of “outliers” in the morning.) However, living in Colorado, roughly the temporal middle of the Americas, this approach worked reasonably well for comparing with other Americans. To finally view the data, I used a GNU R script plus a hastily written script to chomp down the CSV data I had produced from the scraped XML.
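For completeness, a small sketch of that normalization step: converting each message’s Date header into my local timezone and then into the 0-2359 format. The Mountain-time offset is hardcoded and DST is ignored, purely for illustration.

```python
# Sketch: map a message's Date header into the viewer's local timezone,
# then encode it in the 0-2359 HHMM format used above. The Colorado
# offset is hardcoded and DST is ignored, purely for illustration.
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

LOCAL = timezone(timedelta(hours=-7))  # US Mountain, ignoring DST

def hhmm_local(date_header):
    """Map an RFC 2822 Date header to an integer 0-2359 in local time."""
    dt = parsedate_to_datetime(date_header).astimezone(LOCAL)
    return dt.hour * 100 + dt.minute

# A sender in Europe: 9:15am CET lands at 1:15am in Colorado.
print(hhmm_local("Mon, 14 Jan 2008 09:15:00 +0100"))  # -> 115
```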