Cloud-Based Econometrics and Statistics Software

While the rest of the world was busy with Apple’s iCloud, I spent the past few weeks working on a large-scale empirical project.

In the process I learnt a few things about cloud-based options for statistics and econometrics. The situation has developed quite a bit since Robert Grossman’s earlier post on using Amazon’s cloud for this purpose. Amazon now has a browser-based graphical dashboard for managing your cloud-based machines, instead of relying on command-line tools.

In my view, there are three relevant areas that are at different stages of cloud-readiness for empirical economists and statisticians:

1. Databases and datasets

Cloud-based solutions are especially good for large datasets that require scalability. They are also good for research projects that require multiple people to access them (e.g., if your project involves multiple coauthors or research assistants).

For simple projects with small datasets, a shared spreadsheet on Google Docs should suffice. For larger datasets, one good option is Amazon RDS, which is price-competitive and offers both MySQL and Oracle databases; it is easy to maintain and back up. Another option is Microsoft Azure. We use PostgreSQL on Ubuntu EC2 instances for analysing patent data.
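To give a flavour of what this looks like in practice, here is a minimal sketch of querying a cloud-hosted PostgreSQL database from R using the RPostgreSQL package. The RDS endpoint, database, and table names are hypothetical placeholders, not our actual setup.

```r
# Minimal sketch: query a cloud-hosted PostgreSQL database from R.
# The endpoint, database, and table below are hypothetical placeholders.
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(),
                 host     = "mydb.abc123.us-east-1.rds.amazonaws.com",
                 dbname   = "patents",
                 user     = "analyst",
                 password = Sys.getenv("PGPASSWORD"))  # keep secrets out of scripts

# Let the database server do the heavy lifting instead of downloading raw tables
grants <- dbGetQuery(con, "
  SELECT grant_year, COUNT(*) AS n_patents
  FROM   patents
  GROUP  BY grant_year
  ORDER  BY grant_year")

dbDisconnect(con)
```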

One advantage of cloud-based databases is that the technology is now mature. Demand is driven by many other business and scientific applications, so we benefit from positive knowledge spillovers: it is relatively easy and inexpensive to hire an RA with a computer science background to build an SQL-based dataset. A second advantage is that a number of other research databases are now starting to appear in the cloud, making them easy to interface with from cloud-based programs. These include data on patents and genes, as well as US Federal Reserve and Census data. The number of research databases in economics and the social sciences is still small, but growing.
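As a small illustration of this kind of interfacing, the sketch below pulls the US civilian unemployment rate straight from the Federal Reserve’s FRED database into an R session, assuming the quantmod package is installed.

```r
# Fetch a series from the Federal Reserve's FRED database over the web
library(quantmod)

getSymbols("UNRATE", src = "FRED")  # loads the series into the workspace as UNRATE
head(UNRATE)                        # monthly US unemployment rate
```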

2. Regression and statistical software

This is where the situation is disappointing. Most of the popular software packages, including Stata and SPSS, are not widely affordable for cloud use: you quickly hit licensing snags. A small number of private service providers bridge this gap by offering High Performance Computing (HPC) solutions, e.g., for Mathematica, but they are pretty expensive (at least to an academic researcher). MATLAB will work but requires a ‘distributed server’ license that costs a fortune. In general, these software companies want to sell you a 2-core or 4-core license that runs all year on the computer on your desktop. What some of us need instead is a license that runs on 64 cores across 16 machines for just the one month in which we are doing intensive number-crunching. More importantly, we want that license to be easy to transact, not to go through a complicated application and registration process. You might think this sort of licensing doesn’t exist, but it is already happening: software such as Microsoft Windows Server and Oracle can now be rented on Amazon’s AWS cloud for whatever length of time you want, with no transaction costs.

As a result of these issues, if you are on a budget your best bet is the open-source R Project, a statistics and econometrics toolkit whose popularity is growing by leaps and bounds. It runs in the Amazon cloud on both Linux and Windows. By combining R with a programming model known as MapReduce, you can split your program into portions that run on multiple computers and have the results aggregated back elegantly. Here is a good example of using R with MapReduce by Stephen Barr, and another by Jeffrey Breen. I will be looking into using more of this in my projects.
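To make the pattern concrete, here is a toy map-reduce sketch in R using mclapply() from the parallel package. The bootstrap example is my own invention, not from either of the linked posts, and it runs on local cores; in the cloud each chunk would be dispatched to its own machine instead.

```r
# Map-reduce sketch: split 8,000 bootstrap replications into 8 chunks,
# "map" the chunks across cores, then "reduce" the results back together.
# (mclapply() forks, so it parallelises on Linux, as on an EC2 instance.)
library(parallel)

boot_chunk <- function(chunk_id, data, reps) {
  set.seed(chunk_id)  # independent, reproducible random stream per chunk
  replicate(reps, mean(sample(data, length(data), replace = TRUE)))
}

x <- rnorm(10000, mean = 5)

# Map step: 8 chunks of 1,000 replications each, run in parallel
chunks <- mclapply(1:8, boot_chunk, data = x, reps = 1000, mc.cores = 4)

# Reduce step: aggregate the chunk results into one vector
boot_means <- Reduce(c, chunks)
sd(boot_means)  # bootstrap standard error of the sample mean
```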

3. Cloud-based programming

Instead of running mathematical or analytical programs on your desktop, you can run them in the cloud. This works best if you can partition the problem into little chunks that can be worked on independently. For example, we use Perl for text processing of patent data. I know of people who code in Fortran/IMSL or C and generate binaries for optimization and numerical simulations. It is nice to be able to spin up a dozen machines and process the data quickly instead of waiting a week for the results.
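One simple way to partition such work is to have every machine run the same script with a different chunk number. Here is a minimal sketch of that pattern in R; the data directory and the process_file() function are hypothetical stand-ins for whatever your actual processing step is.

```r
# Run as, e.g., `Rscript process_chunk.R 7` on machine 7 of 12.
args     <- commandArgs(trailingOnly = TRUE)
chunk    <- as.integer(args[1])
n_chunks <- 12

files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)

# This machine claims every 12th file, starting from its own chunk number
mine <- files[seq(chunk, length(files), by = n_chunks)]

for (f in mine) {
  result <- process_file(f)  # hypothetical: your per-file analysis goes here
  saveRDS(result, file.path("results", paste0(basename(f), ".rds")))
}
```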

Other considerations

A side benefit of this approach is a quiet office. Some years back, I had a powerful workstation in my office with an 8-disk RAID array, multiple CPUs and dual power supply units. It was really noisy! Also, the cleaner had a habit of switching it off, ruining my calculations. Migrating my data analysis to the cloud now gives me a quiet, peaceful office where I can think and write.

If you have any thoughts or comments about cloud-based solutions, or know of other useful resources or tips, please share them in the comments below. Thanks.

Crime on Melbourne’s Rail System and the Abuse of Statistics

In today’s headline news, the Auditor General reports that the crime rate is falling on Melbourne’s train system (e.g., see here and here). The number of incidents has remained roughly the same, while the number of commuters has gone up. So, the Transport Minister says there is more crime but the trains are safer. The Auditor General seems concerned that “Victoria Police had failed to carry out promised pilot projects designed to minimise passenger perceptions of danger on railway stations and trains”. Shouldn’t they be catching crooks instead of manipulating customer perception?

Well, if we want to play this game of statistics, here are a few additional things to consider. Many crimes probably go unreported, raising questions about data reliability, but let’s put that aside for a moment. The Age reports that the crime rate on Melbourne’s trains is 33 per million passengers. I did a quick web search and, wow, that figure seems to be pretty high. In Boston last year there were 827 major crimes on the MBTA system out of 350 million trips, which works out to only about 2.4 crimes per million, way below the 33 in Melbourne. Even if we only consider assaults as “serious crimes”, which The Age reports make up 17% of the incidents, that works out to 5.6 assaults per million trips. I don’t have time to find data for lots of other cities, but it appears the New York subway carries over 10 million passengers a day and sees only 5.6 crimes per day (roughly 0.6 per million trips), while on the Washington Metro it is about 4.35 per million riders. Is Melbourne’s train system really that safe?
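For what it’s worth, the comparison is easy to reproduce; the snippet below is just arithmetic in R on the figures quoted above.

```r
# Crimes per million passenger trips, from the figures quoted above
rates <- c(Melbourne  = 33,
           Boston     = 827 / 350,  # 827 major crimes over 350m trips
           NewYork    = 5.6 / 10,   # 5.6 crimes/day over ~10m trips/day
           Washington = 4.35)
round(rates, 2)
#>  Melbourne     Boston    NewYork Washington
#>      33.00       2.36       0.56       4.35
```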

A more sensible approach is to accept that crime happens on the train/subway systems of every major city, and to try to tackle the problem. Statistics could be used constructively. For a start, explore the distribution of criminal activity. Melbourne’s trains radiate outwards from the city centre, and some train lines go through neighbourhoods that are much more crime-ridden than others. So we should be looking at the rate of criminal activity on each line separately instead of the average across the whole system. Even better, identify the location and type of criminal activity in each line segment and station, and mine the data to uncover behavioural patterns. This should inform countermeasures, e.g., putting officers on duty, adding surveillance cameras, anticipating risky situations, etc. This way, the police might even earn the respect of commuters, instead of just hoping to manipulate their perceptions of safety.
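To sketch what the per-line analysis might look like, here is a minimal R example on an invented incidents table; the line names, stations, and all the numbers are hypothetical.

```r
# Hypothetical incident counts and passenger volumes by line and station
incidents <- data.frame(
  line    = c("Frankston", "Frankston", "Belgrave", "Belgrave", "Werribee"),
  station = c("Richmond", "Caulfield", "Box Hill", "Ringwood", "Footscray"),
  crimes  = c(120, 85, 40, 55, 95),
  trips_m = c(4.1, 3.2, 2.8, 2.5, 2.9)  # millions of passenger trips
)

# Crimes per million trips by line, rather than one system-wide average
by_line      <- aggregate(cbind(crimes, trips_m) ~ line, data = incidents, sum)
by_line$rate <- with(by_line, crimes / trips_m)
by_line[order(-by_line$rate), ]
```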