Cloud-based Econometrics and Statistics Software

While the rest of the world was busy with Apple’s iCloud, I spent the past few weeks working on a large-scale empirical project.

In the process I learnt a few things about cloud-based options for statistics and econometrics. The situation has developed quite a bit since Robert Grossman’s earlier post on using Amazon’s cloud for this purpose. Amazon now has a browser-based graphical dashboard that makes it easy to manage your cloud-based machines, so you no longer have to rely on command-line tools.

In my view, there are three relevant areas that are at different stages of cloud-readiness for empirical economists and statisticians:

1. Databases and datasets

Cloud-based solutions are especially good for large datasets that need scalability. They are also good for research projects in which several people need access to the data (e.g., if your project involves multiple coauthors or research assistants).

For simple projects with small datasets, a shared spreadsheet on Google Docs should suffice. For larger datasets, one good option is Amazon RDS, which is price-competitive and offers both MySQL and Oracle databases; it is easy to maintain and back up. Another option is Microsoft Azure. We use PostgreSQL on Ubuntu, running on EC2, for analysing patent data.

One advantage of cloud-based databases is that the technology is now mature. Demand is driven by many other business and scientific applications, so we benefit from positive knowledge spillovers. It is relatively easy and inexpensive to hire an RA with a computer science background to build an SQL-based dataset. A second advantage is that a number of other research databases are now starting to appear in the cloud, which makes them easy to interface with cloud-based programs. These include data on patents and genes, as well as US Federal Reserve and Census data. The number of research databases in economics and the social sciences is still small, but growing.
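To give a feel for what an SQL-based research dataset looks like from the analyst's side, here is a minimal Python sketch. It uses the standard library's sqlite3 module with an in-memory database purely as a stand-in for a cloud-hosted PostgreSQL or RDS instance. The table name, column names and rows are all made up for illustration and are not taken from our patent data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a tiny, hypothetical patents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patents ("
        "patent_id TEXT PRIMARY KEY, firm TEXT, grant_year INTEGER)"
    )
    # Made-up rows, for illustration only.
    rows = [
        ("P1", "Acme", 2008),
        ("P2", "Acme", 2009),
        ("P3", "Beta", 2009),
        ("P4", "Acme", 2009),
    ]
    conn.executemany("INSERT INTO patents VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def patents_per_firm_year(conn):
    """Return (firm, grant_year, count) rows, a typical panel-building query."""
    cur = conn.execute(
        "SELECT firm, grant_year, COUNT(*) "
        "FROM patents GROUP BY firm, grant_year "
        "ORDER BY firm, grant_year"
    )
    return cur.fetchall()

if __name__ == "__main__":
    for row in patents_per_firm_year(build_demo_db()):
        print(row)
```

With a real cloud database, only the connection step should need to change (a PostgreSQL driver and a host name instead of sqlite3), while a query like this one can stay largely the same. That is part of why coauthors and RAs can share one dataset so easily.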

2. Regression and statistical software

This is the disappointing area. Most of the popular packages, including Stata and SPSS, are not affordable for cloud use because you hit licensing snags. A small number of private service providers bridge this gap by offering High Performance Computing (HPC) solutions, e.g., for Mathematica, but they are pretty expensive (at least for an academic researcher). Matlab will work, but it requires a ‘distributed server’ license that will cost a fortune. In general, these software companies want to sell you a 2-core or 4-core license that runs year-round on the computer on your desktop. What some of us need instead is a license that runs on 64 cores across 16 machines for just the one month in which we are doing intensive number-crunching. Just as importantly, we want that license to be easy to transact, without a complicated application and registration process. You might think this sort of licensing doesn’t exist, but it is already happening with software such as Microsoft Windows Server and Oracle, which you can now rent on Amazon’s AWS cloud for whatever length of time you want, with no transaction costs.

As a result of these issues, if you are on a budget your best bet is the open source “R Project”, a statistical and econometrics toolkit that is growing by leaps and bounds in popularity. It runs in the Amazon cloud on both Linux and Windows. By combining R with a programming model known as MapReduce, you can split your program into portions that run on multiple computers and have the results aggregated back elegantly. Here is a good example of using R with MapReduce by Stephen Barr, and another by Jeffrey Breen. I will be looking into using more of this in my own projects.
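The MapReduce idea itself is simple enough to sketch in a few lines. The example below is in Python rather than R, runs on a single machine, and is not taken from either of the examples mentioned above. It fits a one-regressor OLS line by having each "map" call reduce a chunk of data to its sufficient statistics, and a "reduce" step add those statistics up. The function names are hypothetical; in a real deployment the map calls would be dispatched to separate machines.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarise one chunk of (x, y) pairs by its sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)

def combine(a, b):
    """Reduce step: sufficient statistics from two chunks simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def ols_from_chunks(chunks):
    """Return (intercept, slope) of y = a + b*x using only per-chunk summaries."""
    n, sx, sy, sxy, sxx = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

if __name__ == "__main__":
    # Made-up data lying exactly on y = 1 + 2x, split into three chunks.
    data = [(x, 1 + 2 * x) for x in range(10)]
    chunks = [data[0:4], data[4:7], data[7:10]]
    print(ols_from_chunks(chunks))
```

The point of the design is that no machine ever needs to see the whole dataset: each one returns five numbers, and the final estimate is the same as if the regression had been run on the pooled data.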

3. Cloud-based programming

Instead of running mathematical or analytical programs on your desktop, you can run them in the cloud. This works best if you can partition the problem into small chunks that can be worked on independently. For example, we use Perl for text processing of patent data. I know of people who code in Fortran/IMSL or C and generate binaries for optimization and numerical simulations. It is nice to be able to activate a dozen machines to process the data quickly instead of waiting a week for the results.
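The partitioning step can be sketched as follows. This is a Python illustration rather than the Perl we actually use, and the keyword search, function names and sample abstracts are all hypothetical. A thread pool stands in for the dozen cloud machines; for CPU-bound work on a single machine, ProcessPoolExecutor from the same module offers the same interface.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n_chunks):
    """Split items into n_chunks contiguous pieces whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def count_keyword(abstracts, keyword="laser"):
    """Work done on one chunk: count abstracts that mention the keyword."""
    return sum(keyword in text.lower() for text in abstracts)

def run_partitioned(abstracts, n_workers=3):
    """Process each chunk independently, then add up the partial counts."""
    chunks = partition(abstracts, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_keyword, chunks))
    return sum(partial_counts)

if __name__ == "__main__":
    # Made-up abstracts, for illustration only.
    docs = ["Laser diode", "A pump", "laser cutter", "Gene therapy", "LASER"]
    print(run_partitioned(docs, n_workers=2))
```

This only works cleanly because each chunk can be processed without knowing anything about the others, which is exactly the property to look for before moving a job to a dozen machines.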

Other considerations

A side benefit of this approach is a quiet office. Some years back, I had a powerful workstation in my office with an 8-disk RAID array, multiple CPUs and dual power supply units. It was really noisy! Also, the cleaner had a habit of switching it off, ruining my calculations. Migrating my data analysis into the cloud allows me to now have a quiet and peaceful office, where I can think and write.

If you have any thoughts or comments about cloud-based solutions, or know of other useful resources or tips, please share them in the comments below. Thanks.