Cloud computing and the “me versus you” problem

A Personal Workspace

I’m finally getting back to blogging after spending a couple of months traveling and then catching up with work. This week I was invited to speak at a “guru forum” of managers and academics who work in information technology. Among the many issues that were discussed, two conflicting trends were identified. On the one hand, many corporate organizations are moving towards cloud services and all-in-one outsourced solutions (Oracle, SAP, IBM, …). On the other hand, individuals are moving towards a “bring your own” model, bringing their own computers, e-books, cellphones, iPads and other devices to their workplaces. With the advent of smartphones and social media platforms such as Facebook, computing is becoming more consumer-centric and primarily a means for social interaction, rather than just a tool for specific tasks like word processing and accounting.

These opposing trends create a disconnect at the workplace between the ability of firms to manage and control information (especially proprietary information) and the desire to give employees flexibility and freedom in choosing the tools they really want to use. My view is that the trend towards consumer-centric computing will dominate the other paradigm. There is no turning back the preferences of modern information workers who grew up with their iPads, Android phones and Kindles. Companies should embrace rather than fight the trend.

How do we solve this “me versus you” problem, i.e., how do we organize information on multiple devices in a way that separates private from work and other shared information in an easy but manageable way? Existing solutions are unsatisfactory because they do not adapt to the different and changing contexts that individuals find themselves in. Companies like Apple, SAP and Oracle take a fully integrated approach, allowing you to run everything on their software and leveraging their own cloud solutions, treating each device as a client. Taking this to the extreme, you can run entire virtual machines from your own device with everything hosted on the service provider, such as via Amazon EC2 or OnLive. Unfortunately this is often an all-or-nothing proposition, so while it creates separate contexts, the operation across contexts is not seamless. You’re basically running separate computers (or syncing to separate clouds) from within your own device, and it is slow and clunky to inter-operate between them.

In contrast, other firms like Dropbox provide services that integrate into your existing applications and folders, but end up being highly fragmented especially when it comes to setting permissions and giving access. Each application and each collaborator needs to be authenticated, so coordination can be a hassle. This week my colleague tried to set up a shared Dropbox folder for the faculty members at our school, and it seemed a lot more of a hassle than it needed to be, especially the bit about inviting each user and trying to get them to actually sign up to the Dropbox cloud.

The good news is that the solution to the “me versus you” problem is closer at hand than many might think. The architecture for such a solution already exists in products like Google Circles and VMware but is not yet pervasive. Here’s an example of what one such solution might look like. At present most operating systems support multiple workspaces, but for now they are all tied to the same set of permissions and applications. Well, imagine a future in which each workspace on your device is authenticated to a different set of applications and clouds. For example, your device could include a personal workspace that authenticates to Apple and Dropbox and which contains your personal files, apps and Facebook page. A second workspace could authenticate to your office, with the IT system at your office determining what apps and cloud services are made available and which of these you can transfer across workspaces. A third workspace could be one created by your friend so that when you visit her house, her workspace would appear along with some of the data and services from her home network that she is willing to share with you.
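To make the idea a little more concrete, here is a minimal sketch in Python of how such workspaces might be modelled. Everything in it is hypothetical: the names Workspace, Device, attach, detach and can_transfer, the service names, and the rule that the owner of the source workspace decides what may leave it are my own illustrative assumptions, not a description of any existing product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Workspace:
    """One context on a device (hypothetical model, not a real API)."""
    name: str
    authenticates_to: frozenset          # clouds / identity providers it logs in to
    services: frozenset                  # apps and cloud services visible inside it
    exportable: frozenset = frozenset()  # services whose data may leave the workspace


@dataclass
class Device:
    """A device holding whichever workspaces apply in the current context."""
    workspaces: dict = field(default_factory=dict)

    def attach(self, ws):
        # A workspace "slides on", e.g. when you arrive at a friend's house.
        self.workspaces[ws.name] = ws

    def detach(self, name):
        # A workspace "slides off" when the context no longer applies.
        self.workspaces.pop(name, None)

    def can_transfer(self, service, source, target):
        # Assumed rule: both workspaces must be attached, and the owner of the
        # source workspace must have marked the service as exportable.
        src = self.workspaces.get(source)
        dst = self.workspaces.get(target)
        if src is None or dst is None:
            return False
        return service in src.services and service in src.exportable


device = Device()
device.attach(Workspace("personal", frozenset({"Apple", "Dropbox"}),
                        frozenset({"photos", "facebook"}), frozenset({"photos"})))
device.attach(Workspace("office", frozenset({"office-it"}),
                        frozenset({"payroll", "calendar"}), frozenset({"calendar"})))
device.attach(Workspace("friend-home", frozenset({"friend-network"}),
                        frozenset({"music"}), frozenset({"music"})))

calendar_ok = device.can_transfer("calendar", "office", "personal")
payroll_ok = device.can_transfer("payroll", "office", "personal")
device.detach("friend-home")  # you have left your friend's house
music_ok = device.can_transfer("music", "friend-home", "personal")
```

In this toy model the office IT system allows the calendar to cross into the personal workspace but not payroll, and your friend's music stops being reachable once her workspace is detached.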

A small number of us already have something close to this setup running on our computers by using multiple virtual machines that are active simultaneously. But it isn’t the same thing. I’m thinking of something with much more integration than is available in existing virtual machines and with much less of the “heavy machinery” that is needed to support multiple operating systems on the same machine (the action is in the data and apps, not in the operating system itself anymore). I also have in mind something more dynamic, for example with the ability to seamlessly add or remove workspaces when the context around a person changes. In the example above, if your friend defines a workspace that is shared with you when you visit her, that workspace should actually exist in a virtual sense, and it should slide on and off your various devices in a consistent manner including your smartphone, iPad and notebook computer.

Granted, the ideal solution in my head might be a bit far-fetched, and I don’t know what it would cost in terms of implementation and adoption. However, the fundamental issues are of great concern among industry practitioners such as those attending the IT Guru forum, so I suspect that over the next few years entrepreneurial firms will end up exploring solutions and frameworks along these lines, and that something like this will eventually become prevalent.

Cloud-based Econometrics and Statistics Software

While the rest of the world was busy with Apple’s iCloud, I spent the past few weeks working on a large-scale empirical project.

In the process I learnt a few things about cloud-based options for statistics and econometrics. The situation has developed quite a bit since Robert Grossman’s earlier post on using Amazon’s cloud for this purpose. Amazon now has a browser-based graphical dashboard to easily manage your cloud-based machines, instead of relying upon command line tools.

In my view, there are three relevant areas that are at different stages of cloud-readiness for empirical economists and statisticians:

1. Databases and datasets

Cloud-based solutions are great, especially for large datasets that require scalability. They are also good for research projects in which multiple people need access to the data (e.g., if your project involves multiple coauthors or research assistants).

For simple projects with small datasets, a shared spreadsheet on Google Docs should suffice. For larger datasets, one good option is Amazon RDS, which is price-competitive and offers both MySQL and Oracle databases; it is easy to maintain and back up. Another option is Microsoft Azure. We use PostgreSQL and Ubuntu on EC2 for analysing patent data.

One advantage of cloud-based databases is that the technology is now mature. Demand is driven by many other business and scientific applications. We therefore benefit from positive knowledge spillovers. It is relatively easy and inexpensive to hire an RA with a computer science background to build an SQL-based dataset. A second advantage is that a number of other research databases are now starting to appear in the cloud, making them easy to interface with cloud-based programming. These include data on patents and genes, as well as US Federal Reserve and Census data. The number of research databases in economics and the social sciences is still small, but growing.
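For readers who have not worked with an SQL-based dataset, here is a rough sketch of what one looks like in Python. It uses an in-memory SQLite database from the standard library purely as a stand-in for a hosted database; the patents table, its columns and the three sample rows are made up for illustration and are not our actual schema or data.

```python
import sqlite3

# In-memory SQLite stands in for a cloud-hosted database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per patent.
conn.execute(
    """
    CREATE TABLE patents (
        patent_id  TEXT PRIMARY KEY,
        assignee   TEXT NOT NULL,
        grant_year INTEGER NOT NULL
    )
    """
)

# Made-up sample rows, inserted with parameter placeholders.
sample_rows = [
    ("P1", "Acme", 2009),
    ("P2", "Acme", 2010),
    ("P3", "Globex", 2010),
]
conn.executemany("INSERT INTO patents VALUES (?, ?, ?)", sample_rows)
conn.commit()


def patents_per_year(connection):
    """Return (grant_year, number_of_patents) pairs, ordered by year."""
    cursor = connection.execute(
        "SELECT grant_year, COUNT(*) FROM patents "
        "GROUP BY grant_year ORDER BY grant_year"
    )
    return cursor.fetchall()


counts = patents_per_year(conn)
```

The same query would run largely unchanged against a hosted database; only the connection line would differ, which is part of why the skills are so easy to hire for.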

2. Regression and statistical software

This is where things are disappointing. Most of the popular software packages are not widely affordable for cloud use, including Stata and SPSS. You hit licensing snags. A small number of private service providers bridge this gap by offering High Performance Computing (HPC) solutions, e.g., for Mathematica, but they are pretty expensive (at least to an academic researcher). Matlab will work but requires a ‘distributed server’ license that will cost a fortune. In general, these software companies want to sell you a 2-core or 4-core license that will run year-long on that computer on your desktop. What some of us need instead is a license that will run on 64 cores across 16 machines for just one month during which we are doing intensive number-crunching. More importantly, we want that license to be easy to transact, without having to go through a complicated application and registration process. You might think this sort of licensing doesn’t exist, but I would argue that it is already happening, including with software such as Microsoft Windows Server and Oracle, which you can now rent on Amazon’s AWS cloud for whatever length of time you want, and with no transaction costs.

As a result of these issues, if you are on a budget your best bet is the open source “R Project”, a statistical and econometrics toolkit that is growing by leaps and bounds in popularity. It runs in the Amazon cloud on both Linux and Windows. By combining R with a software technique known as MapReduce, you can easily split your program into portions that are run on multiple computers and have the results aggregated back elegantly. Here is a good example of using R with MapReduce by Stephen Barr, and another by Jeffrey Breen. I will be looking into using more of this in my projects.
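To give a flavour of the MapReduce idea, here is a small sketch, written in Python rather than R and not taken from either of the examples mentioned above. It fits a one-regressor least-squares line by having each chunk of data report only its sufficient statistics (the map step), which are then added together (the reduce step). The function names and the toy data are my own, and everything runs on one machine here; in a real setup each chunk would live on a different computer.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk of (x, y) pairs as sufficient statistics."""
    n = len(chunk)
    sum_x = sum(x for x, _ in chunk)
    sum_y = sum(y for _, y in chunk)
    sum_xx = sum(x * x for x, _ in chunk)
    sum_xy = sum(x * y for x, y in chunk)
    return (n, sum_x, sum_y, sum_xx, sum_xy)


def combine(a, b):
    """Reduce step: statistics from two chunks simply add up."""
    return tuple(p + q for p, q in zip(a, b))


def ols_from_stats(stats):
    """Recover intercept and slope of y = a + b*x from the pooled statistics."""
    n, sum_x, sum_y, sum_xx, sum_xy = stats
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def mapreduce_ols(chunks):
    return ols_from_stats(reduce(combine, map(map_chunk, chunks)))


# Toy data lying exactly on y = 2 + 3x, split unevenly into three chunks.
data = [(x, 2.0 + 3.0 * x) for x in range(10)]
chunks = [data[0:4], data[4:7], data[7:10]]
intercept, slope = mapreduce_ols(chunks)
```

The point of the design is that only five numbers per chunk travel back over the network, no matter how many observations each machine holds.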

3. Cloud-based programming

Instead of running mathematical or analytical programs on your desktop, you can run them in the cloud. This works best if you can partition the problem into little chunks that can be worked upon independently. For example, we use Perl for text processing of patent data. I know of people who code in Fortran/IMSL or C and generate binaries for optimization and numerical simulations. It is nice to be able to activate a dozen machines to process the data quickly instead of waiting a week for the results.
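Here is a rough sketch of that partitioning pattern, in Python rather than the Perl we actually use. A pool of local threads stands in for the dozen cloud machines, so it shows the structure of the job rather than any real speed-up; the function names and the three sample abstracts are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(documents):
    """Work done on one chunk, independently of every other chunk."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def process_in_parallel(documents, n_workers=4):
    chunks = split_into_chunks(documents, n_workers)
    # Each worker thread stands in for a separate cloud machine (assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # merge the independent results
    return total


# Made-up abstracts standing in for patent text.
abstracts = [
    "A method for cooling",
    "A cooling device",
    "Device and method",
]
word_counts = process_in_parallel(abstracts, n_workers=2)
```

Because no chunk needs to know about any other, adding machines shortens the wait almost proportionally, which is exactly the kind of problem that suits the cloud.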

Other considerations

A side benefit of this approach is a quiet office. Some years back, I had a powerful workstation in my office with an 8-disk RAID array, multiple CPUs and dual power supply units. It was really noisy! Also, the cleaner had a habit of switching it off, ruining my calculations. Migrating my data analysis into the cloud allows me to now have a quiet and peaceful office, where I can think and write.

If you have any thoughts or comments about cloud-based solutions, or know of other useful resources or tips, please share them in the comments below. Thanks.