These days, everyone is hot to be “in the cloud.” But what exactly does this mean? Is your business “in the cloud” because you use Amazon’s S3 service for an off-site backup? In my opinion, being “in the cloud” means taking on a new approach to operations management. Specifically, there are three areas where the cloud can be of particular value:
- Scalability (the most obvious)
- Cost management (OPEX)
- Architectural enhancements
Scalability
Endless scalability is one of the key selling points of the cloud, and rightfully so. Infrastructure as a service (IaaS) providers like Amazon EC2, NephoScale, and GoGrid all aim to provide one thing: virtual machines (“instances”) of varying CPU/RAM sizes that bill by the hour. The ability to spin up new machines on demand can provide companies with a lot of operational flexibility, which can result in a lot of clever and elegant uses for the cloud.
Platform as a service (PaaS) providers like Microsoft’s Azure allow a layer of abstraction between your code and the server, and in their own way provide the same “bottomless well” of capacity. There are rumors of Microsoft taking on the challenges of running a public IaaS platform, but to date nothing has been announced.
In either IaaS or PaaS, horizontal scalability is effectively free. That is to say, any scale factor that you can “throw more machines” at will feel right at home with IaaS, or with some development effort, PaaS. While there is some wiggle room with vertical scalability, the reality is that if any of your key scale factors are 100% dependent on vertical scale, you need to re-think the way your platform works.
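To make the “throw more machines at it” idea concrete, here is a minimal Python sketch of sizing a horizontally scaled tier. The function name, the per-instance capacity, and the headroom fraction are all illustrative assumptions, not figures from any provider.

```python
import math


def instances_needed(peak_load: float, capacity_per_instance: float,
                     headroom: float = 0.0) -> int:
    """Return how many identical instances are needed to cover peak_load.

    peak_load and capacity_per_instance share a unit (for example,
    requests per second). headroom is spare capacity expressed as a
    fraction of peak_load, so 0.25 means 25% spare.
    All parameters are hypothetical and chosen for illustration.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if peak_load < 0 or headroom < 0:
        raise ValueError("peak_load and headroom must not be negative")
    required = peak_load * (1.0 + headroom)
    # Always keep at least one instance running, even with no load.
    return max(1, math.ceil(required / capacity_per_instance))
```

If the load doubles, the answer is simply a larger instance count, which is exactly the kind of scale factor that fits IaaS well. A workload that cannot be divided this way is the vertical-scale case the paragraph above warns about.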
Cost Management (OPEX)
Putting all the industry buzz aside, there’s a very real reason for companies to be interested in the cloud: cost management. Any company that has an online presence knows that servers aren’t being used at 100% capacity 100% of the time. The reality is that things like architecture, security, and company politics cause us to have machines running different services, all with unused capacity. Add in things like high availability (HA) and disaster recovery (DR), and the amount of idle computing resources can be doubled or tripled. The cloud paradigm allows operations professionals to manage their operating expenditures (OPEX) more effectively, since machines (or instances) that are sitting idle during off-peak hours can simply be shut down until the capacity is needed. Being “in the cloud” also allows you to keep a barebones platform running in another location for DR, whose capacity can be increased rapidly (within minutes) in the event of a site failure at your primary location. As if that weren’t enough, leveraging cloud environments for QA (heavily recommended if you’re running production out of the cloud) allows for much more flexibility than the traditional model of “cloning prod.”
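The savings from shutting down idle instances come down to simple arithmetic. Here is a rough Python sketch that compares a fleet sized for peak and left running all month against one that drops to a baseline during off-peak hours. The 730-hour month, the function names, and the idea of a single flat hourly rate are all assumptions made for illustration. Real provider pricing is more involved.

```python
HOURS_PER_MONTH = 730  # assumed average month length, in hours


def always_on_cost_cents(peak_instances: int, rate_cents_per_hour: int,
                         hours: int = HOURS_PER_MONTH) -> int:
    """Cost of running a peak-sized fleet for the entire month."""
    return peak_instances * hours * rate_cents_per_hour


def elastic_cost_cents(baseline_instances: int, peak_instances: int,
                       peak_hours: int, rate_cents_per_hour: int,
                       hours: int = HOURS_PER_MONTH) -> int:
    """Cost when the fleet runs at peak size only during peak_hours
    and at baseline size for the rest of the month."""
    if not 0 <= peak_hours <= hours:
        raise ValueError("peak_hours must be between 0 and hours")
    off_peak_hours = hours - peak_hours
    instance_hours = (peak_instances * peak_hours
                      + baseline_instances * off_peak_hours)
    return instance_hours * rate_cents_per_hour
```

With made-up numbers of 10 peak instances, 3 baseline instances, 240 peak hours a month, and 10 cents per instance-hour, the always-on fleet comes to 73,000 cents ($730) and the elastic fleet to 38,700 cents ($387).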
Architectural Enhancements
The cloud movement represents a lot of things for business practices, operations management, etc. However, I think the most interesting application of the “cloud paradigm” is in platform architecture. Most importantly, you are no longer bound by the number of physical machines you have at any one time. Let’s take a basic example.
Assume you have a platform that collects some sort of data that your customers regularly look to you to report on. The nature of these reports is such that a month’s worth of data will take roughly 24 hours to process (reports are emailed to the requester, so they’re not an inline operation). Shorter reports (weekly or bi-weekly) take a couple of hours. Based on historical trending, you see that approximately 25% of the report requests are for a month’s worth of data, and the remaining 75% are requests that finish within 2-3 hours. Most importantly, your company has established Service Level Agreements (SLAs) with your customers that govern the average run-time of these reports.
Based on the above assumptions, you would need a number of high-power machines, since at worst you need the month-long job to finish in 24 hours. Since the monthly reports represent the bulk of your computation (even if they aren’t the bulk of the requests), you need hardware that’s capable of keeping up with the most computationally expensive operation, not the least. Even worse, you need to have a number of these machines, since you want to minimize cases where customer A has started a monthly report, and subsequently blocks customer B’s request for a weekly report for more than a day. To take it a step further, these machines are also idle some percentage of the time, causing very expensive assets to go under-utilized.
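The blocking problem with a fixed pool of reporting machines can be shown with a tiny simulation. The Python sketch below makes simplifying assumptions: every request arrives at time zero, requests are served in the order given, and each machine runs one report at a time. The function name is hypothetical.

```python
import heapq
from typing import List, Sequence


def completion_times(durations_hours: Sequence[float],
                     machine_count: int) -> List[float]:
    """Return when each report finishes, in hours after time zero.

    Each report is handed, in order, to whichever machine frees up
    first. Assumes all requests are already waiting at time zero.
    """
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    # Min-heap of the times at which each machine becomes free.
    free_at = [0.0] * machine_count
    heapq.heapify(free_at)
    finished = []
    for duration in durations_hours:
        start = heapq.heappop(free_at)   # earliest available machine
        end = start + duration
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished
```

With two machines and the requests `[24, 24, 2]`, the function returns `[24.0, 24.0, 26.0]`: the two-hour weekly report is not delivered until 26 hours after it was requested, because both machines are tied up with monthly reports. Adding a third machine brings it back to 2 hours, at the price of another expensive machine that is idle most of the time.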
In the world of the cloud, you can very easily address a situation like this, and probably even provide a better end user experience. Assume your reporting workflow changed from “1) someone requests, 2) one of the reporting servers takes the request, and 3) crunches until done” to “1) someone requests, 2) a new instance is created to generate this report, and 3) that instance crunches that report until it’s done, after which the instance shuts down.” In this case, you have raised the throughput of your reporting application to be almost limitless, and you’ve done so while lowering your OPEX (which goes hand in hand with minimizing under-utilized resources). The new approach also allows you to create specific instance sizes for specific types of reports, so that you could “slow down” or “speed up” a report’s processing by starting either a bigger or smaller instance than is necessary.
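The “one instance per report” workflow described above can be sketched in a few lines of Python. Everything named here is hypothetical: `FakeCloud` is an in-memory stand-in rather than any real provider SDK, the `"small"` and `"large"` size names are invented, and the 28-day cutoff for what counts as a monthly report is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ReportRequest:
    customer: str
    days_of_data: int


@dataclass
class FakeCloud:
    """In-memory stand-in for a provider API (not a real SDK)."""
    running: Dict[str, str] = field(default_factory=dict)
    launched: List[str] = field(default_factory=list)
    next_id: int = 1

    def launch(self, size: str) -> str:
        instance_id = f"i-{self.next_id}"
        self.next_id += 1
        self.running[instance_id] = size
        self.launched.append(size)
        return instance_id

    def terminate(self, instance_id: str) -> None:
        del self.running[instance_id]


def choose_instance_size(request: ReportRequest) -> str:
    # Assumed rule: 28 or more days of data counts as a monthly report.
    return "large" if request.days_of_data >= 28 else "small"


def run_report(request: ReportRequest, cloud: FakeCloud,
               generate: Callable[[str, ReportRequest], object]) -> object:
    """Start a dedicated instance, build the report, then shut it down."""
    instance_id = cloud.launch(choose_instance_size(request))
    try:
        return generate(instance_id, request)
    finally:
        # Terminate even if report generation fails, so nothing idles.
        cloud.terminate(instance_id)
```

Changing the rule in `choose_instance_size` is where you would “speed up” or “slow down” a given type of report. The `finally` block is the part that keeps OPEX down: an instance exists only for as long as its report is being generated.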
The above is a very simple illustration of how one could leverage the unlimited capacity that cloud providers love to talk about. By making a very small tweak to the reporting workflow, we were able to 1) decrease monthly operating costs and 2) add more control over how quickly the reporting jobs were carried out.
As the world becomes more familiar with “the cloud,” I think we will start seeing an influx of really clever and elegant uses that leverage the endless capacity of the cloud. I, for one, can’t wait to see what you guys come up with.


