Posts Tagged ‘cloud’

What is Cloud Computing?

February 16th, 2012

These days, everyone is hot to be “in the cloud.”  But what exactly does this mean?  Is your business “in the cloud” because you use Amazon’s S3 service for an off-site backup?  In my opinion, being “in the cloud” means taking on a new approach to operations management.  Specifically, there are three areas where the cloud can be of particular value:

  1. Scalability (the most obvious)
  2. Cost management (OPEX)
  3. Architectural enhancements

 

Scalability

Endless scalability is one of the key selling points of the cloud, and rightfully so.  Infrastructure as a service (IaaS) providers like Amazon EC2, NephoScale, and GoGrid all aim to provide one thing:  virtual machines (“instances”) that bill by the hour, of varying CPU/RAM sizes.  The ability to spin up new machines on-demand can provide companies with a lot of operational flexibility, which can result in a lot of clever and elegant uses for the cloud.

Platform as a service (PaaS) providers like Microsoft’s Azure add a layer of abstraction between your code and the server, and in their own way provide the same “bottomless well” of capacity.  There are rumors of Microsoft taking on the challenges of running a public IaaS platform, but to date nothing has been announced.

In either IaaS or PaaS, horizontal scalability is effectively free.  That is to say, any scale factor that you can “throw more machines” at will feel right at home with IaaS, or with some development effort, PaaS.  While there is some wiggle room with vertical scalability, the reality is that if any of your key scale factors are 100% dependent on vertical scale, you need to re-think the way your platform works.

 

Cost Management (OPEX)

Putting all the industry buzz aside, there’s a very real reason for companies to be interested in the cloud:  cost management.  Any company that has an online presence knows that servers aren’t being used at 100% capacity 100% of the time.  The reality is that things like architecture, security, and company politics cause us to have machines running different services, all with unused capacity.  Add in things like high availability (HA) and disaster recovery (DR), and the amount of idle computing resources can be doubled or tripled.  The cloud paradigm allows operations professionals to more effectively manage their operating expenditures (OPEX), since machines (or instances) that are sitting idle during off-peak hours can simply be shut down until the capacity is needed.  Being “in the cloud” also allows you to keep a barebones platform running in another location for DR, whose capacity can be increased within minutes in the event of a site failure at your primary location.  As if that weren’t enough, leveraging cloud environments for QA (heavily recommended if you’re running production out of the cloud) allows for much more flexibility than the traditional model of “cloning prod.”
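To make the OPEX argument a bit more concrete, here is a back-of-the-envelope sketch.  Every number in it (the hourly rate, the instance counts, the length of the daily peak) is made up for illustration and is not a real price from any provider; swap in your own.

```shell
#!/bin/sh
# Hypothetical numbers for illustration only; plug in your own.
# Prices are in cents so plain integer shell arithmetic is enough.

# monthly_cost INSTANCES HOURS_PER_DAY CENTS_PER_HOUR -> cents per 30-day month
monthly_cost() {
    echo $(( $1 * $2 * $3 * 30 ))
}

RATE=10          # assumed price: 10 cents per instance-hour
PEAK=10          # instances needed during peak hours
BASE=2           # instances needed off-peak
PEAK_HOURS=8     # peak hours per day

# Traditional model: peak capacity provisioned around the clock.
always_on=$(monthly_cost "$PEAK" 24 "$RATE")

# Cloud model: a small base fleet all day, extra instances only at peak.
base=$(monthly_cost "$BASE" 24 "$RATE")
burst=$(monthly_cost $(( PEAK - BASE )) "$PEAK_HOURS" "$RATE")
elastic=$(( base + burst ))

echo "always on: $always_on cents/month"
echo "elastic:   $elastic cents/month"
```

With these made-up inputs the elastic fleet comes in at less than half the cost of running peak capacity all day.  The real ratio depends entirely on how spiky your traffic is.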

 

Architectural Enhancements

The cloud movement represents a lot of things for business practices, operations management, etc.  However, I think the most interesting application of the “cloud paradigm” is in platform architecture.  Most importantly, you are no longer bound by the number of physical machines you have at any one time.  Let’s take a basic example.

Assume you have a platform that collects some sort of data that your customers regularly look to you to report on.  The nature of these reports is such that a month’s worth of data will take roughly 24 hours to process (reports are emailed to the requester so they’re not an inline operation).  Shorter reports (weekly or bi-weekly) take a couple of hours.  Based on historical trending, you see that approximately 25% of the report requests are for a month’s worth of data, and the remaining 75% are requests that finish within 2-3 hours.  Most importantly, your company has established Service Level Agreements (SLA) with your customers that govern the average run-time of these reports.

Based on the above assumptions, you would need a number of high-power machines, since at worst you need the month long job to finish in 24 hours.  Since the monthly reports represent the bulk of your computation (even if it isn’t the bulk of the requests), you need hardware that’s capable of keeping up with the most computationally expensive operation, not the least.  Even worse, you need to have a number of these machines, since you want to minimize cases where customer A has started a monthly report, and subsequently blocks customer B’s request for a weekly report for more than a day.  To take it a step further, these machines are also idle some percentage of time, causing very expensive assets to go under-utilized.

In the world of the cloud, you can very easily address a situation like this, and probably even provide a better end user experience.  Assume your reporting workflow changed from “1) someone requests, 2) one of the reporting servers takes the request, and 3) crunches until done” to “1) someone requests, 2) a new instance is created to generate this report, and 3) that instance crunches that report until it’s done, after which the instance shuts down.”  In this case, you have raised the throughput of your reporting application to be almost limitless, and you’ve done so while lowering your OPEX (which goes hand in hand with minimizing under-utilized resources).  The new approach also allows you to pick specific instance sizes for specific types of reports, so that you could “slow down” or “speed up” a report’s processing by starting a smaller or bigger instance than the default.
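A rough sketch of the “one instance per report” idea follows.  The report kinds, the size mapping, the AMI ID, and the user-data format are all hypothetical, and the launch command is printed rather than executed, so treat this as an outline of the dispatch logic and not a working scheduler.

```shell
#!/bin/sh
# Sketch only: the report kinds, the size mapping, and the AMI ID are
# hypothetical. The command is printed rather than executed.

# Map a report kind to an instance size: big jobs get big machines.
instance_type_for() {
    case "$1" in
        monthly)          echo "c1.xlarge" ;;
        weekly|biweekly)  echo "m1.small" ;;
        *)                echo "unknown report kind: $1" >&2; return 1 ;;
    esac
}

# Print the launch command for one report request.
# The user-data string tells the (hypothetical) worker image what to run;
# the worker is expected to shut itself down when the report is done.
launch_report_worker() {
    kind=$1
    request_id=$2
    type=$(instance_type_for "$kind") || return 1
    echo "ec2-run-instances ami-xxxxxxxx -t $type -d report=$kind,id=$request_id"
}

launch_report_worker monthly 1001
launch_report_worker weekly 1002
```

Changing the mapping in instance_type_for is the “slow down / speed up” knob: nothing else in the workflow needs to know which size a report runs on.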

The above is a very simple illustration of how one could leverage the unlimited capacity that cloud providers love to talk about.  By making a very small tweak to the reporting workflow, we were able to 1) decrease monthly operating costs and 2) add more control over how quickly the reporting jobs were carried out.

As the world becomes more familiar with “the cloud,” I think we will start seeing an influx of really clever and elegant uses that leverage the endless capacity of the cloud.  I, for one, can’t wait to see what you guys come up with :)


Converting instance-store instances to EBS instances (AWS EC2)

February 27th, 2010

A month or two ago, Amazon Web Services (AWS) announced that their EC2 instances can now boot from an Elastic Block Store (EBS) volume.  This seems like a small change, but in fact has opened up a world of possibilities in the Elastic Compute Cloud (EC2).

EBS provides “block level storage volumes for use with Amazon EC2 instances.”  Keep in mind that prior to this, instances were limited to booting off of S3-backed Amazon Machine Images (AMI), whose root disks were not persistent.  This meant 2 things:

  1. instance-store instances could not be “stopped,” only rebooted or terminated
  2. terminating (or losing) an instance-store instance discarded every change made since boot; a replacement launched from the AMI started over from the AMI defaults

Since EBS originally debuted as a high-performance block device that could be attached to a given instance, booting off of EBS AMIs has been shown to be faster than the traditional instance-store boot as well.

Another benefit is that EC2 instances booted off of EBS volumes can be stopped, which effectively equates to shutting down a machine in the real world.  You are still responsible for the storage charges on the EBS volume while the instance is stopped, but all of the changes made to said instance will persist after you start it again.

The last major benefit of booting off the EBS volumes is that AWS has made it easy for you to create a new EBS AMI from a running EBS-backed instance.  In the Console, right clicking on an EBS instance will yield a new option, “Create Image (EBS AMI).”  This will basically shut down your instance, and proceed to generate a new EBS AMI from the contents on the disk of your instance.  In my experience this command fails roughly 40% of the time, which can be a little frustrating.  I’ve found that if you put an instance into ‘stopped’ state before creating the EBS AMI, the process has a higher chance of success, but will still take anywhere from 20-45 minutes.

The rest of this article will focus on converting an instance-store AMI into an EBS AMI.

In order to perform this conversion, you will need to have an instance-store AMI that is the base OS you’d like to run (for the purposes of this article I used alestic’s Debian 5.0 base image), and access to EC2 via CLI as well as the portal (it’s all do-able from the CLI, but some of the tasks are a LOT easier and quicker through the web console).  The stuff I did in the console will be suffixed with [console],  and the stuff from CLI will be prefixed with #.

1) Booting an instance-store AMI – I executed the following to get a list of the images that fit my criteria (32bit, Debian, base install):

# ./ec2-describe-images --region eu-west-1 --all | grep -i lenny-base | grep i386

IMAGE   ami-b13a6bf4    alestic-32-us-west-1/debian-5.0-lenny-base-20090804.manifest.xml
IMAGE   ami-b33a6bf6    alestic-32-us-west-1/debian-5.0-lenny-base-20091011.manifest.xml

Note:  ec2-describe-images outputs way more data than this; the output above is trimmed for brevity.

Once you have the AMI (newer is better, generally speaking), boot an instance with this AMI:

# ./ec2-run-instances --region eu-west-1 -k $keypair ami-b33a6bf6

Note: $keypair in this case is the name of the keypair used to SSH into the server

2) Customizing the EBS volume – After the instance is up and running, look to see which availability zone the instance is in.  If the region is eu-west-1, the availability zone is going to be either eu-west-1a or eu-west-1b.  Once you know the zone, create a 10gb EBS volume in that same zone [console]; a volume can only be attached to an instance in its own availability zone.

Why 10gb?  10gb is the maximum size for an S3-backed AMI, which makes a 10gb volume the largest any instance-store AMI will be.  Obviously EBS AMIs can exist on larger volumes (all the way up to 1tb in size), and you can easily do so once you have an EBS-backed AMI.

After the EBS volume has been created, attach it to the running instance [console].  Remember the device name you chose for the volume when attaching it (/dev/sdf for example).
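For those who would rather stay in the CLI for the two [console] steps above, here is a sketch.  The instance and volume IDs are placeholders, and the zone is found by pattern-matching the ec2-describe-instances output so the sketch does not depend on the exact column layout.  The create and attach commands are printed, not run.

```shell
#!/bin/sh
# Sketch under assumptions: the IDs below are placeholders, and the zone is
# found by pattern-matching the describe output, not by column position.

REGION=eu-west-1

# Read ec2-describe-instances output on stdin, print the first zone name
# (e.g. eu-west-1a) found in it.
zone_of() {
    grep -o "${REGION}[a-z]" | head -n 1
}

# Print (not run) the commands that create and attach a 10 GB volume.
volume_commands() {
    instance=$1
    zone=$2
    echo "ec2-create-volume --region $REGION --size 10 --availability-zone $zone"
    echo "ec2-attach-volume --region $REGION vol-xxxxxxxx -i $instance -d /dev/sdf"
}

# Typical use (requires the EC2 API tools and credentials):
#   zone=$(ec2-describe-instances --region "$REGION" i-xxxxxxxx | zone_of)
#   volume_commands i-xxxxxxxx "$zone"
```

The vol-xxxxxxxx in the attach command has to be replaced with the volume ID that ec2-create-volume prints.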

In a root shell on the instance:

# mkfs.ext3 /dev/sdf

# mkdir /mnt/target && mount /dev/sdf /mnt/target

# rsync -avHx / /mnt/target

# rsync -avHx /dev /mnt/target

# sync;sync;sync;sync && umount /mnt/target

The above commands did the following:

  • formatted the entire volume /dev/sdf as an ext3 filesystem
  • created directory /mnt/target and mounted /dev/sdf at /mnt/target
  • rsync’d the root instance-store filesystem to the EBS volume (-x keeps rsync on the root filesystem, so /mnt/target itself is not copied)
  • copied the /dev directory, which -x skips when it lives on its own filesystem, from the instance to the EBS volume
  • flushed all pending write ops, and unmounted the EBS volume
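Since mkfs wipes whatever device you point it at, a small guard in front of the first command may be worth having.  This is a sketch: safe_to_format is a hypothetical helper name, and the mount check assumes a Linux /proc/mounts.

```shell
#!/bin/sh
# Sketch: refuse to format anything that is not an unmounted block device.
# safe_to_format is a hypothetical helper name, not part of any tool.

safe_to_format() {
    dev=$1
    # Must exist and be a block device.
    [ -b "$dev" ] || { echo "$dev is not a block device" >&2; return 1; }
    # Must not already be mounted (first column of /proc/mounts is the device).
    if [ -r /proc/mounts ] && grep -q "^$dev " /proc/mounts; then
        echo "$dev is already mounted" >&2
        return 1
    fi
    return 0
}

# Usage: safe_to_format /dev/sdf && mkfs.ext3 /dev/sdf
```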

3) Creating the AMI – At this point, you should have a 10gb EBS volume that shows available [console].  Simply right-click on the volume and create a snapshot for the volume [console].  Once the snapshot has completed, select from the list of available kernels on ec2 with the following command:

# ./ec2-describe-images --region eu-west-1 -o amazon | grep -i xenu

Store the AKI for the kernel you want to use in the environment variable AKI:

# export AKI=aki-xxxxxxxx

Up to this point, we have booted an instance-store AMI, created an EBS volume, synchronized the instance-store filesystem with the EBS volume, and created a snapshot of the EBS volume.  The only thing we need to do now is associate an AKI with the snapshot, and register the end result as an AMI in the EC2 repository.

# ./ec2-register --region eu-west-1 -s $SNAP --name "$NAME" --description "$DESC" --architecture $ARCH \
    --kernel $AKI --root-device-name /dev/sda1

Where $SNAP is the ID of the snapshot, $NAME is the name of your AMI, $DESC is a description of the AMI, and $ARCH is either i386 (for 32-bit) or x86_64 (for 64-bit).  The command will return an AMI ID, which will be yours to boot from once registration finishes!
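If you are unsure which architecture value to use, one option is to derive it from uname -m on the instance you copied the filesystem from.  The helper name ec2_arch and all of the values below are placeholders of mine, not anything the EC2 tools provide.

```shell
#!/bin/sh
# Sketch: map a `uname -m` machine name to the architecture value used
# when registering the AMI. ec2_arch is a hypothetical helper name.
ec2_arch() {
    case "$1" in
        i386|i486|i586|i686) echo "i386" ;;
        x86_64|amd64)        echo "x86_64" ;;
        *) echo "unsupported machine type: $1" >&2; return 1 ;;
    esac
}

# Placeholder values; substitute your own snapshot ID, name and description.
SNAP=snap-xxxxxxxx
NAME=debian-5.0-lenny-base-ebs
DESC="Debian 5.0 (lenny) base, EBS-backed"

# On the source instance itself you could then run:
#   ARCH=$(ec2_arch "$(uname -m)")
```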

To track the progress of the AMI creation, you can do the following:

# watch -n 30 './ec2-describe-images --region eu-west-1 ami-xxxxxxxx'

This will execute the ec2-describe-images command for your new AMI every 30 seconds.  You can stop the command once you see that the AMI is in available state.
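If you would rather have the shell tell you when the AMI is ready, a small polling loop does the same job.  This is a sketch: wait_for_available and POLL_SECONDS are names I made up, and the loop simply looks for the word “available” in whatever the given command prints.

```shell
#!/bin/sh
# Sketch: re-run a command until its output contains the word "available".
# wait_for_available and POLL_SECONDS are hypothetical names.

wait_for_available() {
    until "$@" | grep -qw available; do
        sleep "${POLL_SECONDS:-30}"
    done
}

# Usage (requires the EC2 API tools and credentials):
#   wait_for_available ./ec2-describe-images --region eu-west-1 ami-xxxxxxxx
#   echo "AMI is ready"
```

Note that the loop never gives up on its own, so if the registration fails outright you will have to interrupt it by hand.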

Now that you have an EBS-backed AMI, any further customizations you make to this image can be preserved forever by simply right-clicking on the instance [console], and clicking “Create Image (EBS AMI).”

Enjoy!

EC2 filesystem performance

January 26th, 2010

I’ve been posting lately about a tool named bonnie++, which will run a suite of tests against your Linux filesystem to determine metrics in 3 important areas:  data read/write speed, max random seeks, and max metadata operations.  Last time I posted about profiling one of Linode.com’s “Linode 360” instances.  In this article I will profile an m1.small instance on Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2) service.

EC2 was the first legitimate cloud offering to market, and in many contexts it is the most developed, most robust cloud provider.  However, there are many companies quickly ramping up their offerings (GoGrid, Voxel, Flexiscale, etc), if only in one or two datacenters (Voxel is the leader of the group, with locations in NY, Singapore, and Amsterdam).

The m1.small instance comes with the following specifications:

  • 1.7gb RAM
  • 1 EC2 Compute Unit
  • 160gb instance storage
  • “moderate” I/O performance

While these specs are mostly useful in comparison to other EC2 instance sizes, the performance of this particular size will provide a useful benchmark for baselining EC2 performance.  Since this is the smallest instance, I’m assuming “moderate I/O performance” means that it’s as bad as it gets.

From the earlier post, we use the following command to invoke bonnie++:

# bonnie++ -u 0 -r 1700 -s 34000 -n 256 -b -d /

The above command failed to run, claiming that the filesystem was out of space.  Checking the filesystem, I see that I only have /dev/sda1, which is 15gb and mounted at /.  Since real-world testing involves what the customer actually gets and not what the marketing literature says, I adjusted the -s parameter to 3400, which is still double the 1.7gb of memory in my instance and should keep the page cache from absorbing the test.
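The sizing rule I ended up with (test file at least twice the size of RAM, which is also what bonnie++ itself asks for) is easy to script.  The helper names ram_mb and bonnie_args below are my own, and reading /proc/meminfo assumes Linux.

```shell
#!/bin/sh
# Sketch: derive bonnie++ -r and -s values from the machine's memory.
# The 2x factor keeps the test file at twice the size of RAM so the page
# cache cannot absorb the workload. Helper names are hypothetical.

# ram_mb [MEMINFO_FILE]: total memory in MB, read from /proc/meminfo.
ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# bonnie_args RAM_MB: print the -r/-s arguments for that amount of RAM.
bonnie_args() {
    echo "-r $1 -s $(( $1 * 2 ))"
}

bonnie_args 1700

# Usage on a real machine:
#   bonnie++ -u 0 $(bonnie_args "$(ram_mb)") -n 256 -b -d /
```

For the 1700 MB used in this post, bonnie_args prints “-r 1700 -s 3400”, which matches the adjusted invocation.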

Invoking bonnie++ with the new parameter yields the following result:

Server ip-10-226-125-238 was able to

  • sustain ~52 MB/s at 6% CPU for sequential block writes
  • sustain ~64 MB/s at 1% CPU for sequential block reads
  • max out at 939.8 random seeks per second
  • sequentially create 127 files per second
  • randomly create 174 files per second
  • sequentially read metadata from 158,584 files per second at 39% CPU
  • randomly read metadata from 203,851 files per second at 41% CPU
  • sequentially delete 121 files per second
  • randomly delete 158 files per second

As the results from running bonnie++ in various providers pile up, I will compile them into a spreadsheet which will (hopefully) ultimately shed some light on the performance boundaries of various VPS and cloud providers.