What is cloud computing?

What is cloud computing? That’s the question addressed by Matei Zaharia, an assistant professor at MIT, in a session of the online course “Tackling the Challenges of Big Data,” which runs through March 17. At a high level, he said, cloud computing involves computing resources—compute cycles, storage, and software—that are available on demand, with pricing on a pay-as-you-go basis.

Cloud computing offers users several benefits, he said. First, you can get started quickly. Second, you are outsourcing functions such as administration, reliability, disaster recovery, and security. Whereas you might have to spend $100,000 per year for an administrator to handle your 100 servers, Amazon can spend that same amount to administer 10,000 servers. That works out to roughly $1,000 per server per year for you versus $10 for Amazon.

Third, you’ll probably pay less: not only do you pay as you go, you can also benefit from the cloud provider’s economies of scale. Amazon can probably get better prices on servers and storage than you can. Of course, Amazon does want to make a profit, so the company won’t pass on to you all of its savings. But, Zaharia said, even with its margin, Amazon might offer you a better price than you could get on your own for similar resources.

And finally, the cloud offers elasticity: you can acquire large amounts of infrastructure quickly to accommodate peak loads and return it when you’re finished using it. The cloud model supports variable utilization and avoids the risk of under- or overprovisioning. Cloud providers today typically bill at a granularity of one hour. Providers, in turn, can avoid their own under- or overprovisioning by statistically multiplexing the loads of many clients, and they can use spare capacity for internal workloads when client demand is low.
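
To make the provisioning trade-off concrete, here is a minimal Python sketch. The demand profile, the price, and the function names are all hypothetical and do not come from Zaharia’s session; the sketch also assumes the same price per server-hour whether capacity is fixed or elastic, which real pricing may not honor.

```python
# Illustrative comparison of fixed provisioning and pay-as-you-go.
# All figures are assumptions made up for this sketch.

def fixed_cost(demand, capacity, price_cents):
    """Cost of paying for `capacity` servers in every hour, used or not."""
    return capacity * len(demand) * price_cents

def elastic_cost(demand, price_cents):
    """Cost when you pay only for the servers needed in each hour."""
    return sum(demand) * price_cents

def unserved_server_hours(demand, capacity):
    """Server-hours of demand that a fixed capacity cannot satisfy."""
    return sum(max(0, needed - capacity) for needed in demand)

# Hypothetical day: 20 quiet hours needing 10 servers, 4 peak hours needing 100.
demand = [10] * 20 + [100] * 4
PRICE_CENTS = 10  # assumed price per server-hour, in cents

overprovisioned = fixed_cost(demand, max(demand), PRICE_CENTS)  # sized for the peak
pay_as_you_go = elastic_cost(demand, PRICE_CENTS)               # sized hour by hour
shortfall = unserved_server_hours(demand, 10)                   # sized for quiet hours

print(overprovisioned, pay_as_you_go, shortfall)
```

Under these assumptions, sizing for the peak costs four times as much as paying by the hour, while sizing for the quiet hours leaves 360 server-hours of peak demand unserved.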

A drawback of cloud computing, Zaharia said, is that moving large datasets over the Internet is expensive and can take a lot of time. Moving 10 TB of data over a 45-Mb/s T3 line would take about 20 days. An alternative is to ship physical disks (such as with Amazon’s Import/Export service). In addition, cloud companies can help by bringing together public scientific and government datasets and letting their clients access them.
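
The 20-day figure can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes), a link that is fully utilized around the clock, and no protocol overhead; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to push size_bytes through a fully utilized link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY

TEN_TB = 10 * 10**12   # 10 TB in bytes, decimal units assumed
T3_LINE = 45 * 10**6   # 45 Mb/s in bits per second, as quoted in the session

print(f"{transfer_days(TEN_TB, T3_LINE):.1f} days")  # prints "20.6 days"
```

Real transfers would take longer once overhead and link contention are included, which strengthens the case for shipping disks.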

Yet another drawback relates to privacy and security, especially when legislation or industry rules require that you maintain strict control of your data. The healthcare law HIPAA (Health Insurance Portability and Accountability Act) and PCI DSS (Payment Card Industry Data Security Standard) are examples. The issue can be further complicated if your cloud provider is in a different jurisdiction from yours. Approaches to improving security include encryption, key rotation, fine-grained access controls, and two-factor authentication. Research is ongoing on topics such as homomorphic encryption (in which operations on encrypted data yield an encrypted result, with the cloud provider never seeing the unencrypted data), order-preserving encryption (supporting greater-than and less-than comparisons on encrypted data), and search on encrypted data.
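
To give a feel for what “operations on encrypted data” means, the sketch below uses a toy version of the Paillier cryptosystem. Zaharia did not name this scheme, so it is swapped in here purely for illustration. Paillier is only additively homomorphic (it supports addition of hidden values, not arbitrary computation), and the tiny primes make this example completely insecure.

```python
import math
import random

# Toy Paillier parameters. Tiny primes chosen for readability: NOT secure.
P, Q = 17, 19
N = P * Q
N_SQUARED = N * N
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)  # modular inverse of LAMBDA mod N (Python 3.8+)

def encrypt(message):
    """Encrypt an integer in [0, N) using generator g = N + 1."""
    if not 0 <= message < N:
        raise ValueError("message out of range")
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:  # the random blinding factor must be a unit mod N
            break
    # (N + 1)^message mod N^2 equals 1 + message * N
    return ((1 + message * N) * pow(r, N, N_SQUARED)) % N_SQUARED

def decrypt(ciphertext):
    """Recover the plaintext; requires the private values LAMBDA and MU."""
    x = pow(ciphertext, LAMBDA, N_SQUARED)
    return ((x - 1) // N) * MU % N

def add_encrypted(c1, c2):
    """What an untrusted server can do: multiply ciphertexts to add plaintexts."""
    return (c1 * c2) % N_SQUARED

# The client encrypts, the server combines without seeing 20 or 22,
# and only the client can decrypt the sum.
encrypted_sum = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(encrypted_sum))  # prints 42
```

The server-side function never touches `LAMBDA` or `MU`, which is the property that makes this line of research attractive for cloud workloads.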

Availability is also an issue. Most cloud providers distribute operations across regions (location diversity), but what if your cloud provider goes out of business? Zaharia said there is pressure on cloud providers to interoperate well so clients can use multiple providers, and third-party services are emerging to manage this.

And a final drawback is lock-in (both interface lock-in and data lock-in). It can be difficult to migrate from one cloud provider to another. Approaches here include using open-standard APIs or wrappers over proprietary APIs.
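
The wrapper approach might look something like the Python sketch below. Everything in it is hypothetical: `ProviderAClient` and `ProviderBClient` are in-memory stand-ins for two vendors’ proprietary SDKs, not real APIs, and `BlobStore` is an invented provider-neutral interface.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Provider-neutral interface that application code depends on."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class ProviderAClient:
    """Stand-in for one vendor's proprietary SDK (hypothetical)."""
    def __init__(self):
        self._objects = {}
    def upload_object(self, name, payload):
        self._objects[name] = payload
    def download_object(self, name):
        return self._objects[name]

class ProviderBClient:
    """Stand-in for a second vendor with different method names (hypothetical)."""
    def __init__(self):
        self._files = {}
    def write_file(self, path, content):
        self._files[path] = content
    def read_file(self, path):
        return self._files[path]

class ProviderAStore(BlobStore):
    """Wrapper translating the neutral interface into provider A's calls."""
    def __init__(self, client):
        self._client = client
    def put(self, key, data):
        self._client.upload_object(key, data)
    def get(self, key):
        return self._client.download_object(key)

class ProviderBStore(BlobStore):
    """Wrapper translating the neutral interface into provider B's calls."""
    def __init__(self, client):
        self._client = client
    def put(self, key, data):
        self._client.write_file(key, data)
    def get(self, key):
        return self._client.read_file(key)

def migrate(source: BlobStore, destination: BlobStore, keys):
    """Copy objects between providers using only the neutral interface."""
    for key in keys:
        destination.put(key, source.get(key))

old_store = ProviderAStore(ProviderAClient())
new_store = ProviderBStore(ProviderBClient())
old_store.put("report.csv", b"x,y\n1,2\n")
migrate(old_store, new_store, ["report.csv"])
```

A wrapper like this addresses interface lock-in only. Data lock-in remains, because every byte still has to be moved from one provider to the other.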

Zaharia then discussed associativity (sometimes called cost associativity): you’ll generally pay a cloud provider the same for 100 servers running for an hour as you would for one server running 100 hours. If your application lends itself to parallelism, you can take advantage of associativity to get results fast.
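
The arithmetic can be sketched in Python as follows. The price is made up, and the model assumes a perfectly parallel job and billing in whole hours per server; real jobs rarely scale that cleanly.

```python
import math

def parallel_run(total_server_hours, servers, price_cents):
    """Return (wall-clock hours, cost in cents) for a perfectly parallel job."""
    hours_each = total_server_hours / servers   # assumes ideal speedup
    billed_hours_each = math.ceil(hours_each)   # assumes whole-hour billing
    cost = servers * billed_hours_each * price_cents
    return hours_each, cost

PRICE_CENTS = 10  # assumed price per server-hour, in cents

print(parallel_run(100, 1, PRICE_CENTS))    # (100.0, 1000)
print(parallel_run(100, 100, PRICE_CENTS))  # (1.0, 1000)
print(parallel_run(100, 8, PRICE_CENTS))    # (12.5, 1040)
```

The first two runs cost the same, but the second finishes 100 times sooner. The third shows how hourly billing adds a small premium when the work does not divide into whole hours.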

When purchasing cloud services, you can choose the level of abstraction you need. Zaharia cited levels of abstraction as defined by NIST:

Software as a Service (SaaS)—user-facing applications that work as if the software were installed on the user’s local machine. (Tableau Online end-to-end visualization and reporting software is an example.)
Platform as a Service (PaaS)—developer-facing services such as a web application host or database on which you build your own applications. (Amazon’s Relational Database Service, or RDS, is an example.)
Infrastructure as a Service (IaaS)—raw computing resources such as virtual machines, which look like an x86 processor, or virtual disks. (Amazon’s Elastic Compute Cloud, or EC2, is an example.)

Zaharia concluded by noting that in the 1900s, large companies generated their own electricity rather than try to use some kind of public grid. Just as such companies have now moved to buying power from the public grid, so too might they ultimately outsource computations to the cloud.