Achieving computer security is a hard problem because it’s a negative goal, according to MIT Professor Nickolai Zeldovich in the online course “Tackling the Challenges of Big Data,” which runs through March 17. You need to ensure that adversaries can neither obtain confidential data nor corrupt that data, even though they have many possible avenues of attack. They can monitor the connections between an application server and a database server, for example, or exploit a software bug. They may get hold of a disk when a database server is decommissioned. They may compromise the human who administers the database server. A government agency might issue a subpoena.
Encryption holds promise in addressing all these avenues of attack, Zeldovich said. He described a proxy holding a key, positioned between a trusted application server and an untrusted database server, with the proxy sending only encrypted data to the database server and receiving encrypted results in return.
Unfortunately, the database server may still need to operate on that data—running SQL queries, generating aggregate statistics, and selecting particular records. Fortunately, methods exist that allow the database server to perform operations on encrypted data and produce encrypted results without ever seeing the plain text.
Fully homomorphic encryption is a theoretically promising approach, he said, but it is not yet practical—it slows processing down by a factor of about 10⁹. Zeldovich and his colleagues at MIT have developed a more promising scheme called CryptDB.
CryptDB, he said, trades off some generality and security to achieve better performance. Rather than encrypt an entire table, CryptDB encrypts cells individually.
He presented an example of a table representing a company’s employees’ rank, name, and salary, with cell data encrypted using randomized encryption, which provides the highest level of security possible: semantic security. With randomized encryption, however, the database cannot respond to a query asking for every employee making $100,000, because each entry corresponding to $100,000 in the table would be a different encrypted string.
Deterministic encryption overcomes this problem—with deterministic encryption, if you encrypt the same plain-text data twice, you get the same encrypted string. The database can now return all rows for which the salary corresponds to $100,000. The database cannot, however, respond to requests for all rows representing salaries of $100,000 or greater. For that, another encryption scheme called order-preserving encryption (OPE) is necessary. With OPE, the encrypted string representing $100,000, for example, will be a larger number than the encrypted string for $60,000 but a smaller number than the encrypted string for $200,000.
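The contrast between randomized and deterministic encryption can be sketched with toy stand-ins (the keyed-PRF and XOR-stream constructions below are illustrative assumptions, not CryptDB’s actual ciphers):

```python
import hashlib
import hmac
import os

KEY = b"demo-key"  # hypothetical key, for illustration only

def randomized_encrypt(plaintext: bytes) -> bytes:
    # A fresh random nonce makes every ciphertext distinct, so equal
    # plaintexts are indistinguishable to the server (semantic security).
    nonce = os.urandom(16)
    stream = hashlib.sha256(KEY + nonce).digest()[: len(plaintext)]
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))

def deterministic_encrypt(plaintext: bytes) -> bytes:
    # A keyed PRF (here HMAC) maps the same plaintext to the same
    # ciphertext every time, so the server can match rows for equality.
    return hmac.new(KEY, plaintext, hashlib.sha256).digest()

salary = b"100000"
# Randomized: two encryptions of $100,000 don't match, so no equality queries.
assert randomized_encrypt(salary) != randomized_encrypt(salary)
# Deterministic: they do match, so "WHERE salary = E(100000)" works.
assert deterministic_encrypt(salary) == deterministic_encrypt(salary)
```

Note the trade-off the article describes: the deterministic scheme leaks which rows share a value, which is exactly what makes equality queries possible.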
To maximize security within the OPE scheme, Zeldovich said, the scheme should offer indistinguishability—that is, an adversary should not be able to distinguish the encrypted values of 1, 3, 2 from the encrypted values of any other values with the same order, such as 3, 999, 4. A construction called mutable OPE, or mOPE, involving the establishment of a binary search tree in the untrusted server with the help of the client, helps ensure this property.
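The mOPE idea—an order code derived from a value’s position in a server-side binary search tree—can be sketched as follows. This is a simplified single-machine toy: in the real protocol the server holds only ciphertexts and the client answers each comparison, and encodings are updated (“mutated”) when the tree is rebalanced.

```python
class Node:
    def __init__(self, value):
        self.value = value          # stands in for a ciphertext
        self.left = self.right = None

class MutableOPE:
    """Toy mOPE: the OPE encoding of a value is its path in a BST."""

    def __init__(self):
        self.root = None

    def insert(self, value):
        if self.root is None:
            self.root = Node(value)
            return
        node = self.root
        while True:
            # In real mOPE, the *client* decrypts and makes this comparison.
            if value < node.value:
                if node.left is None:
                    node.left = Node(value)
                    return
                node = node.left
            else:
                if node.right is None:
                    node.right = Node(value)
                    return
                node = node.right

    def encode(self, value):
        # Path bits (0 = left, 1 = right), padded to a fixed width so the
        # resulting integers compare in the same order as the plaintexts.
        bits, node = "", self.root
        while node.value != value:
            if value < node.value:
                bits, node = bits + "0", node.left
            else:
                bits, node = bits + "1", node.right
        return int((bits + "1").ljust(16, "0"), 2)

tree = MutableOPE()
for v in (3, 999, 4):
    tree.insert(v)
# The encodings reveal only the order 3 < 4 < 999, nothing about magnitudes.
assert tree.encode(3) < tree.encode(4) < tree.encode(999)
```

Because the encoding depends only on tree position, the values 1, 3, 2 and 3, 999, 4 would produce identical sets of codes—the indistinguishability property described above.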
Zeldovich cited other encryption schemes, including ones that permit keyword searches and homomorphic schemes that allow some mathematical operations—addition or multiplication, for instance, but not both. Problems arise, though, with a query to return all records with (a + b) > c. Paillier encryption would permit the operation a + b, but it’s not order-preserving, so the comparison with c would not be possible.
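Paillier’s additive property—multiplying two ciphertexts yields an encryption of the sum of the plaintexts—can be demonstrated with a textbook implementation. The tiny primes below are an assumption for readability; they make the scheme completely insecure and are for illustration only.

```python
import random
from math import gcd

# Textbook Paillier keypair with tiny (insecure) primes.
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1

def L(u: int) -> int:
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # requires Python 3.8+

def encrypt(m: int) -> int:
    # Randomized: a fresh r coprime to n blinds each ciphertext.
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return (L(pow(c, lam, n2)) * mu) % n

a = encrypt(60000)
b = encrypt(40000)
# Multiplying ciphertexts adds plaintexts: D(E(a) * E(b)) = a + b.
assert decrypt((a * b) % n2) == 100000
```

The server can thus compute SUM(salary) over encrypted salaries without the key—but, as the article notes, the ciphertexts carry no order information, so it cannot then compare that sum against an encrypted threshold.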
In addition, he said, the more functional an encryption scheme is—that is, the more types of operations it permits—the less secure it tends to be. And further, when developing an encryption scheme, you may not know a priori what types of operations your database might be called on to perform on your encrypted data.
What Zeldovich called “onions of encryption” can be used to balance security and functionality. At the center of the onion resides your data encrypted with a relatively weak scheme, such as order-preserving encryption, under key k1. The OPE-encrypted value might then be further encrypted with deterministic encryption under key k2. And finally, the deterministically encrypted value could be wrapped in a randomized, semantically secure encryption scheme under key k3.
Then, your semantically secure data can be sent to the database server without risk. Should the server be required to find all salaries equaling $100,000, for instance, the client can submit k3, peeling off one layer of the onion and exposing the deterministically encrypted value. And should a comparison be required, the client can submit k2 to expose the order-preserving encrypted layer. An administrator can set thresholds—for example, not permitting exposure of the OPE layer for sensitive data like social security numbers. Operations requiring OPE would return an error message, and the query would have to be run on the trusted client.
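The layering and peeling described above can be sketched with a toy onion. The XOR-stream “cipher” here is a self-inverse stand-in chosen purely for brevity—real CryptDB uses proper ciphers for each layer—but the key-release mechanics are the same: handing the server a key exposes exactly one more layer.

```python
import hashlib

def layer(key: bytes, data: bytes) -> bytes:
    # XOR with a key-derived stream; applying the same layer twice undoes it.
    stream = hashlib.sha256(key).digest()[: len(data)]
    return bytes(d ^ s for d, s in zip(data, stream))

k1, k2, k3 = b"ope-key", b"det-key", b"rnd-key"  # hypothetical layer keys

inner = layer(k1, b"100000")   # innermost: OPE-encrypted salary
middle = layer(k2, inner)      # wrapped in a deterministic layer
onion = layer(k3, middle)      # outermost: randomized, semantically secure

# The client releases k3 so the server can run equality queries...
assert layer(k3, onion) == middle
# ...and releases k2 as well if a range comparison is required.
assert layer(k2, layer(k3, onion)) == inner
```

The server never receives k1, so even with both outer layers peeled it sees only order-preserving ciphertexts, never the plain-text salary.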
Zeldovich and his colleagues used a prototype of CryptDB to try out these concepts in a real-world environment. They found that most data never needed to be peeled below the semantically secure layer, and that most data that did need to be exposed down to the OPE layer turned out not to be very sensitive, consisting of items such as time stamps and message IDs. The performance penalties of CryptDB, he said, were quite modest.
He also discussed partitioning query execution between the untrusted server and the trusted proxy, and noted that encryption can be pushed all the way to a user’s browser. “So together, all of these techniques should help you protect confidential data in a large data system from a wide range of attacks,” he concluded.
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
Read these other posts on big-data topics: