
More data than you can imagine

It is doubtful whether patients are benefitting to the full degree they could from the increasing use of computers in medicine. To some extent, this is because the role and scale of medically related computing continue to evolve.

This is easily seen in the HealthIT.gov definition of meaningful use as it applies to electronic health records: “Meaningful use is using certified electronic health record (EHR) technology to improve quality, safety, [and] efficiency and reduce health disparities; engage patients and family; improve care coordination and population and public health; [and] maintain privacy and security of patient health information. Ultimately, it is hoped that the meaningful use compliance will result in better clinical outcomes, improved population health outcomes, increased transparency and efficiency, empowered individuals, [and] more robust research data on health systems.”

Figure: More complete care promised by EHR adoption. (Courtesy of HealthCatalyst)4

To benefit from the Medicare and/or Medicaid incentive programs that have been established to encourage meaningful use, eligible professionals and hospitals must meet a series of objectives. As the government’s health IT website explained, these objectives are evolving in three stages over five years: Stage 1 (2011-2012), Data capture and sharing; Stage 2 (2014), Advance clinical processes; and Stage 3 (2016), Improved outcomes.

Some of the hoped-for improved outcomes are related to adoption of best practices, which is being encouraged by changing Medicare reimbursement from the traditional fee-for-service model to a value-based one. In a recent article, Health and Human Services (HHS) Secretary Sylvia Burwell was quoted as saying that by 2018, 90% of traditional Medicare payments should be transformed into value-based reimbursement. Burwell said, “Whether you are a patient, a provider, a business, a health plan, or a taxpayer, it is in our common interest to build a healthcare system that delivers better care, spends healthcare dollars more wisely, and results in healthier people…. We believe these goals can drive transformative change, help us manage and track progress, and create accountability for measurable improvement.”1

Burwell also stressed the role of incentives, care quality improvement tactics, and health IT and data analytics adoption in accomplishing the HHS goals. She explained, “We are dedicated to using incentives for higher-value care, fostering greater integration and coordination of care and attention to population health, and providing access to information that can enable clinicians and patients to make better-informed choices.”1

Big and not-so-big data

At the level of an individual doctor or hospital, attempting to demonstrate meaningful use can be complicated by incompatibilities among existing computer systems and software. Taking a much broader view, HHS needs high EHR adoption before trends or exceptions that relate to the wider population can be determined accurately. Clearly, the scales of the two undertakings are very different.

According to one report, “Data from the U.S. healthcare system alone reached, in 2011, 150 exabytes [150 E18]. Kaiser Permanente, the California-based health network, which has more than nine million members, is believed to have between 26.5 and 44 petabytes [26.5 to 44 E15] of potentially rich data from EHRs, including images and annotations.”2

These comments primarily relate to the volume aspect of big data; velocity and variety are also often cited as characteristics. As the U.S. healthcare industry adopts EHRs, the volume of new data will ramp up quickly. In addition, increased use of modern imaging technologies is significantly affecting data variety as well as volume.

A General Electric white paper proposed adoption of vendor-neutral archives for images as a way to improve workflow and information sharing. The paper stated, “The growth in imaging volume comes not only from the expected areas of care, such as radiology and cardiology, but also from disciplines that are beginning to integrate digital imaging into their care protocols, such as dermatology, hematology, pathology, and ophthalmology. Each of these ‘ologies’ introduces its own set of storage and management requirements and frequently incorporates different technology vendors, formats, and standards.”3

A HealthCatalyst column attempted to put big and not-so-big data into perspective. It noted that, according to Hortonworks CEO Eric Baldeschwieler, “Yahoo! has 42,000 nodes in several different Hadoop clusters with a combined capacity of about 200 petabytes [200 E15].” In contrast, the author stated, “Most healthcare providers don’t have big data. A hospital CIO I know plans for future storage growth by estimating 100 MB of data generated per patient, per year. A large 600-bed hospital can keep a 20-year data history in a couple hundred terabytes [200 E12].”4
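
A quick back-of-the-envelope check shows how that estimate scales. Only the 100 MB per patient per year and the 20-year retention period come from the quote above; the assumed 100,000 patient encounters per year is an illustrative figure, not one given in the column.

```python
# Rough EHR storage estimate (minimal sketch; the encounter volume is an assumed figure).
MB_PER_PATIENT_YEAR = 100        # planning figure quoted above
PATIENTS_PER_YEAR = 100_000      # assumed annual patient volume for a large 600-bed hospital
YEARS_RETAINED = 20              # retention period quoted above

total_mb = MB_PER_PATIENT_YEAR * PATIENTS_PER_YEAR * YEARS_RETAINED
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

print(f"Estimated archive size: {total_tb:.0f} TB")  # -> 200 TB, "a couple hundred terabytes"
```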

Computing evolution

Accessing and analyzing truly huge amounts of data of various types are challenging for traditional relational databases. In a Cloudera white paper, founder and CEO Mike Olson discussed the problem. He said, “… the variety and complexity of available data are exploding. The simple, structured data managed by legacy relational database management system (RDBMS) offerings still exists but is generally dwarfed by the volume of log, text, imagery, video, audio, sensor readings, scientific modeling output, and other complex data types streaming into data centers today.”5

Google faced a similar database constraint around 2000 that prompted development of MapReduce, “a programming model and an associated implementation for processing and generating large data sets,” as described by the company’s Dean and Ghemawat in a 2004 paper.6 MapReduce uses a divide-and-conquer approach that first splits the input files into M manageable 16-MB to 64-MB pieces. Each piece is copied to several networked computers with associated local disk drives.

A master computer assigns the M map tasks to idle computers within the cluster and monitors their progress. Each map task outputs intermediate key/value pairs from the data (for example, the number of times a certain word appears). These intermediate results are combined by R reduce tasks, also assigned and managed by the master. When all of the reduce outputs are complete, the master returns control to the user program, which then has the MapReduce results available in the output files.
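
A minimal single-machine sketch of that key/value flow, using the word-count example mentioned above, looks like the following. It illustrates only the map, shuffle, and reduce roles, not Google’s distributed implementation or its fault handling.

```python
from collections import defaultdict

# Map phase: each map task emits intermediate (key, value) pairs from its input piece.
def map_task(text_piece):
    for word in text_piece.split():
        yield (word.lower(), 1)

# Shuffle: group intermediate values by key before handing them to the reduce tasks.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each reduce task combines all values for one key into a final result.
def reduce_task(key, values):
    return key, sum(values)

pieces = ["the quick brown fox", "the lazy dog and the fox"]  # stand-ins for 16-MB to 64-MB splits
intermediate = [pair for piece in pieces for pair in map_task(piece)]
counts = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}
```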

Because each of the M data pieces is typically copied to three computers, the master can accommodate hardware failures by reassigning tasks. An underlying theme of MapReduce is to implement massively parallel computation on large numbers of low-cost commodity computers. Hadoop is a separate Apache Software Foundation project modeled on the MapReduce approach. Several vendors, including Cloudera, offer auxiliary capabilities on top of the free Hadoop core.

Cloudera’s Olson added, “More data in and of itself generates higher quality information. If a person can access the data, patterns emerge that are not apparent with less data. In a world where information separates winners from losers, successful organizations will distinguish themselves by how they work with data. The need to capture, process, and analyze information at scale, quickly, is a defining characteristic of the post-9/11 world.”5

Of course, even given Hadoop’s very rapid rise in popularity, traditional database systems will not disappear soon. Michael Stonebraker, a long-time database company veteran, along with six colleagues from business and academia, undertook a comparison of Hadoop and two parallel SQL database management systems, Vertica and an unnamed DBMS-X product from a major relational database vendor.7

In general, Hadoop was much easier to set up and run than the databases. On the other hand, the databases both performed at least two times faster than Hadoop when averaged across five tasks using a 100-node cluster. With Hadoop, the authors did not have to construct a schema or register user-defined functions, both of which were required for the databases. However, when the benchmark was extended and new columns were added to the data set, they had to modify and retest the MapReduce code to ensure it still worked correctly.

The paper concluded, “… there is a lot to learn from both kinds of systems. Most importantly is that higher level interfaces, such as Pig [and] Hive, are being put on top of the MR [MapReduce] foundation, and a number of tools similar in spirit but more expressive than MR are being developed, such as Dryad and Scope…. Hence, the APIs of the two classes of systems are clearly moving toward each other. Early evidence of this is seen in the solutions for integrating SQL with MR offered by Greenplum and Asterdata.”

Further qualifying approaches to big-data problems, Cloudera’s Olson said, “… algorithms must run well on the shared-nothing distributed infrastructure of Hadoop. If a processing job needs to communicate extensively among servers in the cluster, MapReduce is often a poor platform choice. At least in the map phase of any job, the best algorithms are able to examine single records, or a small number of adjacent records stored on the same DataNode, to compute a result.”5
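
As a hypothetical illustration of Olson’s point, consider the two access patterns sketched below. The first examines each record in isolation, so it maps cleanly onto data spread across many nodes; the second needs every record paired with every other record and would force heavy communication among servers. The record layout and size threshold are assumptions made for the sake of the example.

```python
SIZE_THRESHOLD_MB = 500  # assumed cutoff for an "unusually large" imaging study

# Good MapReduce fit: each record (patient_id, modality, study_size_mb) is examined
# on its own, wherever it happens to be stored, and emits key/value counts.
def map_flag_large_studies(record):
    patient_id, modality, size_mb = record
    if size_mb > SIZE_THRESHOLD_MB:
        yield (modality, 1)

# Poor fit: an all-pairs comparison needs every record visible to every worker,
# which on a shared-nothing cluster means shuffling most of the data over the network.
def all_pairs(records):
    return [(a, b) for i, a in enumerate(records) for b in records[i + 1:]]
```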

For situations that can benefit from MapReduce, Google’s Dean and Ghemawat cited three observations they made after many successful applications of the technique. “First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Second, network bandwidth is a scarce resource. Locality optimization allows us to read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth. Third, redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.”6

Big data and genome sequencing

In the medical research area, the U.K.’s Wellcome Trust Sanger Institute plays a prominent role in DNA sequencing. Early DNA sequencing work used so-called long reads consisting of thousands of base pairs; around 2008, it was realized that using many more, but much shorter, reads could speed up the process.8

Millions of images are generated by each of the genome analyzers. In 2009, Sanger had at least 35 analyzers producing about 100 terabits of data per week. According to Wood and Blackburne, “Each sequencing instrument is attached to a PC, which controls the instrument and provides a temporary staging area for the image data streaming from the machine. The data is moved from the attached instrument PC through a 10-Gb pipe to a larger storage array: 400 Tb of Lustre-managed EVA storage. This is networked to a 1,000-node cluster, which performs the primary analysis and image alignment duties on raw images. …The sequence data is then passed through quality control steps, which again run on the sequencing analysis cluster, and check for low sequencing yield, high levels of unknown bases, or low complexity sequence, all of which are telltale signs for sequencing errors.”8
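
The per-read checks described above can be approximated with a few simple tests. The sketch below is only illustrative: the thresholds are assumptions rather than the Sanger Institute’s actual quality-control cutoffs, and the sequencing-yield check (which applies per instrument run rather than per read) is omitted.

```python
MAX_UNKNOWN_FRACTION = 0.05  # assumed limit on 'N' (unknown) bases per read
MIN_DISTINCT_BASES = 3       # crude low-complexity test: too few distinct bases

def passes_qc(read: str) -> bool:
    """Return True if a short read passes the illustrative quality checks."""
    read = read.upper()
    if not read:
        return False
    if read.count("N") / len(read) > MAX_UNKNOWN_FRACTION:
        return False  # high level of unknown bases
    if len(set(read) - {"N"}) < MIN_DISTINCT_BASES:
        return False  # low-complexity sequence
    return True

reads = ["ACGTTGCA", "ANNNNNNA", "AAAAAAAA"]
print([r for r in reads if passes_qc(r)])  # ['ACGTTGCA']
```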

For many of Sanger’s projects, the results are available for free. As Wood and Blackburne concluded, “This represents a fantastic resource for biomedical researchers and continues the best traditions of free and open data access of the original human genome project.” The Sanger website notes that today the Institute’s data center shares nearly 10,000 “cores of compute predominantly in blade format and approximately 10 petabytes [10 E15] of raw storage capacity” with the neighboring European Bioinformatics Institute.

References

  1. Bresnick, J., “90% of Medicare Will Be Value-Based Reimbursement by 2018,” HealthIT Analytics, Jan 27, 2015.
  2. Raghupathi, W., and Raghupathi, V., “Big data analytics in healthcare: promise and potential,” Health Information Science and Systems, 2014.
  3. Sharma, P., and White, L., “How does your image management approach stack up?” GE Healthcare, 2012.
  4. Crapo, J., “Hadoop in Healthcare: a No-nonsense Q and A,” HealthCatalyst.
  5. Olson, M., “HADOOP: Scalable, Flexible Data Storage and Analysis,” IQT QUARTERLY, Spring 2010, Vol. 1, No. 3, pp. 14-18.
  6. Dean, J., and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters,” OSDI’04: Sixth Symposium on Operating System Design and Implementation, December 2004.
  7. Pavlo, A., et al., “A Comparison of Approaches to Large-Scale Data Analysis,” ACM SIGMOD/PODS Conference, June/July 2009.
  8. Hammerbacher, J., and Segaran, T. eds., Beautiful Data, O’Reilly Media, 2009, pp. 243-258.
