Machine learning can help build predictive models for medical analytics, according to MIT Professor John Guttag, who draws on data from sources ranging from signals (ECG, for example) to biomarkers and video-based monitoring.
The good news is that the capacity to gather medically significant data is growing quickly, he said during a lecture in the online course “Tackling the Challenges of Big Data.” Better instruments, such as MRI machines and ambulatory monitors, are generating more data per patient, and growing storage capacity allows that data to be saved. In addition, aggregated databases of medical information are becoming available.
The bad news, he said, is that clinicians are facing an onslaught of new medical data and can’t keep up. They are using analytical techniques that don’t scale and that are designed to test hypotheses rather than uncover new knowledge. And clinical studies are time-consuming and prohibitively expensive.
A deterministic bottom-up approach might be used to derive answers—beginning with molecules and cells and working upwards through organ and system biology. It’s good that people are working on biology, Guttag said, but in our lifetimes answers will be derived from empirical data. He cited in particular “found data”—taking data from sources ranging from clinical studies to hospital records that someone else acquired and putting it to new use.
He summed up the big data challenge: it stems from tens of millions of patients, with incomplete, ambiguous, and often incorrect data in multiple modalities (signals, lab results, images, and natural language). One saving grace is that human physiology and medical practice change slowly, he said, so what we learn can be of long-lasting value.
Guttag then introduced the concept of a “computationally generated biomarker,” which he defined as a biomarker (a characteristic that can be objectively measured) generated by applying computation to medical data. He applied the concept to cardiovascular risk stratification for patients who have experienced an acute coronary syndrome, 15% to 20% of whom can be expected to suffer a cardiac-related death within four years.
One treatment is the implantable cardiac defibrillator (ICD). He asked, do ICDs save lives? He presented data showing that they do reduce cardiac-related deaths, but in the cases he studied they seemed to increase deaths from other causes; on average, they don’t appear to help. In fact, he said, 90% of defibrillators that get implanted are never energized.
He emphasized that this doesn’t mean ICDs shouldn’t be used; what’s needed is a better way to identify the patients for whom they would be most effective. One approach involves morphological variability, in which a machine examines ECG data at a level of detail a cardiologist couldn’t discern visually. Experiments on a repurposed dataset showed that patients whose ECGs exhibited high morphological variability were at much higher risk of death within a year, indicating that the technique is a good predictor.
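The lecture doesn’t spell out the computation, but the idea can be sketched: quantify how much each heartbeat’s shape differs from the next, then summarize that difference series into a single score. The sketch below is a minimal, hypothetical Python version; it assumes beats have already been segmented and resampled to equal length, and it substitutes Euclidean distance and variance for the more sophisticated machinery (such as dynamic time warping) used in the published work.

```python
import numpy as np

def morph_distance(beat_a, beat_b):
    # Euclidean distance between two aligned, equal-length beats.
    # (The published technique used dynamic time warping; Euclidean
    # distance keeps this sketch short.)
    return np.linalg.norm(beat_a - beat_b)

def morphological_variability(beats):
    # Summarize the beat-to-beat shape-difference series. `beats` is
    # an (n_beats, samples_per_beat) array, assumed to be segmented
    # and resampled from a long ECG recording beforehand.
    dists = np.array([morph_distance(beats[i], beats[i + 1])
                      for i in range(len(beats) - 1)])
    return dists.var()

# Stand-in random data; real input would be actual ECG beats.
rng = np.random.default_rng(0)
beats = rng.standard_normal((500, 128))
print(morphological_variability(beats))
```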
He then turned his attention to using machine learning to develop a hospital-specific data-driven method to predict the stubbornly prevalent healthcare-associated infections (HAIs), but he took a brief detour to define machine learning as the process of “automating automation…getting computers to program themselves.” With traditional programming, he said, we feed a program and some data into a computer and get an output. With machine learning, we feed data and an expected output into a computer and get a program.
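That contrast fits in a few lines of code. The sketch below is illustrative, not from the lecture (scikit-learn and every name in it are assumptions): the first function is a program a human wrote; the fitted model is the “program” the computer derives from data and expected outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Traditional programming: a human writes the rule mapping data to output.
def handwritten_rule(age, lab_value):
    return int(age > 65 and lab_value > 4.0)  # made-up thresholds

# Machine learning: feed data plus expected outputs into the computer
# and get the "program" (a fitted model) back.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                  # data
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # expected output
learned_program = LogisticRegression().fit(X, y)
print(learned_program.predict(X[:5]))          # apply the learned rule
```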
Applying machine learning to the problem of HAIs, he said, involves using data-driven models to study the temporal aspects of the problem and to use “transfer learning” to incorporate specific features from one hospital while leveraging data from other hospitals.
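The lecture doesn’t name the transfer-learning method. One simple scheme consistent with the description is feature augmentation: each feature gets a shared copy plus a copy that is active only for the target hospital, so one model can weight signal that generalizes across hospitals separately from signal specific to ours. A hedged sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augment(X, is_target):
    # Shared copy of each feature, plus a copy that is nonzero only
    # for rows from the target hospital. A linear model can then put
    # weight on cross-hospital signal, hospital-specific signal, or both.
    return np.hstack([X, X * is_target[:, None]])

rng = np.random.default_rng(2)
X_other = rng.normal(size=(1000, 5))  # other hospitals: plentiful data
y_other = (X_other[:, 0] > 0).astype(int)
X_ours = rng.normal(size=(100, 5))    # target hospital: scarce data
y_ours = (X_ours[:, 0] + X_ours[:, 1] > 0).astype(int)

X = np.vstack([X_other, X_ours])
is_target = np.r_[np.zeros(len(X_other)), np.ones(len(X_ours))]
y = np.r_[y_other, y_ours]

model = LogisticRegression().fit(augment(X, is_target), y)
```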
He presented an example of predicting the probability of a patient acquiring a C. diff infection by calculating risk from an evolving risk profile, not just static data collected at the time of admission. Variables considered included medications taken at the hospital, patient locations within the hospital throughout the stay, procedures performed, lab results obtained, and hospital staff members encountered. The temporal approach outperforms a snapshot approach at predicting HAIs and limits false positives.
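To make the contrast with an admission-time snapshot concrete, here is a hedged sketch of an evolving risk profile: train a model on per-patient-day feature rows, then re-score the patient every day as the stay unfolds. The feature names and data are illustrative, not from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# One row per patient-day. Columns (illustrative): on_antibiotics,
# in_high_risk_ward, procedures_so_far, abnormal_lab, staff_contacts.
X_days = rng.normal(size=(5000, 5))
true_w = np.array([1.0, 0.8, 0.4, 0.6, 0.3])
y_days = (X_days @ true_w > 1.5).astype(int)  # label: later C. diff infection

model = LogisticRegression().fit(X_days, y_days)

def daily_risk(stay_rows):
    # Re-score the patient each day as new medications, locations,
    # procedures, labs, and staff contacts accumulate, rather than
    # scoring once from admission data.
    return model.predict_proba(stay_rows)[:, 1]

one_stay = rng.normal(size=(7, 5))  # a seven-day stay
print(np.round(daily_risk(one_stay), 3))
```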
He then cited a variety of fields that data analysis has transformed—including genetics, finance, and sports. He concluded by saying, “Well, it’s now time for analytics to transform medicine. I believe firmly that over the next decade or so, computer scientists will do more to change medicine than anybody else on earth.”
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
- User interfaces can clarify, or obscure, big data’s meaning
- It’s a small world, but with big data
- Sampling, streaming cut big data down to size
- Coresets offer a path to understanding big data
- Machines learn from experience