'Big Data' Big Dreams

Dr. Olivier Elemento

New methods of processing massive amounts of information offer powerful weapons against threats ranging from cancer to bioterror

By Heather Salerno

Could the key to curing cancer be hidden deep within the disease itself?

That's a theory proposed by Olivier Elemento, PhD, an associate professor of computational genomics in the Department of Physiology and Biophysics who heads the Laboratory of Cancer Systems Biology in the Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine at Weill Cornell. His idea: if doctors can identify the specific genomic mutations that made a patient's normal cells become cancerous ones, then perhaps they can devise more effective treatments. "The genome is essentially the blueprint of cells; it contains the instructions, the software code, that makes each cell run," says Elemento. "Thus, cancer cells have an altered blueprint. We want to know how the software code evolves and changes in these cells."

But translating that code is no easy task. Each human genome contains 3.4 billion nucleotides, written out in the language of DNA. If all that information were printed in telephone books, they'd stack nearly to the top of the Washington Monument.

It's a prime example of "big data," a term used to describe information so large and complex that it's difficult to parse with standard software tools. And while the ability to query and examine massive datasets has already revolutionized the business world (Google and Facebook are just two of the major players who profit from customer analytics), big data is now poised to transform the public health sector. The ability to aggregate reams of digital material, along with technological advances that can make sense of it all, is already altering the way physicians and scientists at Weill Cornell are conducting research.

For Christopher Mason, PhD, assistant professor of computational genomics in the Department of Physiology and Biophysics and the Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, the impact of big data is "nothing short of revolutionary." He stresses the importance of collaboration, since interpreting such extensive data often requires more resources than any one institution can marshal alone. "In only four or five short years," he says, "we've gone from many isolated islands of data to a global effort of interconnected patient and clinical data that lets you do new science really fast." One such effort that Weill Cornell has joined is the 150-member Global Alliance for Genomics and Health, a recently established coalition of healthcare providers, research institutions, and disease advocacy groups dedicated to the secure sharing of genomic information. "More data," Mason says, "is more power."

A newly launched clinical research network in New York City, led by Rainu Kaushal, MD, chair of the Department of Healthcare Policy and Research and the Frances and John L. Loeb Professor of Medical Informatics, will share patient information among participating organizations with patients' permission, a move that could advance medical breakthroughs and foster a more personalized approach to patient care. "As we become more wired in healthcare (which, of course, has lagged behind other industries significantly), having access to this data opens up an entire laboratory of information that we didn't have previously," she says.

In Elemento's lab, researchers are using big data techniques to analyze the genomes of tumors from patients with blood, brain, prostate, and other cancers. Using supercomputers, custom software, and algorithms they designed for this purpose, they comb the data for patterns of mutation. The hope is that identifying such anomalies may point the way toward new therapies that target weaknesses in cancer cells. The process has numerous advantages over conventional research methods, including increased efficiency; the data has allowed Elemento's team to build computer models that test the efficacy of various drugs before any are actually administered. "There are potentially hundreds of thousands of ways to combine drugs, at different dosages, and it's hard to simply guess which drug combinations will work," says Elemento. "We think our computer models of tumors will be increasingly used to predict the best ways to treat each patient."

One of the problems with big data is that it's, well...big. To give a sense of its scope: within eighteen months, Elemento and his colleagues filled a data storage system that could accommodate 300 terabytes of information. Twenty years of observations by NASA's Hubble Space Telescope, by comparison, produced only about forty-five terabytes of data. "Technology is not yet such that you can sequence the genome in one run," explains Elemento. "You have to sequence a genome multiple times (about 100 times, in fact) to get an accurate representation." (Recognizing the data-storage needs at Weill Cornell, the NIH has awarded a grant to establish a petabyte-scale storage facility in the Institute for Computational Biomedicine. A petabyte is 10^15 bytes of digital information.)

There are other complications, too. Traditionally, scientists form a hypothesis, conduct an experiment, and evaluate the evidence. Big data often works in reverse, with investigators collecting piles of information first and then looking for correlations. With such huge datasets, there is a risk of false or biased findings. "Accumulating data is wonderful, but if you don't have a rigorous framework in which queries and explanations can fit, then it's just data," says Harel Weinstein, DSc, the Maxwell M. Upson Professor of Physiology, chair of the Department of Physiology and Biophysics, and director of the Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine. "If there is no valid conceptual framework, it's not scientific."

One high-profile case of big data getting it wrong is the Google Flu Trends (GFT) project, once touted as a mega-analysis success story. The tool earned high praise for using Internet search queries to monitor flu-related activity, putting Google's real-time updates ahead of findings by organizations like the Centers for Disease Control and Prevention, which bases its estimates on reports from labs nationwide. But as it turned out, in some instances GFT badly miscalculated. Soon after it inferred that nearly 11 percent of the country's population had the flu at its peak in January 2013, Nature reported that Google's algorithms had significantly overstated infection rates. Research published recently in Science points out other flaws in the flu-tracking system, with the authors concluding that "'big data hubris' is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis." They acknowledge that big data offers enormous possibilities for new insights, but only if combined with other types of information gathering.

The bottom line for many experts? Big data is a highly promising strategy, but context must always be applied when crunching numbers. Says Kaushal: "I think it can be — and will be — transformative in the way we think about health."

New York City Clinical Data Research Network

Earlier this year, Weill Cornell received a $7 million contract for an ambitious initiative with the potential to improve healthcare in the nation's largest, most diverse city. Led by Rainu Kaushal, the New York City Clinical Data Research Network (NYC-CDRN) will create a system that supports data sharing for patients across New York City and facilitates recruitment of patients for clinical trials. "We will better be able to understand, in a comprehensive and longitudinal way, a patient's medical experiences," says Kaushal.

Funding for the network came from the Patient-Centered Outcomes Research Institute (PCORI), an NGO supporting investigations that help people make informed healthcare choices. Over the course of the eighteen-month contract, NYC-CDRN expects to link records for at least a million patients, whose identities will be protected. The consortium — made up of twenty-two regional entities, from Weill Cornell to the New York Genome Center to the new Cornell Tech campus — will initially focus on patients with diabetes, obesity, and cystic fibrosis.

According to Kaushal, one goal is to give patients and providers speedy access to information they can use in making medical decisions. For example, she says, "Every diabetic is not created equal. Some may benefit from drug A, some from drug B." This kind of shared database could also issue automatic notifications that pop up onscreen when a physician accesses electronic health records; such alerts could quickly spread the word about new protocols for certain conditions, as well as a patient's eligibility for research studies. "If we can do this in New York City," Kaushal says, "we will have created a prototype for almost any city across the country."

EMcounter

The Kumbh Mela, which draws tens of millions of religious pilgrims, is the world's largest human gathering. Photo credit: Dr. Satchit Balsari

In 2006, when three residents at NewYork-Presbyterian were chatting about their experiences at overseas medical facilities, they all mentioned the same thing: that many developing nations were teaching emergency medicine using a U.S.-based curriculum that didn't reflect the predominant illnesses in those regions. "There was definitely a difference in the chief complaints that came into emergency rooms in other countries," says Satchit Balsari, MD, a native of Mumbai and assistant professor of medicine and of healthcare policy and research who attended medical school in India. "Chest pain was not necessarily the number one complaint, as it sometimes is here. There were a lot more infectious diseases, so there were more people complaining about things like fever."

So Balsari and his co-residents, Dave Anthony, MD, and Dean Straff, MD (all now attending emergency physicians on the faculty at NYP/Weill Cornell), designed a user-friendly system, dubbed EMcounter, that allowed staffers at a hospital in Chennai, India, to digitally record patient information. Analysis of the data recorded over the course of a year spurred changes at that hospital; for example, after the program pinpointed the busiest times of day, administrators added more staff to better meet patient needs. A follow-up program at a South Sudan hospital in 2012 was equally successful, and last year the tool was used at the Kumbh Mela in India, the world's largest human gathering. The Mela is a religious festival drawing tens of millions of pilgrims, who journey every twelve years to bathe in the sacred waters of the Ganges and Yamuna rivers.

Dr. Satchit Balsari (second from right), with researchers in India. Photo credit: Dr. Satchit Balsari

Some thirty researchers, including Balsari, spent more than a month in Allahabad, India, to gather data from the huge, temporary city that hosts the Mela. Balsari's team wanted to see if EMcounter could flag potential epidemics at such a large gathering. The challenge was that the Mela population could fluctuate by millions every few days, making it difficult to tell whether an increase in sickness was simply because of new visitors. But Balsari figured that by studying the data in real time, they'd notice if a single disease rose out of proportion to others. "We said if we can do it in complete chaos, in the world's largest mass gathering, and still make it work, then the system would be pretty robust for other extreme situations," Balsari says.
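To make the "out of proportion" idea concrete, here is a minimal sketch in Python. It is not the EMcounter software: the function names, thresholds, and visit counts are all invented for illustration. It compares each complaint's share of total visits against a baseline, so a surge in crowd size alone does not raise an alarm.

```python
from collections import Counter


def complaint_shares(counts):
    """Convert raw visit counts per complaint into fractions of all visits."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {complaint: n / total for complaint, n in counts.items()}


def flag_disproportionate(baseline, current, min_ratio=2.0, min_cases=20):
    """Return complaints whose share of visits grew by at least min_ratio.

    baseline and current map complaint name -> number of visits.
    min_ratio and min_cases are illustrative thresholds, not study values.
    """
    base_share = complaint_shares(baseline)
    cur_share = complaint_shares(current)
    flagged = []
    for complaint, share in cur_share.items():
        if current[complaint] < min_cases:
            continue  # too few cases to draw any conclusion
        previous = base_share.get(complaint, 0.0)
        # A complaint absent from the baseline is flagged outright.
        if previous == 0.0 or share / previous >= min_ratio:
            flagged.append(complaint)
    return sorted(flagged)


# Made-up numbers for illustration only.
baseline = Counter({"fever": 100, "cough": 100, "diarrhea": 50, "injury": 50})
# The crowd triples but the mix of complaints is unchanged.
crowd_surge = Counter({name: 3 * n for name, n in baseline.items()})
# Same crowd surge, but cough cases climb far faster than everything else.
outbreak = Counter({"fever": 300, "cough": 1500, "diarrhea": 150, "injury": 150})

print(flag_disproportionate(baseline, crowd_surge))
print(flag_disproportionate(baseline, outbreak))
```

In this toy data, the tripled crowd produces no flags because every complaint keeps the same share of visits, while the second scenario singles out cough. A real surveillance system would need sturdier statistics than a fixed ratio.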

Using iPads, researchers collected health information from more than 50,000 patients over three weeks, and their hypothesis proved correct. For example, they identified a spike in viral respiratory problems on the busiest bathing day, when an estimated 30 million people attended the Mela. Luckily, those who became ill didn't require significant medical attention. But in the case of a more serious disease with a higher fatality rate, such as measles or cholera, they would have had the means to notify local authorities to mobilize an immediate response. Now, the team is working on bringing the system to other transient settings like refugee camps and disaster shelters. "It would be extremely useful in a humanitarian crisis," Balsari says. "It could be used on the borders of Syria, in camps where populations are constantly changing. There's a huge flux every day, with hundreds of thousands moving in and out."

PathoMap

Anyone who's ridden on the New York City subway has probably wondered (or maybe tried not to think about) what kind of germs lurk in the crowded, grimy transit system. Now a project called PathoMap is seeking to answer that very question. It examines bacteria, viruses, and other microorganisms present in high-traffic areas New Yorkers encounter every day.

A figure known as a Circos plot depicts human chromosomes spread out in a circle. At the center is the genetic data of five cancer patients, with orange/blue dots showing increased/decreased levels of DNA. Called copy-number variants, these indicate places where fragments of DNA have been rearranged, possibly driving the tumors. Image credit: Priyanka Vijay

A team of postdocs, graduate students, and student volunteers led by Christopher Mason visited 468 New York City subway stations last summer, collecting 1,404 samples by swabbing turnstiles, railings, kiosks, and benches. Thousands of additional samples were taken at the same sites last fall and winter, as well as this spring, to establish a seasonal baseline of pathogens. Mason's team then applied advanced sequencing technology to analyze the accumulated pieces of DNA. "But these fragments are like taking an entire library and shredding the pages of all the books," Mason says. "So a large component of the work after the sequencing is complete is to use large-scale computational resources to assign the fragments of DNA to the correct species."
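The article does not say which software Mason's team used, so the following is only a toy sketch of one common idea for matching shredded fragments to species: break each reference genome into short overlapping "words" (k-mers), then assign a fragment to whichever species shares the most words with it. The sequences, species names, and word length are invented for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a DNA sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of species whose reference contains it."""
    index = {}
    for species, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(species)
    return index


def assign_fragment(fragment, index, k):
    """Return the species sharing the most k-mers with the fragment, or None."""
    votes = {}
    for kmer in kmers(fragment, k):
        for species in index.get(kmer, ()):
            votes[species] = votes.get(species, 0) + 1
    if not votes:
        return None  # nothing in the references resembles this fragment
    best = max(votes.values())
    winners = sorted(s for s, v in votes.items() if v == best)
    # Ties are left unassigned rather than guessed.
    return winners[0] if len(winners) == 1 else None


# Invented toy "genomes"; real references run to millions of bases.
references = {
    "species_a": "ATGCGTACGTTAGCCGATAGGCTA",
    "species_b": "TTGACCGGTAACCGTTGACAATGC",
}
K = 6  # hypothetical word length chosen for this toy example
index = build_index(references, K)

print(assign_fragment("CGTACGTTAGCC", index, K))
print(assign_fragment("GGTAACCGTTGA", index, K))
print(assign_fragment("AAAAAAAA", index, K))
```

The first two fragments are cut from the two toy references and are assigned back to them; the third matches nothing and is left unassigned. At subway scale the same lookup has to run over vast numbers of fragments and reference genomes, which is where the large-scale computing comes in.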

Analysis is ongoing; among the initial discoveries are staphylococcus and streptococcus bacteria, Shigella boydii (found in feces and associated with dysentery), and rat DNA. Yet Mason insists those results are not any worse than what can be found in many other public places. "I'd say the subway is no more or less gross than the place where you buy your lunch," he says, "or the bathroom at Grand Central." Also, he points out, the majority of the bacteria they have found are non-pathogenic and even likely beneficial, serving as a layer of "good bacteria" that can out-compete any potential threats.

Once Mason's team builds a database of the pathogens normally found in the subway system, he hopes to develop a way to search for organisms that are out of the ordinary. This could aid in disease prevention by letting the public know which zones are trouble spots. In future stages, government agencies could use the technology to monitor the system, possibly preventing or containing a bioterror attack. "There are technologies where, in one to two hours, you can swab something and completely characterize all the DNA," says Mason. "In the future, this is the world we will likely see: 'big data' of every molecule of DNA from almost every surface of a metropolis being monitored."

This story first appeared in Weill Cornell Medicine, Vol. 13, No. 2.
