Taming the 'Genomical Beast' with Big Data Resources

By Shelley Littin, iPlant Collaborative

In a recent study published in PLoS Biology, biologists warn that the coming surge in genomic data, driven by falling sequencing costs, will strain the discipline's computational resources beyond their capacity to store, analyze, and distribute large datasets.

“Projecting to the year 2025,” the authors write, “we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a ‘four-headed beast’ – it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis.”

The authors discuss the technologies they predict will be needed to meet the computational challenges posed by genomics, including tools for data acquisition, storage, distribution, and analysis.

Yet an infrastructure is already in place to handle the projected rise in genomic data, along with any other large datasets generated in the life sciences.

Asked about the projected rise in genomic datasets, Eric Lyons, co-principal investigator of the iPlant Collaborative, said: “We are ready to meet this challenge today.”

The iPlant Collaborative, headquartered at the University of Arizona (UA), is a National Science Foundation-funded cyberinfrastructure project providing computational support to life science researchers in the form of secure data storage, services for data analysis, and the underlying infrastructure to share datasets among collaborators anywhere in the world with an Internet connection.

“Currently today we are managing 1.1 petabytes of user data, broken into 88 million data objects with associated metadata,” said Lyons, who, in addition to helping direct the iPlant Collaborative, is an assistant professor in the UA’s College of Agriculture and Life Sciences and BIO5 Institute. “Over half of these datasets are shared among two or more users,” he said. “People can input huge datasets and easily share and collaborate from anywhere on the planet.”

The iPlant Collaborative provides an array of platforms, services, and tools empowering data scientists in all disciplines.

Originally created for plant science research, the iPlant infrastructure borrowed concepts and technologies from other disciplines, including astronomy and physics. Now, Lyons said, “those groups are looking to see what iPlant has done in terms of broadening inner connections so they can take best practices we’ve pioneered addressing biological problems back to their communities, thereby completing the cycle of scientific sharing.”

The PLoS Biology article’s argument is timely, Lyons said. “We’re generating data on a massive scale because the technology to generate it is becoming faster, cheaper, and more available.” Yet, he noted: “While genome sequencing is the most visible of the data revolutions happening in biology, it’s only one of many.”

And big data revolutions are exactly what iPlant’s Data Store is built for, regardless of the science discipline. “It’s a distributed data system that appears as a unified resource,” Lyons explained. Built upon iRODS, a scalable, high-performance data-management software package, the Data Store brings data to iPlant users and allows researchers to connect their resources from anywhere in the world.

“Once data is in the iPlant Data Store, we have ensured an easy, accessible connection to different types of compute resources such as iPlant’s cloud, iPlant’s grid computing, the XSEDE supercomputing infrastructure, or a third-party cloud like Amazon,” said Lyons.

iPlant also provides access to genomics-specific tools and services. Lyons created CoGe, an online platform for comparative genomics research that provides a network of tools to compare, analyze, and visualize next-gen data – all hosted by the iPlant Collaborative’s cyberinfrastructure. CoGe currently hosts more than 24,000 genomes from more than 16,000 organisms, and its computational capacity continues to grow. It is just one of many big data platforms powered by iPlant.

“We at iPlant are focusing on building the middleware to connect big data science resources together seamlessly, with single sign-on to access your data and compute resources able to talk to one another, and making sure that these resources are connected to the end user scientist regardless of their skill level so that field, bench, and computer scientists have the resources to do quality big data research,” Lyons said.

iPlant also provides training resources to help scientists at all levels learn to use the computational tools they need to work with large data systems.

And when it comes to the problem of data storage, Lyons believes that storing processed data may be more feasible, and more valuable to science in the future, than attempting to store all raw data generated.

Michael Schatz, an associate professor of quantitative biology at Cold Spring Harbor Laboratory and adjunct assistant professor of computer science at Stony Brook University, and one of the PLoS Biology study authors, agrees. When it comes to storing data in the future, he said: “In genomics it’s going to be really important to think about aspects such as data compression, and the different analyses that need to take place. I think that’s where iPlant is going to play a role, providing data storage, compute resources, and technological interfaces – making massive amounts of computer resources usable to specialists in high performance computing.”

And if dataset numbers and sizes, in genomics and other disciplines, skyrocket in the future as predicted?

“We’re always keeping our eyes on the future,” said Lyons. “We realize that in life sciences the problem of big data is immense. iPlant has worked, and will continue working, with existing cyberinfrastructure projects regardless of the science discipline and technologies used, to be prepared for the future.”