Passing the Big Data Test

By Shelley Littin, CyVerse

Ann Stapleton has turned her laboratory into a powerhouse for workforce development, training the next generation of data scientists in cutting-edge data analysis techniques. More than a dozen students under her tutelage actively contribute to large-scale science projects, conceptualizing, developing, and testing product and analysis tutorials for CyVerse platforms.

Every new data analysis method must be tested against existing methods to ensure that it works at least as well as they do.

“This is very labor-intensive,” said Stapleton, an associate professor of biology and marine biology at the University of North Carolina at Wilmington (UNCW), “especially if you want to test a new analysis method against multiple existing methods. There’s a real benefit to building an extensible and resource-scalable system.”

Stapleton, who has a background in plant biology and genetics, began working in cyberinfrastructure in 2001 when she teamed up with colleagues to develop an infrastructure called GridNexus. The project developed a scientific workflow interface that was used for more than ten years for teaching grid computing. Stapleton soon became a pivotal figure in biology cyberinfrastructure. Now she works at the interface of teaching and research.

“We generate test data, run it through analyses, and analyze the results to ensure the analysis program works the way it’s supposed to. This is the positive control for our users’ data analysis projects,” Stapleton explained.

“What’s special about what we’re doing is that it’s scalable,” she continued.

The challenge, Stapleton said, is that “many modern analyses can’t be done on a laptop or even on the resources available at a given institution. And, there are many universities where people don’t have access to computational resources. Finally, many options require quite a bit of technical expertise and interest.”

“What CyVerse provides is the ability to scale up analyses that are too big to be done on a laptop, using the Agave API to manage the process.” The Agave API platform underlies CyVerse data management services and provides hosted services for researchers to access data and publish and share results.

Stapleton is deeply invested in preparing students for work in the real world, and strives to build her lab into what she calls a “mentoring machine.”

“We try to do everything exactly as in industry. We do all the things that happen during production: task selection, modeling, testing, tutorial writing, and aspects of software development that occur in the real world. Training students in best practices on a small scale will make it easier for them as they transition to larger operations.”

Her students have a variety of tasks: writing software code and scripts, debugging existing code, developing tutorials and documentation, maintaining the computational infrastructure, ensuring the data tests pass muster, and testing new code – just to name a few.

One such project is the CyVerse Validate workflow, which provides a versatile, scalable positive-control analysis for researchers’ genome-wide association study (GWAS) and whole-genome prediction (genomic selection, GS) tests.
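The idea of a positive control can be illustrated with a toy sketch: plant a known signal in simulated data, run an analysis, and check that the analysis finds what was planted. The Python below is not the Validate code or any CyVerse API. The function names, sample sizes, effect size, and the simple correlation scan are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(n_samples=500, n_snps=20, causal=7, effect=2.0, seed=0):
    """Simulate genotypes (0/1/2 allele counts) and a trait driven by one planted SNP.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    genotypes = [
        [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
        for _ in range(n_samples)
    ]
    # The trait is the planted SNP's effect plus Gaussian noise.
    phenotype = [effect * row[causal] + rng.gauss(0.0, 1.0) for row in genotypes]
    return genotypes, phenotype


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def top_snp(genotypes, phenotype):
    """Stand-in 'analysis': return the index of the SNP most correlated with the trait."""
    n_snps = len(genotypes[0])
    scores = [
        abs(pearson([row[j] for row in genotypes], phenotype))
        for j in range(n_snps)
    ]
    return max(range(n_snps), key=scores.__getitem__)


# Positive control: the analysis should recover the SNP we planted.
genotypes, phenotype = simulate()
recovered = top_snp(genotypes, phenotype)
print("planted SNP: 7, recovered SNP:", recovered)
```

A real association or prediction method would take the place of `top_snp`. The check stays the same: if the method cannot recover a signal that was deliberately planted, its results on real data should not be trusted.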

“With the help, information, and tutorials that we give them, and of course our code in Validate, we can get researchers started optimizing their genome-wide association and prediction analyses,” Stapleton said.

“Ultimately, we want to encourage researchers to use additional tools and add to the documentation, to make it easier for people to learn to install their own programs and manage their own user groups,” Stapleton noted, referencing CyVerse’s mantra of being a project that is “of, by, and for the community.”

Validate was initially designed by a student, Dustin Landers, who joined Stapleton’s lab shortly after adding a graduate program in applied statistics to his social sciences background.

“I was very interested in pursuing a career in data science,” Landers said. “I was extremely excited to pursue the opportunity to work with Ann on the CyVerse Validate project.”

The project required knowledge of genetics, biostatistics, machine learning, and statistical modeling, so Landers met and worked with diverse researchers and programmers as he designed Validate. “It required thinking a lot about who the end users would ultimately be, and how they would want to use such a tool.”

“Working in the Stapleton Lab helped me sharpen my skills as both a scientist and a software engineer,” Landers added. “It was beyond rewarding.” Landers now applies his skills and experiences as a data scientist for a Fortune 500 insurance company.

But Landers has not been the only one with a not-so-STEM background to find his way into Stapleton’s lab: the documentation for Validate (among other projects) was completed by Austin Allen, a creative writer and poet from UNCW’s master of fine arts program.

“I’ve always felt a bit torn between my interests in computers and technology and my inclinations towards the humanities,” Allen said. “As a writer who grew up on the Internet, I’m interested in different ways of documenting my generation’s experience with a rapidly changing technological landscape.”

“The Validate project was a fantastic chance to synthesize my interest in computer science, pedagogy, and writing,” he continued. “Making sure that the ‘why’ is in the documents as well as the ‘how’ is a challenge that transcends genre.”

Allen plans to return to the poetry classroom with a new challenge for his students: attempting a GitHub-style approach to writing assignments, in which multiple authors contribute to one another’s poems as though they are collaborating on an open-source project.

“I like the idea that poetry, like some ancient code or script for our own hardware, bodies, or souls, might not be so dissimilar from lines of programming languages such as Python or C++ – a structured yet creative human solution to a given problem,” Allen said.