A group of researchers from UC Berkeley, the Broad Institute and UC Santa Cruz was recently awarded one of three National Cancer Institute contracts to create software that stores and analyzes cancer genomic data.
According to Juli Klemm, senior scientific advisor for the institute’s Center for Biomedical Informatics and Information Technology, the Cancer Genomics Cloud Pilot contracts were granted in late September to support research and development of a system for analyzing large sets of cancer genomic data. The specific data the Broad-University of California Cloud Pilot will use is generated through the Cancer Genome Atlas, a dataset, funded by the institute, that maps genetic changes.
“We realized we needed to provide a compute infrastructure that would allow researchers who might not have access to a large research facility … to ask interesting questions of that data,” Klemm said.
According to David Patterson, a UC Berkeley computer science professor who is leading the campus research team, there have been tremendous advances and cost decreases in genetic sequencing.
Patterson said this means more people could potentially have their genomes sequenced, presenting researchers with larger datasets. To understand the implications of these huge datasets, researchers need a software tool that lets them use cloud computing to analyze the genetic data, he said.
“The potential is if we can put together 100,000 patients who all have the same kind of cancer, we can make discoveries about what kinds of drugs could work,” Patterson said. “We could come up with better treatments.”
Klemm said the project will allow scientists to better understand the molecular basis of cancer.
“We want to ultimately improve the prevention, and diagnosis and treatment of cancer,” Klemm said.
The researchers have just begun the project, which is expected to run for about two years. UC Berkeley researchers will be responsible for storing and processing data, while those from UC Santa Cruz and the Broad Institute will focus on visualizing the data and developing a platform for scientists to access the data.
Klemm noted that one of the main challenges is the sheer size of these datasets, which will total roughly 2.5 petabytes once the project is completed. These datasets are larger than any previous biomedical database, she said.
Other challenges include concerns about patient privacy, Patterson said. Genomic sequence data, which comes from human subjects, may contain personally identifiable information that is protected under subject confidentiality.
Eventually, the goal is to develop open source software for widespread use, he said.
Each contracted team will come up with a different technical method, and the institute will evaluate the various approaches after the systems are completed.
Contracts were also awarded to Seven Bridges Genomics Inc. and to the Institute for Systems Biology, which is collaborating with Google.