November 18, 2011

Analyzing massive datasets is subject of major international conference

News and Information

Seattle is host this week to the major international meeting about high-performance computing, giving UW scientists and computer specialists an opportunity to see over the horizon at developments that will influence how research is conducted for years to come.

SC11, held Nov. 12 to 18 at the Convention Center, is billed – accurately, by all accounts – as “the international conference for high-performance computing, networking, storage and analysis.” It brings together the world’s top groups from supercomputing research facilities and federal computing networks in the U.S., as well as vendors.

In past years, some groups have used the conference as an opportunity to demonstrate their prowess in pushing enormous amounts of data through fiber optic cables. While impressive bandwidth and number-crunching are still hallmarks of the meeting, participants now include a large contingent of companies and researchers whose primary interest is in working with massive amounts of data – although in the world of scientific research, such approaches may overlap.

Cloud computing booths in the exhibit hall are as large as those from more traditional vendors, says Jeff Gardner, affiliate assistant professor of physics, who works in the UW’s eScience Institute – although cloud computing vendors are still a small fraction of the exhibitors. “There is a lot of talk in the field about cloud computing as an emerging service model and what it can offer to the research community.”

Gardner points out that massive datasets have been a mainstay of many fields for a long time – for example, his own work in astrophysics has required large simulations that generate enormous amounts of information.

“We’re now moving into an age in which observation and experimental science is also capable of creating massive datasets,” he says. Advances in sensors and other data collection technologies are facilitating the creation of large and rich datasets, which opens up new possibilities in the way the data is analyzed – because researchers can employ much more sophisticated techniques of analysis, such as data mining. “As an astrophysicist who works analyzing data from simulations, I came up against the limitations of working on other computing systems, and sometimes had to scale back on the simulations themselves, which is not the best way to do science. Now, many more disciplines are ‘hitting the wall’ in terms of data analysis, and that’s driving innovation.”

The central feature of the conference remains what are called “leadership class” computer systems – the few hundred biggest and fastest machines, capable of working on the thorniest and most computationally intensive problems. In the U.S., these systems are often operated by federal agencies including the Department of Energy, the National Oceanic and Atmospheric Administration, the National Aeronautics and Space Administration and the National Science Foundation.

While those public facilities remain critically important to the research community, parallel growth is occurring in the private sector, which often conducts its own research with a decidedly market-oriented angle. For example, Toyota and other auto makers create very sophisticated and complex simulations of crashes, employing thousands of sensors that generate a seemingly never-ending flow of information.

“There is a global trend emphasizing not just computation but the ability to process gobs of data,” says Bill Howe, senior scientist at the eScience Institute and affiliate assistant professor of computer science and engineering. “It’s not an overstatement to say that nearly every biologist, for example, will need to become proficient in data analysis to be competitive in that field. Almost anywhere you look now in science, research can generate terabytes of information.”

Howe says the big push into cloud computing, largely on commercial platforms, is perfect for addressing data-intensive problems. “We’re now able to purchase computer power effectively, so that buying one hour of time on 1,000 computers costs about the same as buying time on one computer for 1,000 hours.”

SC11 provides a great opportunity for bringing together people from the database community with experts on high-performance computing. “The time is right for these discussions,” Howe says.

Commercial interests in finding patterns among masses of data parallel those in the research community. “In the past, only large companies such as Walmart were able to collect and analyze massive datasets of customer behavior. But as computing technology proliferates and costs drop, even small companies are developing predictive models with commercial applications,” he says.

These emerging needs are being recognized by the heavyweights in cloud computing, which will likely develop solutions that work for businesses and a segment of the research community, Howe says.

It is true that much of the development of data analysis techniques and technologies is occurring in the private sector, in companies like Google and Microsoft. Some believe that computer scientists at universities will become unnecessary in this arena, as the private technology companies are likely to have access to computing and storage resources that are many orders of magnitude greater than those available at universities. But Gardner believes that it is still up to the funding agencies and research institutions like the UW to drive innovation in scientific computing and data analysis. “The research computing market is not huge,” he says, “and this community is likely to drive much innovation that meets the needs of this specialized group.”

Meanwhile, holding this meeting in Seattle is made possible in part by a robust network infrastructure developed by UW-IT and Pacific Northwest Gigapop, a nonprofit corporation serving research and education organizations throughout the Pacific Rim. The network is providing high-capacity access from the Convention Center, where the meeting is being held, to the Westin Building, which is the regional nexus for national and international research and education networks. UW-IT donated the fiber between the Westin and the Convention Center, which supports the networks.

The conference will boast networks that are ten times as fast as the fastest circuits commonly used on campus, according to Jan Eveleth, managing director of research networks for UW-IT. In addition, the UW and Pacific Northwest Gigapop are providing telecommunications equipment racks for vendors and other participants to install gear for the advanced high-end optical equipment showcased at the event. UW eScience experts will be at the event to answer questions from researchers about working with large datasets in a variety of fields.