UW News

July 22, 2010

Scandinavian scholar one of 12 to get Google grant seeking to transform literary scholarship

A UW doctoral candidate is participating in a history-making research effort that unites the worlds of technology and the humanities.

Peter Leonard, who is working on his dissertation in Scandinavian studies, and his research partner, Tim Tangherlini of UCLA, have received a grant from Google for a project titled Northern Insights: Tools & Techniques for Automated Literary Analysis, Based on the Scandinavian Corpus in Google Books.

The two won a $45,000 grant, one of just 12 given by Google to encourage scholars to explore the corpus of 12 million digitized books in more than 300 languages it has amassed in Google Books. It is the company’s first effort to support such research.

For Leonard and Tangherlini, Google Books offers access to about 160,000 texts in the Scandinavian languages of Danish, Swedish and Norwegian. That’s a relatively small number compared with what is available in a language like English, because speakers of these languages are almost entirely confined to their native countries. But the relative smallness of the Scandinavian literary world is an advantage for the project, according to Leonard.

“We wanted to use the benefits of a small, tri-national canon as a kind of sandbox, a test bed to develop tools and techniques for figuring out what works in this burgeoning new field of electronic text analysis,” he said.

Electronic text analysis is the term used to describe projects in which humanities scholars turn computers loose on a large number of texts to answer particular questions. Whereas traditional humanities scholarship is based on deep reading of a few texts, the availability of scanned texts and of new tools to search and analyze them makes quantitative research more feasible.

One of Google’s motives in offering the grants is to encourage the development of new tools to work with the texts. “Digitization is just the starting point,” the company said in its news release announcing the grants. “It will take a vast amount of work by scholars and computer scientists to analyze these digitized texts.”

Leonard and Tangherlini are well suited to pursuing this kind of scholarship. Tangherlini, professor and chair of the Scandinavian section at UCLA, is well known in the field for similar work on Danish folklore. A recent research project, for example, involved the use of a computer program to identify ghost stories among a body of Danish folk tales. Leonard, meanwhile, worked in the computer industry for a number of years before deciding to pursue graduate studies.

Leonard learned about the Google opportunity through a mailing list he participates in and approached Tangherlini as a partner because of the latter’s expertise. The project the two are pursuing under the Google grant involves seeking out networks of influence in Scandinavian literature.

“We’re asking, ‘What are the ways that authors we’ve always put together in a box turn out to be distant from each other and other authors we thought were very distant from each other actually connect in interesting ways?’” Leonard said. “For example, we could look at the way they’re talking about nature, or the way they’re talking about themes involved in folklore. It might be the way they use adjectives in connection with nouns. All that sounds impossible; if you read through hundreds of novels, how would you even know this? But the answer is, you can get computers to do this for you.”
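As a rough illustration of the kind of comparison Leonard describes, the sketch below counts how often a small set of nature words appears in two passages and scores how similar the resulting profiles are. It is not the researchers' method: the function names, the tiny Norwegian word list (skog, hav, fjell: forest, sea, mountain) and the sample sentences are all invented for this example, and cosine similarity is just one common way to compare word counts.

```python
import math
from collections import Counter

# Hypothetical theme vocabulary, invented for illustration only.
NATURE_WORDS = {"skog", "hav", "fjell"}

def theme_profile(text, theme_words):
    """Count occurrences of each theme word in a text (simple whitespace tokens)."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return Counter(t for t in tokens if t in theme_words)

def cosine(a, b):
    """Cosine similarity of two count profiles; 0.0 if either profile is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up sample passages standing in for two authors' texts.
author_a = theme_profile("Skog og hav, skog.", NATURE_WORDS)
author_b = theme_profile("hav hav fjell", NATURE_WORDS)
similarity = cosine(author_a, author_b)
```

Run over hundreds of novels rather than two sentences, pairwise scores like this could serve as the links in a network of authors, though a real project would likely need far richer features than raw word counts.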

Along the way the two will be developing tools for doing this kind of network analysis, as well as making it possible to do the analysis across multiple related languages. Linguists describe the Scandinavian languages as a dialect continuum rather than three separate languages because they overlap so much, Leonard said. And the literary worlds of the three countries overlap too. Therefore, it would be very useful for scholars to have a tool that could search texts in all three languages simultaneously. To do that, Leonard and Tangherlini will need to create a tool that recognizes different forms of the same word.
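One simple way to picture a tool that recognizes different forms of the same word is to fold spelling differences onto a shared lookup key. The sketch below is only an assumption about how such a tool might start, not a description of what Leonard and Tangherlini built: the names `FOLD`, `fold_key` and `group_variants` are hypothetical, and the rules cover just three well-known orthographic differences (Swedish ä and ö versus Danish and Norwegian æ and ø, and the older spelling aa for å).

```python
import unicodedata

# Hypothetical folding table: Swedish letters mapped to their
# Danish/Norwegian counterparts so variants share one key.
FOLD = str.maketrans({"ä": "æ", "ö": "ø"})

def fold_key(word):
    """Return a crude cross-language lookup key for a word."""
    key = unicodedata.normalize("NFC", word).lower()
    key = key.replace("aa", "å")  # older spelling of å; will misfire on genuine "aa"
    return key.translate(FOLD)

def group_variants(words):
    """Group words that fold to the same key, preserving input order."""
    groups = {}
    for w in words:
        groups.setdefault(fold_key(w), []).append(w)
    return groups

variants = group_variants(["bär", "bær", "öl", "øl", "Aarhus", "Århus", "sjö", "sjø"])
```

Here the Swedish and Norwegian spellings of "lake" (sjö, sjø) land in one group, but the Danish form sø would not, which is why a real tool would need dictionary or statistical methods beyond letter substitution.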

Google Books has stirred controversy because some of the books included are still in copyright. In 2005 the Authors Guild filed a lawsuit accusing Google of massive copyright infringement. The suit was settled in 2008 with Google agreeing to a $125 million payout, but the settlement is still awaiting approval by a federal court.

Leonard called the lawsuit a “sticky wicket,” pointing out that in-copyright books are presented to the general reader only in snippet form and may actually lead to more book sales.

“I interned at the University of Copenhagen Press, and instead of suing, they’ve become a partner in Google Books. They gave their books to be scanned, and when people get the few sentences, there’s a link there to buy the book,” he said.

The proposed settlement would permit the use of in-copyright works owned by universities for “nonconsumptive” computational research, the kind that Leonard and Tangherlini are doing.

Leonard calls Google Books “a really important, historically unprecedented collection of printed texts,” enough to make any researcher salivate.

“It’s rare when someone transforms a technique of literary scholarship, and this is that moment,” he said. “Google, for better or worse, has put a lot of money into digitizing books. In a sense it’s more of a transformation than the electronic card catalog. With the catalog, you still had to go get the book. This is the book in front of your eyes, as long as it’s pre-1923 copyright, as long as you read the language, as long as you have the cultural and linguistic knowledge to know what you’re reading.”

The grant runs for one year, with the possibility of renewal for a second year.