UW News

February 13, 2004

Researchers target computer recognition of speech as the next-generation data source

In the not-too-distant future, if you miss a meeting, you’ll likely be able to check a database prepared by a computerized secretary that recorded, indexed and stored the event in such a way that you can search for the main topics of discussion, find out who committed to do what, determine participants’ stances on the topics at hand or pinpoint courses of action.

And the same capabilities could be available for television and radio news broadcasts, teleconferences, lectures, call centers, courtroom proceedings and instantaneous foreign language translation, according to Mari Ostendorf, professor in the Department of Electrical Engineering at the University of Washington and a leading researcher in the field of computer voice recognition.

Ostendorf addressed the topic during an afternoon press briefing today at the annual meeting of the American Association for the Advancement of Science, the world’s largest scientific meeting, held this year in Seattle. She is scheduled to give the overview presentation for a Saturday symposium titled “Scientific Problems Facing Speech Recognition Today.”

“If you think of the amount of time you spend talking as opposed to reading documents, you’ll realize that you spend much more time talking,” Ostendorf said. “We have this speech data that is a huge potential information source, and it’s largely untapped. It really is the next-generation data source.”

This, she said, is the next critical dimension that computer speech recognition experts have set their sights on.

But there are some daunting obstacles. Computers already do a fairly good job of recognizing human speech when a person is talking directly to a computer. “We’re willing to talk differently, so that the computer can understand, because we have an objective,” Ostendorf said. But human-to-human speech is different.

“When people talk to one another, they speed up, they slow down, they get excited, they get bored, they show emotion, the pitch goes up, the pitch goes down, there are interruptions, the speech overlaps, the speaker changes – there is a lot of variability,” she said. There are “ums” and “ahs,” repetition and hesitation. It’s not just a matter of what we say, but how we say it.

“We don’t notice these disfluencies – they just pass us by; we filter them out,” Ostendorf said. “But they are there. And a computer has to do something with them.”
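To make the idea concrete, the simplest conceivable treatment of disfluencies is to delete fillers and immediate repetitions from the raw word stream. The short Python sketch below does only that; the filler list is invented for illustration, and research systems model disfluencies statistically rather than simply discarding them.

```python
# A minimal sketch of one naive way a program might "do something" with
# disfluencies: drop filler words and immediately repeated words from a raw
# transcript. The filler list is illustrative, not from any real system.
FILLERS = {"um", "uh", "ah", "er"}

def clean_transcript(words):
    cleaned = []
    for word in words:
        if word.lower() in FILLERS:
            continue                              # skip fillers
        if cleaned and word.lower() == cleaned[-1].lower():
            continue                              # skip immediate repetitions
        cleaned.append(word)
    return cleaned

print(clean_transcript("so um we we agreed uh to ship it Friday".split()))
# -> ['so', 'we', 'agreed', 'to', 'ship', 'it', 'Friday']
```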

Researchers have developed computer programs that look at speech according to its most basic sounds. The programs identify the sounds and how they are sequenced, then use a probability model to decide what the words most likely are. That still leaves out a big chunk of the equation.
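The kind of probabilistic decoding described here can be illustrated with a toy sketch, not Ostendorf’s actual system: a tiny pronunciation lexicon, invented word probabilities, and dynamic programming to pick the word sequence that best explains a string of sounds.

```python
# A toy illustration of probabilistic word decoding. The words, phones and
# probabilities are all invented; real recognizers score thousands of words
# against acoustic and language-model probabilities in the same spirit.
import math

LEXICON = {                      # word -> (phone sequence, prior probability)
    "the": (("dh", "ah"), 0.30),
    "a":   (("ah",), 0.20),
    "cat": (("k", "ae", "t"), 0.25),
    "at":  (("ae", "t"), 0.25),
}

def decode(phones):
    """Return the most likely word sequence covering the observed phones."""
    n = len(phones)
    # best[i] = (log probability, word list) for the best parse of phones[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i][0] == -math.inf:
            continue
        for word, (pron, prior) in LEXICON.items():
            j = i + len(pron)
            if tuple(phones[i:j]) == pron:            # pronunciation matches here
                score = best[i][0] + math.log(prior)  # accumulate log probability
                if score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
    return best[n][1]

print(decode(["dh", "ah", "k", "ae", "t"]))  # -> ['the', 'cat']
```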

“If you’re just recognizing the words, that would be like taking a piece of written text and taking out all the capitalizations, all the punctuation, all the paragraph boundaries,” Ostendorf said.

The latest work examines ways to get the computer to also look at larger swaths of conversation in terms of how the speech is delivered. In other words, people give all sorts of clues in speech that indicate how their communication is structured, and often those clues are spread out over whole syllables, several words or a phrase. For example, speakers often raise their pitch for a sentence that changes the topic, much as a new paragraph signals a change in written text.
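As a rough, hypothetical sketch of that idea, the code below flags a sentence as a possible topic boundary when its average pitch jumps well above that of the preceding sentences. The feature and the threshold are invented for illustration; actual research combines many prosodic and lexical cues.

```python
# A rough sketch of the idea that a pitch rise can mark a topic change.
# The feature (per-sentence mean pitch) and the 15% threshold are invented.
def flag_topic_shifts(sentence_pitches, rise_threshold=1.15):
    """Return indices of sentences whose mean pitch jumps well above
    the average of the preceding sentences."""
    shifts = []
    for i in range(1, len(sentence_pitches)):
        previous_mean = sum(sentence_pitches[:i]) / i
        if sentence_pitches[i] > rise_threshold * previous_mean:
            shifts.append(i)
    return shifts

# Mean pitch (Hz) per sentence in a hypothetical recording: the jump at
# index 3 would be flagged as a possible topic boundary.
print(flag_topic_shifts([182, 175, 178, 230, 210]))  # -> [3]
```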

So just how far away are we from computers that can reliably follow the vagaries of human speech?

“In terms of getting the words for, say, broadcast news, we’re probably less than five years away from getting the words right,” Ostendorf said. Getting at the structure and meaning so the speech can be summarized is a harder proposition.

“We’re really just beginning this aspect of it, so I’m hesitant to speculate on a timeframe,” she said. “But the field is moving incredibly fast now. It’s likely to be sooner than many people think.”

###

For more information, contact Ostendorf at (206) 221-5748 or mo@ee.washington.edu.