UW News

February 19, 2004

Computer translations of spoken word — the new data source

In the not-too-distant future, if you miss a meeting, you’ll likely be able to check a database prepared by a computerized secretary that recorded, indexed and stored the event in such a way that you can search for the main topics of discussion, find out who committed to do what, determine participants’ stances on the topics at hand or pinpoint courses of action.

And the same capabilities could be available for television and radio news broadcasts, teleconferences, lectures, call centers, courtroom proceedings and instantaneous foreign language translation, according to Mari Ostendorf, professor in the Department of Electrical Engineering at the UW and a leading researcher in the field of computer voice recognition.

“If you think of the amount of time you spend talking as opposed to reading documents, you’ll realize that you spend much more time talking,” Ostendorf said. “We have this speech data that is a huge potential information source, and it’s largely untapped. It really is the next generation data source.”

This, she said, is the next critical dimension that computer speech recognition experts have set their sights on.

But there are some daunting obstacles. Computers already do a fairly good job of recognizing human speech when a person is talking directly to a computer. “We’re willing to talk differently so that the computer can understand, because we have an objective,” Ostendorf said. But human-to-human speech is different.

“When people talk to one another, they speed up, they slow down, they get excited, they get bored, they show emotion, the pitch goes up, the pitch goes down, there are interruptions, the speech overlaps, the speaker changes — there is a lot of variability,” she said. There are “ums” and “ahs,” repetition and hesitation. It’s not just a matter of what we say, but how we say it.

“We don’t notice these disfluencies — they just pass us by; we filter them out,” Ostendorf said. “But they are there. And a computer has to do something with them.”
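As a concrete illustration of one simple thing a computer might do with disfluencies, the toy Python sketch below strips common fillers and immediate word repetitions from a raw transcript. The filler list and rules here are invented for illustration; real disfluency handling is considerably more sophisticated.

```python
# A minimal, hypothetical sketch of disfluency cleanup: drop common
# filler words and collapse immediate repetitions in a raw transcript.

FILLERS = {"um", "uh", "ah", "er"}

def clean_transcript(words):
    """Drop filler words and collapse immediate word repetitions."""
    cleaned = []
    for word in words:
        if word.lower() in FILLERS:
            continue  # skip fillers like "um" and "ah"
        if cleaned and cleaned[-1].lower() == word.lower():
            continue  # collapse stutters like "we we"
        cleaned.append(word)
    return cleaned

raw = "so um we we decided uh to to ship it".split()
print(" ".join(clean_transcript(raw)))  # -> "so we decided to ship it"
```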

Researchers have developed computer programs that analyze speech in terms of its most basic sounds. The programs identify the sounds and how they are sequenced, then use a probability model to decide what the words most likely are. That still leaves out a big chunk of the equation.
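To make that idea concrete, here is a minimal Python sketch of the kind of probability model she describes: toy acoustic scores for competing word hypotheses are combined with a toy language model, and a search keeps the most likely word sequence. All of the words and probabilities are invented for illustration; real recognizers apply the same principle at vastly larger scale.

```python
import math

# Hypothetical acoustic scores: for each word slot, candidate words
# with an invented P(observed sounds | word).
acoustic = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Hypothetical bigram language model: P(word | previous word).
bigram = {
    ("<s>", "recognize"): 0.7,
    ("<s>", "wreck a nice"): 0.3,
    ("recognize", "speech"): 0.8,
    ("recognize", "beach"): 0.2,
    ("wreck a nice", "speech"): 0.1,
    ("wreck a nice", "beach"): 0.9,
}

def decode(acoustic, bigram):
    """Search over word slots, combining acoustic and language-model
    scores in log space, and keep the best-scoring path (Viterbi)."""
    # paths maps last word -> (log probability, word sequence so far)
    paths = {"<s>": (0.0, [])}
    for slot in acoustic:
        new_paths = {}
        for word, p_acoustic in slot.items():
            best = None
            for prev, (logp, seq) in paths.items():
                p_lm = bigram.get((prev, word), 1e-6)  # floor unseen bigrams
                score = logp + math.log(p_lm) + math.log(p_acoustic)
                if best is None or score > best[0]:
                    best = (score, seq + [word])
            new_paths[word] = best
        paths = new_paths
    return max(paths.values())

score, words = decode(acoustic, bigram)
print(" ".join(words))  # -> "recognize speech", not "wreck a nice beach"
```

Notice that the acoustic scores alone cannot separate “speech” from “beach”; it is the language model’s knowledge of which word sequences are plausible that tips the decision.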

“If you’re just recognizing the words, that would be like taking a piece of written text and taking out all the capitalization, all the punctuation, all the paragraph boundaries,” Ostendorf said.

The latest work examines ways to get the computer to look at larger swaths of conversation in terms of how the speech is delivered. In other words, people give all sorts of clues in speech that indicate how their communication is structured, and often those clues are spread out over whole syllables, several words or a phrase. For example, speakers often raise their pitch for a sentence that changes the topic, much as a new paragraph signals a shift in written text.
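As an illustration of that idea, the toy Python sketch below (not Ostendorf’s actual method) flags a sentence as a possible topic change when its average pitch jumps well above the speaker’s running average. The pitch values are invented; a real system would extract them from the audio.

```python
def flag_topic_changes(sentence_pitches, threshold=1.2):
    """Flag sentences whose mean pitch (Hz) exceeds the running
    average of the preceding sentences by the given ratio."""
    flags = []
    history = []
    for pitch in sentence_pitches:
        if history:
            avg = sum(history) / len(history)
            flags.append(pitch > threshold * avg)
        else:
            flags.append(False)  # nothing to compare the first sentence to
        history.append(pitch)
    return flags

# Invented mean pitch per sentence for a hypothetical speaker: the jump
# at sentence 4 suggests a new topic, like a paragraph break in text.
pitches = [180, 175, 178, 230, 185]
print(flag_topic_changes(pitches))  # [False, False, False, True, False]
```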

So just how far away are we from computers that can reliably follow the vagaries of human speech?

“In terms of getting the words for, say, broadcast news, we’re probably less than five years away from getting the words right,” Ostendorf said. Getting at the structure and meaning so the speech can be summarized is a harder proposition.

“We’re really just beginning this aspect of it, so I’m hesitant to speculate on a timeframe,” she said. “But the field is moving incredibly fast now. It’s likely to be sooner than many people think.”