The audio content of multimedia presentations is inaccessible to people who are unable to hear. If content is presented aurally, the accessibility solution is captioning, which provides a synchronized text alternative to the audio track. For additional general information about captioning, see How do I make multimedia accessible?
Many educational entities produce large quantities of video for their distance learning programs, outreach, marketing, and other functions. A growing number of institutions are also turning to multimedia as a means of enhancing their Web-based curricula. The cost of captioning all this video and multimedia content has many institutions concerned and exploring their options. Many institutions outsource captioning on an as-needed basis, but they must take care to ensure that they receive the accessible media in a timely fashion, and prompt turnaround often comes at additional cost. Other institutions are developing the expertise to provide captioning in-house.
Researchers continue to explore options for automating portions of the captioning process. Some educational entities and other organizations are using products or services that utilize some degree of automated captioning.
The best-case scenario would be fully automated captioning using speech recognition technology. Unfortunately, current technology is not accurate enough to fully support this approach. However, research and development toward this goal has been fueled by a rapidly growing market for video search and archival systems. In order to archive and index digital multimedia so that users can search its content, at least a portion of that content needs to be text-based. The first company to utilize speech recognition in this market is Virage®, whose VideoLogger™ application uses speech recognition to capture text from a video, which it then uses to build a structured searchable index. However, because of the accuracy limitations of speech recognition, this tool cannot yet be used to generate entire caption tracks; it is used instead to extract sets of keywords, including only those words that the software can interpret with a high level of confidence.
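The keyword-extraction approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how VideoLogger or any other product actually works. The input format (a word, a start time in seconds, and a confidence score between 0 and 1), the function name build_keyword_index, the 0.9 threshold, and the short stopword list are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed stopword list for illustration; a real index would use a larger one.
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "is"})


def build_keyword_index(words, min_confidence=0.9, stopwords=STOPWORDS):
    """Map each confidently recognized keyword to the times it was spoken.

    `words` is an iterable of (word, start_seconds, confidence) tuples,
    a hypothetical format standing in for a speech recognizer's output.
    """
    index = defaultdict(list)
    for word, start, confidence in words:
        token = word.lower()
        # Keep only words the recognizer is confident about,
        # and skip common words that are useless for search.
        if confidence >= min_confidence and token not in stopwords:
            index[token].append(start)
    return dict(index)


sample = [
    ("The", 0.0, 0.99),         # stopword: dropped
    ("captioning", 0.4, 0.95),
    ("of", 1.1, 0.97),          # stopword: dropped
    ("lectures", 1.3, 0.62),    # low confidence: dropped
    ("captioning", 5.0, 0.91),
]

print(build_keyword_index(sample))  # {'captioning': [0.4, 5.0]}
```

Note that the low-confidence word is left out of the index entirely rather than guessed at, which is why output of this kind can support searching but cannot serve as a complete caption track.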
A number of other companies have entered the market and are currently at various stages in developing semi-automated solutions. Perhaps most notable among these is Scansoft® Dragon AudioMining™, from the maker of popular speech recognition product Dragon NaturallySpeaking®. Scansoft's Text Captioning Solutions page does a good job of describing current approaches to using speech recognition for captioning, including the technology's limitations.
The greatest limitation of speech recognition technology is that it is accurate only in optimal situations: the speaker has devoted time to training the software to recognize his or her speech patterns, the audio quality and the acoustics of the recording environment are excellent, and distracting background noises are minimal. Few multimedia presentations meet these criteria. While fully automated captioning may not currently be possible, speech recognition can still play a significant role both in producing a transcript and in creating captions from an existing transcript.
The first step in captioning multimedia is creating a transcript of the audio content. Speech recognition technology has become a widely used tool for transcriptionists. In a process called shadow speaking, the transcriptionist (who has trained the speech recognition software to understand his or her speech) simply speaks along with the audio, repeating what the speaker is saying. Products like the CPC-500 Voice Captioner™ support the shadow speaking process for real-time captioning of live events. Transcriptionists who are creating transcripts to be converted into captions will typically use an off-the-shelf speech recognition product such as Dragon NaturallySpeaking.
If a transcript already exists, products or services like CaptionSync™ by Automatic Sync Technologies can effectively use speech recognition to create captions from the existing transcript. This is possible, whereas fully automated captioning is not, because the speech recognition engine only needs to identify when a known word or phrase was spoken, which is a much easier task than identifying what was spoken. CaptionSync is provided as a web-based service: customers upload a video file and transcript, and within minutes receive a caption file via email.
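To illustrate why timing a known transcript is easier than recognizing speech from scratch, the following Python sketch borrows timestamps from imperfect recognizer output and attaches them to the words of a correct transcript. It is a simplified stand-in for the general idea and is not a description of how CaptionSync works. The function name align_transcript and the (word, start_seconds) input format are assumptions; the matching uses SequenceMatcher from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def align_transcript(transcript_words, recognized):
    """Return a (word, start_time_or_None) pair for every transcript word.

    `recognized` is a list of (word, start_seconds) pairs from a
    hypothetical speech recognizer. Some of its words may be wrong,
    but the words it got right carry usable timestamps.
    """
    known = [w.lower() for w in transcript_words]
    heard = [w.lower() for w, _ in recognized]
    times = [None] * len(known)

    # Find runs of words that appear in the same order in both sequences.
    matcher = SequenceMatcher(a=known, b=heard, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            times[block.a + offset] = recognized[block.b + offset][1]

    return list(zip(transcript_words, times))


transcript = ["Welcome", "to", "the", "accessibility", "lecture"]
recognized = [
    ("welcome", 0.0),
    ("to", 0.5),
    ("de", 0.7),             # misrecognized "the"
    ("accessibility", 0.9),
    ("lecher", 1.6),         # misrecognized "lecture"
]

for word, start in align_transcript(transcript, recognized):
    print(word, start)
```

Misrecognized words simply receive no timestamp here. Because the transcript supplies the correct text, a fuller implementation could estimate the missing times from neighboring words and then group the words into caption cues.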
Copyright © 2002 - 2009 by University of Washington. Permission is granted to copy these materials for educational, noncommercial purposes provided the source is acknowledged. For more information see the larger AccessIT Copyright Statement. AccessIT was funded by the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education (grant #H133D010306) through September 30, 2006; it is now maintained with funding from the National Science Foundation (grant #CNS-054061S). The contents do not necessarily represent the policies of the U.S. federal government, and you should not assume their endorsement.