
Showing papers on "Closed captioning published in 2012"


Proceedings ArticleDOI
07 Oct 2012
TL;DR: This paper introduces a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real-time on-demand, and presents Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time.
Abstract: Real-time captioning provides deaf and hard of hearing people immediate access to spoken language and enables participation in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable source of real-time captions is expensive stenographers who must be recruited in advance and who are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on-demand, but its low accuracy, high noise sensitivity, and need for training beforehand render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real-time on-demand. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real-time, and a captioning interface designed to encourage coverage of the entire audio stream. Evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective solution for captioning, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model in which multiple workers contribute partial inputs that are automatically merged in real-time may be extended to allow dynamic groups to surpass constituent individuals (even experts) on a variety of human performance tasks.
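The paper's own merging algorithm is not reproduced in the abstract; as a minimal illustrative sketch (assuming each worker submits timestamped words, which is an assumption rather than the Legion:Scribe input format), partial captions from several workers could be combined by bucketing words in time and voting:

```python
from collections import Counter, defaultdict

def merge_partial_captions(partial_streams, bucket_ms=500):
    """Merge timestamped (time_ms, word) pairs from several workers.

    Illustrative sketch only: words are grouped into fixed-width time
    buckets and the most frequently typed word in each bucket wins.
    The actual Legion:Scribe merging algorithm is more sophisticated.
    """
    buckets = defaultdict(Counter)
    for stream in partial_streams:          # one stream per worker
        for time_ms, word in stream:
            buckets[time_ms // bucket_ms][word.lower()] += 1

    merged = []
    for bucket in sorted(buckets):
        word, _count = buckets[bucket].most_common(1)[0]
        merged.append(word)
    return " ".join(merged)

# Example: three workers each caught part of "real time captions help"
workers = [
    [(0, "real"), (500, "time"), (1500, "help")],
    [(0, "real"), (1000, "captions")],
    [(500, "time"), (1000, "captions"), (1500, "help")],
]
print(merge_partial_captions(workers))  # -> "real time captions help"
```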

213 citations


Patent
22 Jun 2012
TL;DR: In this paper, real-time metadata tracks recorded to media streams allow search and analysis operations in a variety of contexts and allow insertion of content specific advertising during appropriate portions of a media stream based on the content of the metadata tracks.
Abstract: Real-time metadata tracks recorded to media streams allow search and analysis operations in a variety of contexts. Search queries can be performed using information in real-time metadata tracks such as closed captioning, subtitle, statistical, and miscellaneous data tracks. Media streams can also be augmented with additional tracks. The metadata tracks not only allow efficient searching and indexing, but also allow insertion of content-specific advertising during appropriate portions of a media stream based on the content of the metadata tracks.

36 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: This approach couples the usage of off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that smartly introduces unique audio markups into the audio stream before giving it to the ASR and transforms the plain transcript produced by theASR into a timecoded transcript.
Abstract: The simple act of listening or of taking notes while attending a lesson may represent an insuperable burden for millions of people with some form of disabilities (e.g., hearing impaired, dyslexic and ESL students). In this paper, we propose an architecture that aims at automatically creating captions for video lessons by exploiting advances in speech recognition technologies. Our approach couples the usage of off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that smartly introduces unique audio markups into the audio stream before giving it to the ASR and transforms the plain transcript produced by the ASR into a timecoded transcript.
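The abstract does not detail the markup mechanism, so the following sketch only illustrates the alignment idea under assumed conditions: a unique marker is injected every few seconds before recognition, the ASR emits it as a known token, and words between markers receive interpolated timecodes.

```python
def timecode_transcript(asr_tokens, marker_token="MARKER", marker_interval_s=10.0):
    """Turn a plain ASR transcript into a roughly timecoded one.

    Sketch under the assumption that a unique audio markup was injected
    every `marker_interval_s` seconds before recognition and that the ASR
    emits it as `marker_token`.  Words between two markers get evenly
    spaced timestamps inside that interval (a simplification of the
    paper's alignment mechanism).
    """
    timecoded, segment, marker_index = [], [], 0
    for token in asr_tokens + [marker_token]:      # flush the last segment too
        if token == marker_token:
            start = marker_index * marker_interval_s
            step = marker_interval_s / max(len(segment), 1)
            for i, word in enumerate(segment):
                timecoded.append((round(start + i * step, 2), word))
            segment, marker_index = [], marker_index + 1
        else:
            segment.append(token)
    return timecoded

tokens = "hello class MARKER today we discuss captions MARKER".split()
print(timecode_transcript(tokens))
```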

36 citations


Proceedings ArticleDOI
29 Feb 2012
TL;DR: The development and evaluation of the ICS Videos framework and an assessment of its value as an academic learning resource are reported.
Abstract: Videos of classroom lectures have proven to be a popular and versatile learning resource. This paper reports on videos featuring Indexing, Captioning, and Search capability (ICS Videos). The goal is to allow a user to rapidly search and access a topic of interest, a key shortcoming of the standard video format. A lecture is automatically divided into logical indexed video segments by analyzing video frames. Text is automatically identified with OCR technology enhanced with image transformations to drive keyword search. Captions can be added to videos. The ICS video player integrates indexing, search, and captioning in video playback and has been used by dozens of courses and thousands of students. This paper reports on the development and evaluation of the ICS Videos framework and an assessment of its value as an academic learning resource.
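As a rough sketch of how OCR-derived text could drive keyword search over indexed segments (the data layout here is an assumption, not the ICS Videos implementation):

```python
def build_keyword_index(segments):
    """Build an inverted index from OCR text to video segments.

    Sketch of the search side: each indexed segment carries the text
    recognised in its frames, and a keyword maps to the start times of
    the segments in which it appears.
    """
    index = {}
    for segment in segments:
        for word in set(segment["ocr_text"].lower().split()):
            index.setdefault(word, []).append(segment["start_s"])
    return index

segments = [
    {"start_s": 0,   "ocr_text": "Binary search trees overview"},
    {"start_s": 310, "ocr_text": "Balancing binary trees AVL rotations"},
]
index = build_keyword_index(segments)
print(index["binary"])   # -> [0, 310]: jump points for the keyword "binary"
```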

32 citations


Journal ArticleDOI
TL;DR: This article considers the current state of closed captioning for online videos, in the U.S. context, and argues that captions and deafness have long been associated with the private, complicating their advancement under civil rights laws concerned with the public sphere and facilitating advancement through telecommunications laws and notions of consumer choice.
Abstract: This article considers the current state of closed captioning for online videos, in the U.S. context. As media access is foundational to cultural citizenship, captions and similar accessibility fea...

30 citations


Proceedings ArticleDOI
22 Oct 2012
TL;DR: This paper presents methods for quickly identifying workers who are producing good partial captions and estimating the quality of their input and evaluates these methods in experiments run on Mechanical Turk.
Abstract: Approaches for real-time captioning of speech are either expensive (professional stenographers) or error-prone (automatic speech recognition). As an alternative approach, we have been exploring whether groups of non-experts can collectively caption speech in real-time. In this approach, each worker types as much as they can and the partial captions are merged together in real-time automatically. This approach works best when partial captions are correct and received within a few seconds of when they were spoken, but these assumptions break down when engaging workers on-demand from existing sources of crowd work like Amazon's Mechanical Turk. In this paper, we present methods for quickly identifying workers who are producing good partial captions and estimating the quality of their input. We evaluate these methods in experiments run on Mechanical Turk in which a total of 42 workers captioned 20 minutes of audio. The methods introduced in this paper were able to raise overall accuracy from 57.8% to 81.22% while keeping coverage of the ground truth signal nearly unchanged.
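The paper's quality-estimation methods are not spelled out in the abstract; one simple proxy, sketched below under the assumption that workers' words can be bucketed in time, scores each worker by agreement with the rest of the group:

```python
def worker_agreement_scores(worker_words):
    """Estimate input quality by agreement with the other workers.

    `worker_words` maps a worker id to the set of (time_bucket, word)
    pairs they typed.  A worker's score is the fraction of their pairs
    that at least one other worker also produced -- a rough proxy for
    the quality-estimation methods described in the paper, not the
    actual algorithm.
    """
    scores = {}
    for worker, words in worker_words.items():
        others = set().union(*(w for k, w in worker_words.items() if k != worker))
        scores[worker] = len(words & others) / len(words) if words else 0.0
    return scores

workers = {
    "w1": {(0, "real"), (1, "time"), (2, "captions")},
    "w2": {(0, "real"), (2, "captions"), (3, "xyzzy")},   # one likely error
    "w3": {(1, "time"), (2, "captions")},
}
print(worker_agreement_scores(workers))  # w2 scores lowest
```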

29 citations


Proceedings ArticleDOI
22 Oct 2012
TL;DR: This study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, ASR, and crowd captioning software, respectively, and found that the readers preferred crowd captions over professional captions and ASR.
Abstract: Deaf and hard of hearing individuals need accommodations that transform aural to visual information, such as captions that are generated in real-time to enhance their access to spoken information in lectures and other live events. The captions produced by professional captionists work well in general events such as community or legal meetings, but are often unsatisfactory in specialized content events such as higher education classrooms. In addition, it is hard to hire professional captionists, especially those with experience in specialized content areas, as they are scarce and expensive. The captions produced by commercial automatic speech recognition (ASR) software are far cheaper, but are often perceived as unreadable due to ASR's sensitivity to accents and background noise and its slow response time. We ran a study to evaluate the readability of captions generated by a new crowd captioning approach versus professional captionists and ASR. In this approach, captions are typed by classmates into a system that aligns and merges the multiple incomplete caption streams into a single, comprehensive real-time transcript. Our study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, ASR, and crowd captioning software, respectively, and found that the readers preferred crowd captions over professional captions and ASR.

25 citations


Patent
31 Dec 2012
TL;DR: In this article, the system accepts captioning data and determines a number of errors in the caption data, as well as the number of words per minute across the entirety of an event corresponding to the captioning and time intervals of the event.
Abstract: A captioning evaluation system. The system accepts captioning data and determines a number of errors in the captioning data, as well as the number of words per minute across the entirety of an event corresponding to the captioning data and time intervals of the event. The errors may be used to determine the accuracy of the captioning and the words per minute, both for the entire event and the time intervals, used to determine a cadence and/or rhythm for the captioning. The accuracy and cadence may be used to score the captioning data and captioner.
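The patent does not publish its scoring formula; the sketch below combines an accuracy measure with a words-per-minute cadence measure using an assumed weighting and target rate, purely for illustration:

```python
def evaluate_captioning(caption_words, reference_words, interval_wpms,
                        target_wpm=160.0, accuracy_weight=0.7):
    """Score captioning data from accuracy and cadence (illustrative sketch).

    The patent combines an error-based accuracy measure with words-per-minute
    cadence over the event and its time intervals; the exact formula is not
    public, so the target rate and weighting here are assumptions.
    """
    # Accuracy: fraction of reference words matched at the same position,
    # a crude stand-in for a real error count.
    hits = sum(1 for i, word in enumerate(reference_words)
               if i < len(caption_words) and caption_words[i] == word)
    accuracy = hits / len(reference_words) if reference_words else 0.0

    # Cadence: penalise intervals whose WPM strays far from the target rate.
    if interval_wpms:
        deviation = sum(abs(wpm - target_wpm) / target_wpm for wpm in interval_wpms)
        cadence = max(0.0, 1.0 - deviation / len(interval_wpms))
    else:
        cadence = 0.0

    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * cadence

score = evaluate_captioning(["the", "cat", "sat"],
                            ["the", "cat", "sat", "down"],
                            [150.0, 170.0, 140.0])
print(round(score, 3))  # ~0.8 with these assumed weights
```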

24 citations


Journal ArticleDOI
TL;DR: The findings do suggest that the images acted as procedural facilitators, triggering recall of vocabulary and details, in individuals who are d/Deaf and hard of hearing.
Abstract: Using a nonexperimental design, the researchers explored the effect of captioning as part of the writing process of individuals who are d/Deaf and hard of hearing. Sixty-nine d/Deaf and hard of hearing middle school students composed responses to four writing-to-learn activities in a word processor. Two compositions were revised and published with software that displayed texts as captions to digital images; two compositions were revised with a word processor and published on paper. Analysis showed increases in content-area vocabulary, text length, and inclusion of main ideas and details for texts revised in the captioning software. Given the nonexperimental design, it is not possible to determine the extent to which the results could be attributed to captioned revisions. However, the findings do suggest that the images acted as procedural facilitators, triggering recall of vocabulary and details.

17 citations


Patent
20 Nov 2012
TL;DR: In this paper, Cue points are developed with respect to a video and enhancement information is aligned with the cue points such that the cue point and the enhancement information may be maintained separate from the video and applied to any version of a video.
Abstract: Methods and apparatus are presented for providing enhancement information associated with video, for example subtitles or closed captions. Cue points are developed with respect to a video, and enhancement information is aligned with the cue points such that the cue points and enhancement information may be maintained separate from the video and applied to any version of a video. Some disclosed embodiments relate to using groups of volunteers to provide and edit enhancement information in a five-stage process. The volunteer groups may be operated in a crowdsourcing fashion.

15 citations


Patent
09 Jan 2012
TL;DR: In this article, a method is presented for generating a plurality of fragments based on the text, which are then used to convert the video data into a second format to be provided as an output based on the video data that was received.
Abstract: A method is provided in one example and includes receiving video data from a video source in a first format, where the video data includes associated text to be overlaid on the video data as part of a video stream. The method also includes generating a plurality of fragments based on the text. The fragments include respective regions having a designated time duration. The method also includes using the plurality of fragments to convert the video data into a second format to be provided as an output, which is based on the video data that was received. In more specific embodiments, the first format is associated with a Paint-On caption or a Roll-Up caption, and the second format is associated with a Pop-On caption. The first format can also be associated with subtitles.

Patent
19 Dec 2012
TL;DR: In this paper, a system and a method for using the system for targeted commerce in network broadcasting are provided. The system includes an interface device configured to receive a multimedia stream from a network, wherein the multimedia stream includes a closed captioning string and the interface device is further configured to process the multimedia stream by providing advertisements in the multimedia stream according to a correlation between the closed captioning string and a plurality of vendor keywords; and a viewing device configured to receive the processed multimedia stream and display it to a viewer.
Abstract: A system and a method for using the system for targeted commerce in network broadcasting are provided. The system includes an interface device configured to receive a multimedia stream from a network, wherein the multimedia stream includes a closed captioning string and wherein the interface device is further configured to process the multimedia stream by providing advertisements in the multimedia stream according to a correlation between the closed captioning string and a plurality of vendor keywords; and a viewing device configured to receive the processed multimedia stream and display it to a viewer.
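A minimal sketch of the correlation step, assuming each advertisement carries a set of vendor keywords that are matched against the current caption text:

```python
def pick_advertisement(caption_text, vendor_keywords):
    """Pick the ad whose vendor keywords best match the caption text (sketch)."""
    words = set(caption_text.lower().split())
    best_ad, best_hits = None, 0
    for ad_id, keywords in vendor_keywords.items():
        hits = len(words & {k.lower() for k in keywords})
        if hits > best_hits:
            best_ad, best_hits = ad_id, hits
    return best_ad  # None when no keyword matches

ads = {"coffee-ad": {"coffee", "espresso"}, "car-ad": {"car", "engine", "drive"}}
print(pick_advertisement("i could really use a coffee or maybe an espresso", ads))
# -> "coffee-ad"
```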

Proceedings Article
11 Nov 2012
TL;DR: This demo enables the automatic creation of semantically annotated YouTube media fragments: a video is first ingested into the Synote system, and NERD is then used to extract named entities from the transcripts, which are temporally aligned with the video.
Abstract: This demo enables the automatic creation of semantically annotated YouTube media fragments. A video is first ingested into the Synote system, and a new method retrieves its associated subtitles or closed captions. Next, NERD is used to extract named entities from the transcripts, which are then temporally aligned with the video. The entities are disambiguated in the LOD cloud, and a user interface lets users browse the entities detected in a video or obtain more information. We evaluated our application with 60 videos from 3 YouTube channels.

Proceedings ArticleDOI
09 Sep 2012
TL;DR: The concept of re-speaking using only one re-speaker, with enhanced re-speaker tasks fully integrated into the recognition system and captioning software, is described, and a three-level evaluation method of the final re-speaker's skills is proposed.
Abstract: A novel approach to live captioning through re-speaking is introduced in this paper. We describe our concept of re-speaking using only one re-speaker, with enhanced re-speaker tasks fully integrated into the recognition system and captioning software. New techniques for instant correction of recognition output, punctuation mark introduction, and new word addition are presented. Our real-time recognition system for the Czech language, with a vocabulary containing more than one million words, is described, and the architecture of the captioning system that we operate is illustrated. The last part of the paper is dedicated to the re-speaker training methodology, and a three-level evaluation method of the final re-speaker's skills is proposed.

Proceedings ArticleDOI
13 Nov 2012
TL;DR: The experimental results show that the proposed method for constructing an N-gram language model based on multi-word expressions from web retrieval results can improve the recognition performance and the closed-captioning accuracy.
Abstract: Automatic speech recognition is generally used to align closed caption text to video data, and it is important to increase the speech recognition accuracy for accurate closed-captioning. This paper proposes a method for constructing an N-gram language model based on multi-word expressions (MWEs) from web retrieval results to improve speech recognition performance. A web retrieval experiment examining the distribution of web counts for MWEs and a speech recognition experiment investigating the effectiveness of MWEs are conducted. The experimental results show that the proposed method can improve the recognition performance and the closed-captioning accuracy.
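The paper's language-model construction is not detailed in the abstract; the sketch below only illustrates the underlying idea of treating MWEs as single tokens when collecting N-gram counts (the MWE list is passed in directly rather than retrieved from the web):

```python
from collections import Counter

def bigram_counts_with_mwes(sentences, mwes):
    """Count bigrams after merging multi-word expressions into single tokens.

    Rough sketch of the underlying idea: MWEs (here given directly rather
    than gathered from web retrieval results) are treated as single units
    so that the N-gram model captures them reliably.
    """
    def merge(tokens):
        merged, i = [], 0
        while i < len(tokens):
            for mwe in sorted(mwes, key=len, reverse=True):  # longest match first
                if tuple(tokens[i:i + len(mwe)]) == mwe:
                    merged.append("_".join(mwe))
                    i += len(mwe)
                    break
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + merge(sentence.lower().split()) + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

mwes = [("closed", "captioning"), ("speech", "recognition")]
print(bigram_counts_with_mwes(["speech recognition aligns closed captioning text"], mwes))
```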

Proceedings ArticleDOI
04 Jul 2012
TL;DR: Three important new enhancements to Synote are explained: crowdsourcing correction of speech recognition errors allows for sustainable captioning of lectures, while the development of an integrated mobile speech recognition application enables synchronized live verbal contributions from the class to also be captured through captions.
Abstract: This paper explains three new important enhancements to Synote, the freely available, award winning, open source, web based application that makes web hosted recordings easier to access, search, manage, and exploit for learners, teachers and other users. Synote uniquely achieves this through the creation of synchronized notes, bookmarks, tags, links, images and text captions, enabling users to easily find, or associate their notes or resources with, any part of a recording available on the web. Students surveyed would like to be able to access all their lectures through Synote. The facility to convert and import narrated PowerPoint PPTX files means that teachers can capture their lectures without requiring institution-wide expensive lecture capture systems. Crowdsourcing correction of speech recognition errors allows for sustainable captioning of the lecture while the development of an integrated mobile speech recognition application enables synchronized live verbal contributions from the class to also be captured through captions.

Patent
28 Dec 2012
TL;DR: In this article, a content receiver detects that a volume of an audio of a video presentation has been adjusted by a user, and determines the adjusted audio volume level that results from the adjustment.
Abstract: Methods and apparatus are provided for control of closed captioning based on an audio volume level. A content receiver detects that the volume of the audio of a video presentation has been adjusted by a user, and determines the adjusted audio volume level that results from the adjustment. The content receiver compares the resulting adjusted audio volume level to a threshold level. When the content receiver determines that the adjusted audio volume level is under the threshold level, it enables closed captioning of the video presentation, thus presenting the user with both audio and closed captioning. When the content receiver determines that the adjusted audio volume level is above the threshold level, it disables closed captioning for the video presentation. The content receiver may use a microphone to determine the adjusted audio volume level.
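A minimal sketch of the described behaviour, with an assumed threshold value:

```python
class CaptionController:
    """Toggle closed captioning from the adjusted audio volume level.

    Minimal sketch of the behaviour described in the patent: captions are
    enabled when the user turns the volume below a threshold and disabled
    when it is raised above it.  The threshold value is an assumption.
    """
    def __init__(self, threshold=0.15):
        self.threshold = threshold
        self.captions_enabled = False

    def on_volume_adjusted(self, volume_level):
        self.captions_enabled = volume_level < self.threshold
        return self.captions_enabled

receiver = CaptionController()
print(receiver.on_volume_adjusted(0.05))   # near-muted volume -> captions on
print(receiver.on_volume_adjusted(0.60))   # normal volume     -> captions off
```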

Patent
24 Apr 2012
TL;DR: In this paper, closed captioning information can be toggled on/off using menu options and preferences as well as automatically managed by intelligently monitoring the environment surrounding a device.
Abstract: Media content typically includes closed captioning information such as subtitles in domestic and foreign languages. Techniques and mechanisms provide that closed captioning information may be toggled on/off using menu options and preferences as well as automatically managed by intelligently monitoring the environment surrounding a device. Device sensors such as microphones and vibration monitors determine the noise level of an environment as well as the spectral characteristics of the noise to determine whether the noise profile would interfere with the video playback experience. A particular environmental noise profile could automatically trigger the display of closed captioning information or present an easy access, otherwise unavailable toggle to display closed captioning information associated with a video stream.
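As a sketch of the automatic trigger, assuming raw microphone samples are available (the spectral analysis mentioned in the abstract is omitted and the threshold is an assumption):

```python
import math

def should_show_captions(mic_samples, rms_threshold=0.2):
    """Decide whether ambient noise warrants showing captions.

    Sketch of the idea in the patent: device sensors measure the noise
    level around the viewer, and captions are offered automatically when
    it would interfere with playback.  Only the RMS level is used here.
    """
    if not mic_samples:
        return False
    rms = math.sqrt(sum(s * s for s in mic_samples) / len(mic_samples))
    return rms >= rms_threshold

quiet_room = [0.01, -0.02, 0.015, -0.01]
noisy_cafe = [0.4, -0.35, 0.5, -0.45]
print(should_show_captions(quiet_room), should_show_captions(noisy_cafe))
# -> False True
```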

Book ChapterDOI
03 Sep 2012
TL;DR: This paper describes the recognition system design with advanced re-speaker interaction, the distributed captioning system architecture, and neglected re-speaker training; some evaluation of skilled re-speakers is also presented.
Abstract: In this paper we introduce our complete solution for captioning of live TV programs used by Czech Television, the public service broadcaster in the Czech Republic. Live captioning using speech recognition and re-speaking is on the increase and widely used, for example at the BBC; however, many specific issues have to be solved each time a new captioning system is put in operation. Our concept of re-speaking assumes a complex integration of the re-speaker's skills, not only verbatim repetition with fully automatic processing. This paper describes the recognition system design with advanced re-speaker interaction, the distributed captioning system architecture, and neglected re-speaker training. Some evaluation of our skilled re-speakers is presented too.

Patent
13 Mar 2012
TL;DR: In this article, an approach for providing synchronized playback of media streams and corresponding closed captions is described, where one or more portions of a media stream and closed caption data is received, at a virtual video server resident on a user device, from an external video server.
Abstract: An approach for providing synchronized playback of media streams and corresponding closed captions is described. One or more portions of a media stream and corresponding closed caption data is received, at a virtual video server resident on a user device, from an external video server. The one or more portions of the media stream and the corresponding closed caption data is buffered by the virtual video server. The one or more portions of the media stream is delivered to a video player application and the corresponding closed caption data is delivered to a rendering application as to synchronize playback of the one or more portions of the media stream and the corresponding closed caption data by the respective applications, wherein the video player application and the rendering application are resident on the user device.

Patent
21 Sep 2012
TL;DR: In this paper, the automatic processing and indexing of video and audio source files including the automatic generation of and maintenance of video, audio, concordance, text and closed caption files corresponding to the media content of the source files.
Abstract: The automatic processing and indexing of video and audio source files including the automatic generation of and maintenance of video, audio, concordance, text and closed caption files corresponding to the media content of the source files. Generating and maintaining the files in such a way that the content of these files remains aligned so that the timing synchronization of the audio, the video, the text and closed caption information during play back is strictly maintained, even after text is edited and/or translated to another language.

Patent
30 Dec 2012
TL;DR: In this article, a method and apparatus for content augmentation in an audio video system is presented, which concerns storing embedded data, such as close captioning or metadata, and displaying that embedded data concerning a past event in response to a user request.
Abstract: The present invention concerns a method and apparatus for content augmentation in an audio video system. In particular, the invention concerns storing embedded data, such as close captioning or metadata, and displaying that embedded data concerning a past event in response to a user request. The user request way be received from a remote control, via voice recognition, or facial recognition. In addition, the apparatus is operative to facilitate the viewer to scroll through buffered embedded data independent of any video being displayed. Thus the viewer may review closed captioning information for video which had previously been displayed.
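A minimal sketch of buffering embedded caption data for later review, with assumed buffer and page sizes:

```python
from collections import deque

class CaptionHistory:
    """Buffer embedded caption data so a viewer can scroll back through it.

    Sketch of the content-augmentation idea in the patent: recently
    received caption lines are kept in a bounded buffer independent of
    the video being displayed, and a user request returns a page of past
    lines.  Buffer size and page size are assumptions.
    """
    def __init__(self, max_lines=500):
        self.lines = deque(maxlen=max_lines)

    def on_caption(self, timestamp_s, text):
        self.lines.append((timestamp_s, text))

    def review(self, page=0, page_size=5):
        """Return one page of past captions, newest page first."""
        ordered = list(self.lines)[::-1]
        return ordered[page * page_size:(page + 1) * page_size]

history = CaptionHistory()
for t in range(12):
    history.on_caption(t, f"caption line {t}")
print(history.review(page=0))   # the five most recent lines
```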

Patent
27 Apr 2012
TL;DR: In this article, closed captioning, social media content, and tags associated with various media segments are analyzed to allow identification of particular entities depicted in the different media segments, and image recognition and audio recognition algorithms are also performed to further identify entities or validate results from the analysis of metadata.
Abstract: Mechanisms are provided to allow for improved media content navigation. Metadata such as closed captioning, social media content, and tags associated with various media segments are analyzed to allow identification of particular entities depicted in the various media segments. Image recognition and audio recognition algorithms can also be performed to further identify entities or validate results from the analysis of metadata.

Proceedings ArticleDOI
Ira R. Forman, Ben J. Fletcher, John G. Hartley, Bill Rippon, Allen Keith Wilson
22 Oct 2012
TL;DR: The system that was developed for personal computers is explained and the experiments to include mobile devices and multi-participant meeting rooms are described.
Abstract: Blue Herd is a project in IBM Research to investigate automated captioning for videoconferences. Today videoconferences are held among meeting participants connected with a variety of devices: personal computers, mobile devices, and multi-participant meeting rooms. Blue Herd is charged with studying automated real-time captioning in that context. This poster explains the system that was developed for personal computers and describes our experiments to include mobile devices and multi-participant meeting rooms.

01 Jan 2012
TL;DR: Results indicated that while captions may aid one in comprehension, they also tend to limit one’s interpretations, reaffirming the nature of written language as an authoritative source of information.
Abstract: This paper investigates the effectiveness of closed captioning in aiding Saudi students who are learning ESL (English as a second language). Research was carried out in a qualitative manner, and participants were 12 Saudi students pursuing their studies at Indiana University of Pennsylvania, USA (IUP). Participants in the study were asked to compose a narrative after viewing a 5-minute film segment, both with and without captioning. Their responses were then analyzed, and results indicated that while captions may aid one in comprehension, they also tend to limit one’s interpretations, reaffirming the nature of written language as an authoritative source of information.

Book ChapterDOI
11 Jul 2012
TL;DR: Three new important enhancements to Synote, the freely available, award winning, open source, web based application that makes web hosted recordings easier to access, search, manage, and exploit for learners, teachers and other users are explained.
Abstract: This paper explains three new important enhancements to Synote, the freely available, award winning, open source, web based application that makes web hosted recordings easier to access, search, manage, and exploit for learners, teachers and other users. The facility to convert and import narrated PowerPoint PPTX files means that teachers can capture and caption their lectures without requiring institution-wide expensive lecture capture or captioning systems. Crowdsourcing correction of speech recognition errors allows for sustainable captioning of any originally uncaptioned lecture while the development of an integrated mobile speech recognition application enables synchronized live verbal contributions from the class to also be captured through captions.

Patent
12 Nov 2012
TL;DR: In this paper, a closed caption content search and ranking system is proposed to allow for content discovery: search mechanisms analyze titles, descriptions, social media content, metadata, etc., and intelligently organize content for presentation to a viewer.
Abstract: Mechanisms are provided to allow for content discovery using closed caption content search and ranking. Search mechanisms analyze titles, descriptions, social media content, metadata, etc., and intelligently organize content for presentation to a viewer. Image recognition and audio recognition algorithms can also be performed to further identify entities or validate results from the analysis of metadata. Other closed captioning content may be analyzed to determine the relevance of a piece of media content to a particular search term found in the piece of media content. Results are ranked based on the prominence of search and related terms in titles, descriptions, and closed caption contents along with the popularity of the media content itself.
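A sketch of the ranking idea, with assumed field weights and popularity scaling:

```python
def rank_media(query, items, weights=(3.0, 2.0, 1.0)):
    """Rank media items by how prominently the query terms appear.

    Sketch of the ranking described in the patent: occurrences in the
    title, the description, and the closed caption text are weighted
    differently and combined with a popularity signal.  Field weights
    and the popularity scaling are assumptions.
    """
    terms = query.lower().split()
    title_w, desc_w, caption_w = weights
    scored = []
    for item in items:
        score = 0.0
        for term in terms:
            score += title_w * item["title"].lower().count(term)
            score += desc_w * item["description"].lower().count(term)
            score += caption_w * item["captions"].lower().count(term)
        score *= 1.0 + item.get("popularity", 0.0)
        scored.append((score, item["title"]))
    return [title for score, title in sorted(scored, reverse=True)]

items = [
    {"title": "Cooking pasta", "description": "A quick dinner",
     "captions": "boil the pasta for ten minutes", "popularity": 0.2},
    {"title": "Engine repair", "description": "Fixing a pasta maker",
     "captions": "no pasta here", "popularity": 0.9},
]
print(rank_media("pasta", items))
```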

Proceedings ArticleDOI
20 Mar 2012
TL;DR: The paper discusses design and implementation issues for several modules like video scaler, video captioning and also the generation of video outputs signals (VGA or composite PAL-M) and implementation results using a FPGA-based hardware platform.
Abstract: In this paper a video processing architecture for use in a set top box (STB) compatible with the Brazilian Digital Television System (SBTVD) is presented. After the decoding process, a video frame is stored in the STB memory and is scanned by the output subsystem while executing several operations in order to fit the external display. The paper discusses design and implementation issues for several modules such as the video scaler, video captioning, and the generation of video output signals (VGA or composite PAL-M). Implementation results using an FPGA-based hardware platform are also provided. The goal is to go to silicon implementation after the FPGA validation phase.

Patent
19 Oct 2012
TL;DR: In this article, the authors propose a system for providing a media content item to a user, where the user profile information associated with the user of the device is stored on the device, and a supplementary information requester is used to transmit the identifying data which identifies the media content items, the user profiles and a request for supplementary information related to the media contents to a server over a network.
Abstract: A device 10 for providing a media content item to a user comprises: a media content item receiver operable to receive the media content item as a data stream MED, where the data stream may be a TV programme which comprises identifying data. The device stores user profile information associated with the user of the device, and further includes a supplementary information requester operable to transmit the identifying data which identifies: the media content item, the user profile information, and a request for supplementary information related to the media content item to a server over a network. The device receives supplementary information SI related to the media content item from the server over the network, such as sub-titles, closed captioning or foreign language dubbing. The type of supplementary information is determined by the server on the basis of the transmitted identifying data which identifies the media content item and the transmitted user profile information. The received supplementary information is transmitted to a separate personal user device 50, such as headphones 50C, a tablet computer 50B or glasses 50A, as specified by the stored user profile information.