
Showing papers on "Closed captioning published in 2013"


Proceedings ArticleDOI
01 Dec 2013
TL;DR: Recent improvements to YouTube's original automatic closed-caption generation system are described, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories.
Abstract: YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009, YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence trained DNN results for this task.
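
The “island of confidence” heuristic is only named in the abstract, not specified; below is a minimal sketch of one plausible reading, assuming each transcript word carries an alignment confidence score. The threshold and minimum island length are illustrative, not values from the paper.

```python
def confidence_islands(aligned_words, min_conf=0.9, min_len=10):
    """Select contiguous runs ("islands") of high-confidence words.

    aligned_words: list of (word, confidence) pairs, e.g. from aligning an
    ASR hypothesis against an owner-uploaded transcript. Returns word lists
    to keep as semi-supervised training segments. Thresholds are illustrative.
    """
    islands, current = [], []
    for word, conf in aligned_words:
        if conf >= min_conf:
            current.append(word)
        else:
            if len(current) >= min_len:
                islands.append(current)
            current = []
    if len(current) >= min_len:
        islands.append(current)
    return islands
```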

230 citations


Journal ArticleDOI
TL;DR: The study tested a keyword captioning method based on the hypothesis that keyword captions present learners with less to read without attenuating their comprehension of the information in the spoken message, and confirmed the positive effect of both keyword and full text captions on comprehension.
Abstract: The common practice in captioning video programs for foreign language instruction is to transcribe the spoken language verbatim into captions. This practice presents a dense visual channel for foreign language learners. The study presented here tested a keyword captioning method based on the hypothesis that keyword captions present learners with less to read without attenuating their comprehension of the information in the spoken message. The design of the experiment was simple: three different amounts of text on video were compared: full text, keywords, and no text. The results of the experiment showed that the keyword captions group outperformed the no-text group and that the full text captions group outperformed the keyword captions group; however, a post-hoc analysis revealed no significant difference between the means of the full text captions group and the keyword captions group. The basic research hypothesis, that both keyword and full text captions have a positive effect on comprehension, is thus confirmed.

152 citations


Journal ArticleDOI
TL;DR: This article investigated the extent to which the relationship between the native and target language affects the caption-reading behavior of foreign language learners, and found that time spent on captions differed significantly by language.
Abstract: This study investigates caption-reading behavior by foreign language (L2) learners and, through eye-tracking methodology, explores the extent to which the relationship between the native and target language affects that behavior. Second-year (4th semester) English-speaking learners of Arabic, Chinese, Russian, and Spanish watched 2 videos differing in content familiarity, each dubbed and captioned in the target language. Results indicated that time spent on captions differed significantly by language: Arabic learners spent more time on captions than learners of Spanish and Russian. A significant interaction between language and content familiarity occurred: Chinese learners spent less time on captions in the unfamiliar content video than the familiar, while others spent comparable times on each. Based on dual-processing and cognitive load theories, we posit that the Chinese learners experienced a split-attention effect when verbal processing was difficult and that, overall, captioning benefits during the 4th semester of language learning are constrained by L2 differences, including differences in script, vocabulary knowledge, concomitant L2 proficiency, and instructional methods. Results are triangulated with qualitative findings from interviews.

137 citations


Journal ArticleDOI
TL;DR: Two distinct methods of SR-mediated lecture acquisition, real-time captioning and postlecture transcription, were evaluated in situ in life and social sciences lecture courses employing typical classroom equipment, to assist students by automatically converting oral lectures into text.
Abstract: Speech recognition (SR) technologies were evaluated in different classroom environments to assist students by automatically converting oral lectures into text. Two distinct methods of SR-mediated lecture acquisition (SR-mLA), real-time captioning (RTC) and postlecture transcription (PLT), were evaluated in situ in life and social sciences lecture courses employing typical classroom equipment. Both methods were compared according to technical feasibility and reliability of classroom implementation, instructors' experiences, word recognition accuracy, and student class performance. RTC provided near-instantaneous display of the instructor's speech for students during class. PLT employed a user-independent SR algorithm to optimally generate multimedia class notes with synchronized lecture transcripts, instructor audio, and class PowerPoint slides for students to access online after class. PLT resulted in greater word recognition accuracy than RTC. During a science course, students were more likely to take optional online quizzes and received higher quiz scores with PLT than when multimedia class notes were unavailable. Overall class grades were also higher when multimedia class notes were available. The potential benefits of SR-mLA for students who have difficulty taking notes accurately and independently were discussed, particularly for nonnative English speakers and students with disabilities. Field-tested best practices for optimizing SR accuracy for both SR-mLA methods were outlined.

72 citations


Patent
19 Aug 2013
TL;DR: In this paper, a method for collaboratively captioning streamed media is proposed, which includes rendering a visual representation of the audio at a first device, receiving segment parameters for a first media segment from the first device.
Abstract: A method for collaboratively captioning streamed media, the method including: rendering a visual representation of the audio at a first device, receiving segment parameters for a first media segment from the first device, rendering the visual representation of the audio at a second device, the second device different from the first device, and receiving a caption for the first media segment from the second device.

50 citations


Journal ArticleDOI
01 May 2013 - ReCALL
TL;DR: Through employment of the CRT, instructors are able to evaluate the degree to which learners rely on the caption supports and thus make informed decisions regarding learners’ requirements and utilization of captions as a multimedia learning support.
Abstract: Listening comprehension in a second language (L2) is a complex and particularly challenging task for learners. Because of this, L2 learners and instructors alike employ different learning supports as assistance. Captions in multimedia instruction readily provide support and thus have been an ever-increasing focus of many studies. However, captions must eventually be removed, as the goal of language learning is participation in the target language, where captions are not typically available. Consequently, this creates a dilemma, particularly for language instructors, as to the usage of captioning supports, as early removal may cause frustration, while late removal may create learning interference. Accordingly, the goal of the current study was to propose and employ a testing instrument, the Caption Reliance Test (CRT), which evaluates individual learners’ reliance on captioning in second language learning environments, giving a clear indication of the learners’ reliance on captioning and mirroring their support needs. Thus, the CRT was constructed, comprising an auditory track accompanied by congruent textual captions, as well as particular incongruent textual words, to provide a means for testing. It was subsequently employed in an empirical study involving English as a Foreign Language (EFL) high school students. The results exhibited individual variances in the degree of reliance and, more importantly, exposed a negative correlation between caption reliance and L2 achievement. In other words, learners’ reliance on captions varies individually and lower-level achievers rely on captions for listening comprehension more than their high-level counterparts, indicating that learners at various comprehension levels require different degrees of caption support. Thus, through employment of the CRT, instructors are able to evaluate the degree to which learners rely on the caption supports and thus make informed decisions regarding learners’ requirements and utilization of captions as a multimedia learning support.

46 citations


Proceedings ArticleDOI
21 Oct 2013
TL;DR: A fully automatic system is presented, from raw data gathering to navigation over heterogeneous news sources, able to extract and study the trend of topics in the news and to detect interesting peaks in news coverage over the life of a topic.
Abstract: We present a fully automatic system from raw data gathering to navigation over heterogeneous news sources, including over 18k hours of broadcast video news, 3.58M online articles, and 430M public Twitter messages. Our system addresses the challenge of extracting "who," "what," "when," and "where" from a truly multimodal perspective, leveraging audiovisual information in broadcast news and those embedded in articles, as well as textual cues in both closed captions and raw document content in articles and social media. Performed over time, we are able to extract and study the trend of topics in the news and detect interesting peaks in news coverage over the life of the topic. We visualize these peaks in trending news topics using automatically extracted keywords and iconic images, and introduce a novel multimodal algorithm for naming speakers in the news. We also present several intuitive navigation interfaces for interacting with these complex topic structures over different news sources.
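
The peak detector itself is not described in the abstract; the sketch below assumes coverage is summarized as a daily mention count per topic and flags days that exceed a trailing-window mean by a few standard deviations. The window size and the factor k are illustrative, not taken from the paper.

```python
from statistics import mean, stdev

def coverage_peaks(daily_counts, window=7, k=2.0):
    """Flag days whose mention count exceeds the trailing-window mean by k std devs.

    daily_counts: list of per-day mention counts for one news topic.
    An illustrative detector, not the one used by the system described above.
    """
    peaks = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if daily_counts[i] > mu + k * sigma:
            peaks.append(i)
    return peaks

# Example: a quiet week followed by a burst of coverage on day 7.
print(coverage_peaks([3, 4, 2, 5, 3, 4, 3, 40]))  # -> [7]
```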

28 citations


Proceedings Article
01 Jun 2013
TL;DR: This paper describes an improved method for combining partial captions into a final output based on weighted A* search and multiple sequence alignment (MSA), which outperforms the current state-of-the-art on Word Error Rate, BLEU Score, and F-measure.
Abstract: The primary way of providing real-time captioning for deaf and hard of hearing people is to employ expensive professional stenographers who can type as fast as natural speaking rates. Recent work has shown that a feasible alternative is to combine the partial captions of ordinary typists, each of whom types part of what they hear. In this paper, we describe an improved method for combining partial captions into a final output based on weighted A* search and multiple sequence alignment (MSA). In contrast to prior work, our method allows the tradeoff between accuracy and speed to be tuned, and provides formal error bounds. Our method outperforms the current state-of-the-art on Word Error Rate (WER) (29.6%), BLEU Score (41.4%), and F-measure (36.9%). The end goal is for these captions to be used by people, and so we also compare how these metrics correlate with the judgments of 50 study participants, which may assist others looking to make further progress on this problem.
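
The weighted A* / MSA combiner itself is not reproduced here; the sketch below is a deliberately simpler greedy merge, using Python's difflib, that only illustrates the underlying idea of stitching overlapping partial captions into a single stream. The function names and the example data are invented.

```python
from difflib import SequenceMatcher

def merge_two(ref, other):
    """Greedily merge two word lists, keeping matched words once and
    inserting words that only the second list contains."""
    merged = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=other, autojunk=False).get_opcodes():
        if tag in ("equal", "delete"):   # words already in the running hypothesis
            merged.extend(ref[i1:i2])
        elif tag == "insert":            # words only the new partial caption has
            merged.extend(other[j1:j2])
        else:                            # "replace": keep the running hypothesis
            merged.extend(ref[i1:i2])
    return merged

def combine_partial_captions(partials):
    """Fold a list of partial captions (each a list of words) into one caption."""
    merged = partials[0]
    for words in partials[1:]:
        merged = merge_two(merged, words)
    return merged

# Example: three typists each caught only part of the utterance.
print(" ".join(combine_partial_captions([
    "the quick brown".split(),
    "quick brown fox jumps".split(),
    "fox jumps over the lazy dog".split(),
])))  # -> "the quick brown fox jumps over the lazy dog"
```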

23 citations


Patent
24 May 2013
TL;DR: In this paper, a real-time captioning system for speech is described, where workers are asked to transcribe all or portions of what they receive and then the transcriptions received from each worker are aligned and combined to create a resulting caption.
Abstract: Methods and systems for captioning speech in real-time are provided. Embodiments utilize captionists, who may be non-expert captionists, to transcribe a speech using a worker interface. Each worker is provided with the speech or portions of the speech, and is asked to transcribe all or portions of what they receive. The transcriptions received from each worker are aligned and combined to create a resulting caption. Automated speech recognition systems may be integrated by serving in the role of one or more workers, or integrated in other ways. Workers may work locally (able to hear the speech) and/or workers may work remotely, the speech being provided to them as an audio stream. Worker performance may be measured and used to provide feedback into the system such that overall performance is improved.

18 citations


Proceedings ArticleDOI
21 Oct 2013
TL;DR: The Crowd Caption Correction (CCC) feature (and service) addresses this issue by allowing meeting participants or third party individuals to make corrections to captions in realtime during a meeting.
Abstract: Captions can be found in a variety of media, including television programs, movies, webinars and telecollaboration meetings. Although very helpful, captions sometimes have errors, such as misinterpretations of what was said, missing words and misspellings of technical terms and proper names. Due to the labor intensive nature of captioning, caption providers may not have the time or in some cases, the background knowledge of meeting content, that would be needed to correct errors in the captions. The Crowd Caption Correction (CCC) feature (and service) addresses this issue by allowing meeting participants or third party individuals to make corrections to captions in realtime during a meeting. Additionally, the feature also uses the captions to create a transcript of all captions broadcast during the meeting, which users can save and reference both during the meeting and at a later date. The feature will be available as a part of the Open Access Tool Tray System (OATTS) suite of open source widgets developed under the University of Wisconsin-Madison Trace Center Telecommunications RERC. The OATTS suite is designed to increase access to information during telecollaboration for individuals with a variety of disabilities.

16 citations


Journal ArticleDOI
TL;DR: An architecture for machine translation to Brazilian sign language (LIBRAS) is proposed, together with its integration, implementation and evaluation for digital TV systems, a real-time and open-domain scenario; the preliminary evaluation indicated that the proposal is efficient, as its delays and bandwidth are low.
Abstract: Deaf people have serious difficulties accessing information. The support for sign language (their primary means of communication) is rarely addressed in information and communication technologies. Furthermore, there is a lack of works related to machine translation for sign language in real-time and open-domain scenarios, such as TV. To minimize these problems, in this paper, we propose an architecture for machine translation to Brazilian sign language (LIBRAS) and its integration, implementation and evaluation for digital TV systems, a real-time and open-domain scenario. The system, called LibrasTV, allows the LIBRAS windows to be generated and displayed automatically from a closed caption input stream in Brazilian Portuguese. LibrasTV also uses some strategies, such as a low-time-consuming text-to-gloss machine translation and LIBRAS dictionaries, to minimize the computational resources needed to generate the LIBRAS windows in real-time. As a case study, we implemented a prototype of LibrasTV for the Brazilian digital TV system and performed some tests with Brazilian deaf users to evaluate it. Our preliminary evaluation indicated that the proposal is efficient, as long as its delays and bandwidth are low. In addition, as previously mentioned in the literature, avatar-based approaches are not the first choice for the majority of deaf users, who prefer human translation. However, when human interpreters are not available, our proposal is presented as a practical and feasible alternative to fill this gap.
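
LibrasTV's actual text-to-gloss rules and dictionaries are not given in the abstract; the sketch below only illustrates the general shape of a lightweight, dictionary-based text-to-gloss step. The word list, the dropped function words, and the fingerspelling fallback are invented for the example.

```python
import re

# Hypothetical Portuguese-word -> LIBRAS-gloss dictionary (illustrative only).
GLOSS_DICT = {"casa": "CASA", "bonita": "BONITO", "ela": "ELA", "comprou": "COMPRAR"}
# Function words that a gloss representation typically drops (illustrative).
STOP_WORDS = {"a", "o", "de", "em", "uma", "um"}

def text_to_gloss(caption_line):
    """Convert a closed-caption line into a sequence of LIBRAS glosses.

    Unknown words fall back to an uppercase form (standing in for
    fingerspelling); real systems would use much richer rules.
    """
    tokens = re.findall(r"\w+", caption_line.lower())
    return [GLOSS_DICT.get(t, t.upper()) for t in tokens if t not in STOP_WORDS]

print(text_to_gloss("Ela comprou uma casa bonita"))  # ['ELA', 'COMPRAR', 'CASA', 'BONITO']
```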

Patent
19 Dec 2013
TL;DR: In this article, the authors present a system for allowing a user with a Smartphone to pair the Smartphone with another Bluetooth device to receive audio that is played to the user over headphones or through speakers on the smartphone.
Abstract: Systems and methods are provided for allowing a user with a Smartphone to pair the Smartphone with another Bluetooth device to receive audio that is played to the user over headphones or through speakers on the Smartphone. Further, an audio processing module is used to modify the audio presented to the user, extract closed captioning text to be displayed to the user, find information relevant to the audio to be displayed to the user, and pause audio content sent to the Smartphone when phone calls or other Smartphone interruptions occur.

Proceedings ArticleDOI
13 May 2013
TL;DR: Legion Scribe (Scribe) is introduced, which allows 3-5 ordinary people who can hear and type to collectively caption speech in real time, with accuracy approaching that of a professional stenographer while its latency and cost are dramatically lower.
Abstract: Real-time captioning provides people who are deaf or hard of hearing access to aural speech in the classroom and at live events. The only reliable approach currently is to recruit a local or remote expert stenographer who is able to type at natural speaking rates, charges more than $100 USD per hour, and must be scheduled in advance. We introduce Legion Scribe (Scribe), which allows 3-5 ordinary people who can hear and type to collectively caption speech in real-time together. Each individual is unable to type at natural speaking rates, and so each is only asked to type part of what they hear. Scribe computationally stitches the partial captions together to form a final caption stream. We have shown that the accuracy of Scribe captions approaches that of a professional stenographer, while its latency and cost are dramatically lower.

Patent
10 May 2013
TL;DR: Disclosed is a system and method for analyzing, by a server computer, closed captioning text associated with a media program being experienced by a user having a client device.
Abstract: Disclosed is a system and method for analyzing, by a server computer, closed captioning text associated with a media program being experienced by a user having a client device. The server computer obtains, based on the analyzing, a subject matter of a portion of the media program from the closed captioning text. The server computer constructs a query associated with the determined subject matter and submits the query to a computer network as a search query. The server computer receives, in response to the submitting of the query, content relating to the subject matter and measures an elapsed time period between the receiving of the content and the obtaining of the subject matter. If the elapsed time period is less than a predetermined period of time, the server computer communicates, to the client device, information related to the content.
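
A minimal sketch of the timing gate described in the claim, assuming a monotonic timestamp is taken when the subject matter is extracted from the captions and checked again when search results arrive. The threshold and function names are illustrative, not from the patent.

```python
import time

FRESHNESS_WINDOW_S = 30.0  # illustrative "predetermined period of time"

def maybe_push_to_client(subject_obtained_at, content, send_fn):
    """Send related content to the client only if it arrived soon enough
    after the subject matter was obtained from the closed captions."""
    elapsed = time.monotonic() - subject_obtained_at
    if elapsed < FRESHNESS_WINDOW_S:
        send_fn(content)
        return True
    return False  # too stale: the program has likely moved on

# Usage sketch (client.send is hypothetical):
# t0 = time.monotonic()                 # when the subject matter was obtained
# ... construct query, submit it, await results ...
# maybe_push_to_client(t0, results, client.send)
```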

Patent
05 Apr 2013
TL;DR: In this article, the system and methods of processing closed captions for live streams are disclosed, where closed captioning data associated with a live video stream is represented in a first format.
Abstract: Systems and methods of processing closed captions for live streams are disclosed. For example, a media server may receive closed captioning data associated with a live video stream, where the closed captioning data is represented in a first format. The media server may convert the closed captioning data from the first format to a platform-independent format and convert the closed captioning data from the platform-independent format to a second format. The media server may transmit the closed captioning data in the second format to a destination device.
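
The platform-independent representation is only named in the abstract; the sketch below assumes a minimal intermediate cue (text plus start/end seconds) and shows one output path from that intermediate form to WebVTT as the "second format". The field names and the choice of WebVTT are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    """Assumed minimal platform-independent caption cue."""
    start: float  # seconds
    end: float    # seconds
    text: str

def _ts(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def cues_to_webvtt(cues):
    """Serialize intermediate cues to a WebVTT document."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        lines.append(f"{_ts(cue.start)} --> {_ts(cue.end)}")
        lines.append(cue.text)
        lines.append("")
    return "\n".join(lines)

print(cues_to_webvtt([Cue(0.0, 2.5, "Hello, and welcome back."),
                      Cue(2.5, 5.0, "Tonight's top story...")]))
```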

Patent
04 Oct 2013
TL;DR: In this article, techniques for live-writing and editing closed captions are described; they may be realized as a method for generating captions for a live broadcast comprising the steps of receiving audio data corresponding to words spoken as part of the live broadcast; analyzing the audio data with speech recognition software in order to generate unedited captions; and generating edited captions from the unedited captions, wherein the edited captions reflect edits made by a user.
Abstract: Techniques for live-writing and editing closed captions are disclosed. In one particular embodiment, the techniques may be realized as a method for generating captions for a live broadcast comprising the steps of receiving audio data, the audio data corresponding to words spoken as part of the live broadcast; analyzing the audio data with speech recognition software in order to generate unedited captions; and generating edited captions from the unedited captions, wherein the edited captions reflect edits made by a user. All of these steps may be performed during the live broadcast.

01 Jan 2013
TL;DR: An approach is presented that enables video scene classification and retrieval based on the Arabic closed-caption text present in the video; experiments show that the proposed framework is efficient for retrieving Arabic videos and for classifying Arabic video scenes into a set of eight predefined semantic categories.
Abstract: Vast volumes of digital video data have recently been generated in our daily lives. One of the most challenging problems is classifying and retrieving the desired information from huge collections of digital video. Consequently, closed-caption text has been utilized as an alternative means of enhancing video retrieval and classification. Some systems have been designed based on English closed captions; however, Arabic has not received as much attention as English and other European languages in this research. This paper adopts an approach that enables video scene classification and retrieval based on the Arabic closed-caption text present in the video. Experiments are performed over a prepared dataset collected from Arabic news videos and Arabic documentary films across different Arabic channels. The results show that the proposed framework is efficient for retrieving Arabic videos and also for classifying Arabic video scenes into a set of eight predefined semantic categories: politics, economics, sports, religion, social, tourism, weather, and health.
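
The paper's exact features and classifier are not spelled out in the abstract; the sketch below uses a TF-IDF plus linear SVM pipeline as a stand-in for a closed-caption scene classifier over the eight categories. The pipeline choice and the training data are assumptions, not the paper's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

CATEGORIES = ["politics", "economics", "sports", "religion",
              "social", "tourism", "weather", "health"]

def train_scene_classifier(caption_texts, labels):
    """Train a scene classifier from closed-caption text.

    caption_texts: list of caption strings, one per video scene.
    labels: list of category names drawn from CATEGORIES.
    TF-IDF + LinearSVC is an illustrative stand-in, not the paper's pipeline.
    """
    clf = make_pipeline(TfidfVectorizer(analyzer="word"), LinearSVC())
    clf.fit(caption_texts, labels)
    return clf

# Usage sketch (train_captions / train_labels are hypothetical data):
# clf = train_scene_classifier(train_captions, train_labels)
# clf.predict(["..."])  # -> e.g. ["sports"]
```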

Patent
24 Sep 2013
TL;DR: In this article, the authors describe methods and devices for allowing users to use portable computer devices such as smart phones to share microphone signals and/or closed captioning text generated by speech recognition processing of the microphone signals.
Abstract: Methods and devices are described for allowing users to use portable computer devices such as smart phones to share microphone signals and/or closed captioning text generated by speech recognition processing of the microphone signals. Under user direction, the portable devices exchange messages to form a signal sharing group to facilitate their conversation.

01 Feb 2013
TL;DR: Material for 15 languages will be created, including English, Spanish and Portuguese, but focus is placed on less widely taught languages, namely Estonian, Greek, Romanian and Polish, as well as minority languages, i.e. Basque, Catalan and Irish.
Abstract: Using audiovisual material in the foreign language classroom is a common resource for teachers since it introduces variety, provides exposure to nonverbal cultural elements and, most importantly, presents linguistic and cultural aspects of communication in their context. However, teachers using this resource face the difficulty of finding active tasks that will engage learners and discourage passive viewing. One way of working with AV material in a productive and motivating way is to ask learners to revoice or caption a clip. Revoicing refers to adding voice to a clip, such as dubbing, free commentary, audio description and karaoke singing. Captioning refers to adding written text to a clip, such as standard subtitles, annotations and intertitles. Clips can be short video or audio files, including documentaries, film scenes, news pieces, animations and songs. ClipFlair develops materials which enable foreign language learners to practice all four standard CEFR skills: writing, speaking, listening and reading. ClipFlair also defines audiovisual-specific skills, namely watching, audiovisual speaking (i.e. revoicing) and audiovisual writing (i.e. captioning). Within the project scope, material for 15 languages will be created, including English, Spanish and Portuguese, but focus is placed on less widely taught languages, namely Estonian, Greek, Romanian and Polish, as well as minority languages, i.e. Basque, Catalan and Irish. Non-European languages, namely Arabic, Chinese, Japanese, Russian and Ukrainian are also foreseen. In the long term, the project intends to develop materials that can potentially be used by any FL learner by expanding the community to include any language, level or age. The ClipFlair platform has two main areas: the ClipFlair Studio and the Clipflair Social Network. The Studio offers the captioning and revoicing tools needed by activity authors to create activities. It is also the space where learners can practice and learn languages by using these activities. ClipFlair activities typically involve captioning and/or revoicing of clips. At the Social Network, users can find material, including activities, clips and tutorials, collaborate through groups, send feedback through forums and find information about the project. The consortium consists of ten institutions from eight European countries, with proven experience and competences to undertake the tasks in their field of expertise and to create material for 15 languages. There is a balance between experts in the three fields involved: Language Teaching, Audiovisual Translation and Accessibility, Information and Communication technologies.

Patent
05 Apr 2013
TL;DR: In this paper, the authors describe a system for processing closed captions in a video stream and a second video stream including closed caption data that is generated based on the interpreted closed caption.
Abstract: Systems and methods of processing closed captions are disclosed. For example, a media server may receive a first video stream and first closed caption data associated with the first video stream. The media server may interpret at least one command included in the first closed caption data to generate interpreted closed caption data. The media server may transmit, to a destination device, a second video stream including second closed caption data that is generated based on the interpreted closed caption data.

Patent
04 Sep 2013
TL;DR: In this paper, a capture infrastructure annotates audio-visual data with a brand name and/or a product name by comparing entries in a master database with closed captioning data of the audio-visual data and through application of an optical character recognition algorithm to the audio-visual data.
Abstract: A method, apparatus and system for annotation of meta-data through a capture infrastructure are disclosed. In one embodiment, a method of a client device includes applying an automatic content recognition algorithm to determine a content identifier of an audio-visual data. The client device then associates the content identifier with an advertisement data based on a semantic correlation between a meta-data of the advertisement provided by a content provider and/or the content identifier. A capture infrastructure annotates the audio-visual data with a brand name and/or a product name by comparing entries in the master database with closed captioning data of the audio-visual data and/or through application of an optical character recognition algorithm to the audio-visual data. The content identifier may involve a music identification, an object identification, a facial identification, and/or a voice identification. A minimal functionality, including accessing a tuner and/or a stream decoder that identifies a channel and/or a content, may be found in the networked media device. The networked media device may produce an audio fingerprint and/or a video fingerprint that is communicated with the capture infrastructure.

Proceedings ArticleDOI
28 Jul 2013
TL;DR: IntoNow uses the microphone of the companion device to sample the audio coming from the TV set, and compares it against a database of TV shows in order to identify the program being watched and retrieves information related to the program the user is watching.
Abstract: IntoNow is a mobile application that provides a second-screen experience to television viewers. IntoNow uses the microphone of the companion device to sample the audio coming from the TV set, and compares it against a database of TV shows in order to identify the program being watched. The system we demonstrate is activated by IntoNow for specific types of shows. It retrieves information related to the program the user is watching by using closed captions, which are provided by each broadcasting network along with the TV signal. It then matches the stream of closed captions in real-time against multiple sources of content. More specifically, during news programs it displays links to online news articles and the profiles of people and organizations in the news, and during music shows it displays links to songs. The matching models are machine-learned from editorial judgments, and tuned to achieve approximately 90% precision.
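
The machine-learned matching models are not described beyond their precision target; as a stand-in, the sketch below matches a sliding window of closed-caption text against candidate items (article headlines, song titles) by token overlap and keeps only matches above a threshold. The names and the threshold are illustrative.

```python
def _tokens(text):
    return set(text.lower().split())

def match_captions_to_items(caption_window, candidates, min_score=0.35):
    """Return the best-matching candidate for the current caption window, or None.

    caption_window: recent closed-caption text (a string).
    candidates: list of (item_id, item_text) pairs, e.g. article headlines.
    Jaccard token overlap is a simple stand-in for the learned matching models.
    """
    cap = _tokens(caption_window)
    best_id, best_score = None, 0.0
    for item_id, item_text in candidates:
        item = _tokens(item_text)
        score = len(cap & item) / max(len(cap | item), 1)
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id if best_score >= min_score else None
```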

Patent
23 Feb 2013
TL;DR: In this article, a system and method for automatically captioning an electronic demonstration using object properties captured from the operating system is described, which is used to generate explanatory captions that are displayed to a user or trainee during the playback of the electronic demonstration.
Abstract: A system and method are disclosed for automatically captioning an electronic demonstration using object properties captured from the operating system. In response to an action that is initiated by a demonstrator, the operating system is queried to obtain the property information for the target object to which the action is directed as well as the parent object of the target object. This property information is then used to generate explanatory captions that are displayed to a user or trainee during the playback of the electronic demonstration.
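
A minimal sketch of the caption-generation step, assuming the capture layer has already queried the operating system for the target object's name and role plus its parent's title. The template wording is invented, not the patented phrasing.

```python
def caption_for_action(action, target_name, target_role, parent_title):
    """Build an explanatory caption for a recorded demonstration step.

    Example: caption_for_action("click", "Save", "button", "Document1")
    -> "Click the 'Save' button in the 'Document1' window."
    """
    return (f"{action.capitalize()} the '{target_name}' {target_role} "
            f"in the '{parent_title}' window.")

print(caption_for_action("click", "Save", "button", "Document1"))
```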

Posted Content
TL;DR: In this paper, the authors perform an automatic analysis of television news programs, based on the closed captions that accompany them, and present a series of key insights about news providers, people in the news, and discuss the biases that can be uncovered by automatic means.
Abstract: We perform an automatic analysis of television news programs, based on the closed captions that accompany them. Specifically, we collect all the news broadcasted in over 140 television channels in the US during a period of six months. We start by segmenting, processing, and annotating the closed captions automatically. Next, we focus on the analysis of their linguistic style and on mentions of people using NLP methods. We present a series of key insights about news providers, people in the news, and we discuss the biases that can be uncovered by automatic means. These insights are contrasted by looking at the data from multiple points of view, including qualitative assessment.

Patent
Stephen Rys, Dale Malik, Nadia Morris
23 Oct 2013
TL;DR: In this paper, a method is proposed to identify, at a computing device, multiple segments of video content based on a context sensitive term and each segment of the multiple segments is associated with captioning data of the video content.
Abstract: A method includes identifying, at a computing device, multiple segments of video content based on a context sensitive term. Each segment of the multiple segments is associated with captioning data of the video content. The method also includes determining, at the computing device, first contextual information of a first segment of the multiple segments based on a set of factors. The method further includes comparing the first contextual information to particular contextual information that corresponds to content of interest. The method further includes in response to a determination that the first contextual information matches the particular contextual information, storing a first searchable tag associated with the first segment.
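
A compact sketch of the tagging loop described in the claim, assuming caption cues with timestamps and a simple keyword-set stand-in for the "contextual information" and "set of factors"; both the representation and the matching rule are invented for illustration.

```python
def tag_segments(cues, term, interest_keywords):
    """Store a searchable tag for caption segments that mention a
    context-sensitive term and whose context matches the content of interest.

    cues: list of (start_s, end_s, text) caption cues.
    interest_keywords: set of words standing in for the particular
    contextual information; real systems would use richer factors.
    """
    interest = {k.lower() for k in interest_keywords}
    tags = []
    for start, end, text in cues:
        words = set(text.lower().split())
        if term.lower() in words and words & interest:
            tags.append({"start": start, "end": end, "term": term})
    return tags
```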

Proceedings ArticleDOI
13 May 2013
TL;DR: This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time.
Abstract: IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected one. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time.

Proceedings ArticleDOI
27 Apr 2013
TL;DR: This paper explores the effect of adaptively scaling the amount of content presented to each worker based on their past and recent performance, in contrast to prior work that uses 'one size fits all' segment durations regardless of an individual worker's ability or preferences.
Abstract: Real-time captioning provides deaf and hard of hearing users with access to live spoken language. The most common source of real-time captions are professional stenographers, but they are expensive (up to $200/hr). Recent work shows that groups of non-experts can collectively caption speech in real-time by directing workers to different portions of the speech and automatically merging the pieces together. This work uses 'one size fits all' segment durations regardless of an individual worker's ability or preferences. In this paper, we explore the effect of adaptively scaling the amount of content presented to each worker based on their past and recent performance. For instance, giving fast typists longer segments and giving workers shorter segments as they fatigue. Studies with 24 remote crowd workers, using ground truth in segment calculations, show that this approach improves average coverage by over 54%, and F1 score (harmonic mean) by over 44%.
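
The exact scaling rule is not given in the abstract; the sketch below tracks each worker's recent typing rate with an exponential moving average and sizes the next audio segment relative to an assumed speech rate, so fast typists get longer segments and fatiguing workers get shorter ones. All constants are illustrative.

```python
class AdaptiveSegmenter:
    """Scale per-worker segment durations from recent performance.

    Fast typists get longer segments; a worker whose measured rate drops
    (e.g. due to fatigue) gets shorter ones. Constants are illustrative,
    not taken from the paper.
    """

    def __init__(self, base_s=4.0, min_s=2.0, max_s=10.0, alpha=0.3):
        self.base_s, self.min_s, self.max_s, self.alpha = base_s, min_s, max_s, alpha
        self.rate = {}  # worker_id -> EMA of words typed per second

    def update(self, worker_id, words_typed, segment_seconds):
        """Record how many words a worker typed for the segment they received."""
        observed = words_typed / segment_seconds
        prev = self.rate.get(worker_id, observed)
        self.rate[worker_id] = self.alpha * observed + (1 - self.alpha) * prev

    def next_segment_seconds(self, worker_id, expected_speech_wps=2.5):
        """Choose the next segment length, clipped to [min_s, max_s]."""
        rate = self.rate.get(worker_id)
        if rate is None:
            return self.base_s
        scaled = self.base_s * (rate / expected_speech_wps)
        return max(self.min_s, min(self.max_s, scaled))
```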

Proceedings ArticleDOI
28 Oct 2013
TL;DR: An automatic analysis of television news programs, based on the closed captions that accompany them, yields a series of key insights about news providers, people in the news, and the biases that can be uncovered by automatic means.
Abstract: We perform an automatic analysis of television news programs, based on the closed captions that accompany them. Specifically, we collect all the news broadcasted in over 140 television channels in the US during a period of six months. We start by segmenting, processing, and annotating the closed captions automatically. Next, we focus on the analysis of their linguistic style and on mentions of people using NLP methods. We present a series of key insights about news providers, people in the news, and we discuss the biases that can be uncovered by automatic means. These insights are contrasted by looking at the data from multiple points of view, including qualitative assessment.

Journal ArticleDOI
TL;DR: In this paper, the effects of advertisement in accessible format, through the use of captioning and Indian sign language (ISL), on hearing and deaf people were investigated, and the results showed that accessible formats increased the comprehension of the message of the advertisement and use of ISL helped deaf persons to understand concepts better.
Abstract: Universal Design in Media as a strategy to achieve accessibility in digital television started in Spain in 1997 with the digitalization of satellite platforms (MuTra, 2006). In India, a conscious effort toward a strategy for accessible media format in digital television is yet to be made. Advertising in India is a billion dollar industry (Adam Smith, 2008) and digital television provides a majority of the space for it. This study investigated the effects of advertisement in accessible format, through the use of captioning and Indian sign language (ISL), on hearing and deaf people. “Deaf (capital letter ‘D’ used for culturally Deaf) and hearing” viewers watched two short recent advertisements with and without accessibility formats in a randomized order. Their reactions were recorded on a questionnaire developed for the purpose of the study. Eighty-four persons participated in this study, of whom 42 were deaf. Analysis of the data showed that there was a difference in the effects of accessible and nonaccessible formats of advertisement on the “Deaf and Hearing” viewers. The study showed that accessible formats increased the comprehension of the message of the advertisement and use of ISL helped deaf persons to understand concepts better. While captioning increased the perception of the hearing persons to correlate with listening and understanding the concept of the advertisement, the deaf persons correlated watching the ISL interpreter with understanding the concept of the advertisement. Placement of the ISL interpreter on the screen and the color of the fonts used for captioning were also covered in the study. However, the placement of the ISL interpreter, the color of fonts on the screen, and their correlation with comprehension of the advertisement by hearing and deaf persons did not show much significance in the results of the study.

Proceedings Article
Rebecca Mason
01 Jun 2013
TL;DR: This work presents a framework for image caption generation that does not rely on visual recognition systems, implemented on a dataset of online shopping images and product descriptions, and proposes future work to improve the method as well as extensions to other domains of images and natural text.
Abstract: Automatically describing visual content is an extremely difficult task, with hard AI problems in Computer Vision (CV) and Natural Language Processing (NLP) at its core. Previous work relies on supervised visual recognition systems to determine the content of images. These systems require massive amounts of hand-labeled data for training, so the number of visual classes that can be recognized is typically very small. We argue that these approaches place unrealistic limits on the kinds of images that can be captioned, and are unlikely to produce captions which reflect human interpretations. We present a framework for image caption generation that does not rely on visual recognition systems, which we have implemented on a dataset of online shopping images and product descriptions. We propose future work to improve this method, and extensions for other domains of images and natural text.