Enhancing the Usability of Real-Time Speech
Recognition Captioning through Personalised Displays
and Real-Time Multiple Speaker Editing and Annotation
Mike Wald¹, Keith Bain²

¹ Learning Technologies Group, School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, United Kingdom, M.Wald@soton.ac.uk
² Liberated Learning, Saint Mary's University, Halifax, NS B3H 3C3, Canada, Keithbain@stmarys.ca
Abstract. Text transcriptions of the spoken word can benefit deaf people and also anyone who needs to review what has been said (e.g. at lectures, presentations, meetings, etc.). Real-time captioning (i.e. creating a live verbatim transcript of what is being spoken) using phonetic keyboards can provide an accurate live transcription for deaf people, but is often not available because of the cost and shortage of highly skilled and trained stenographers. This paper describes the development of a system that can provide an automatic text transcription of multiple speakers using speech recognition (SR), with the names of speakers identified in the transcription and corrections of SR errors made in real time by a human 'editor'.
Keywords: Real time, captioning, speech recognition, editing, multiple
speakers, transcription
1 Introduction
Text transcriptions of the spoken word can benefit deaf people and also anyone who needs to review what has been said (e.g. at lectures, presentations, meetings, etc.). Real-time captioning (i.e. creating a live verbatim transcript of what is being spoken) using phonetic keyboards can provide a live transcription for deaf people and can cope accurately (e.g. >98%) with people talking at up to 240 words per minute, but is often not available because of the cost and shortage of highly skilled and trained stenographers [1] [2]. This paper describes the development of applications that use speech recognition to provide automatic text transcriptions.

2 Visual Indication of Pauses
Standard speech recognition (SR) software (e.g. Dragon, ViaVoice [3]) was found to be unsuitable for live transcription of speech because, without the dictation of punctuation, it produced a continuous unbroken stream of text that was very difficult to read and comprehend. IBM and Liberated Learning (LL) therefore developed ViaScribe [4] [5], an SR application that automatically formats real-time text captions from live speech with a visual indication of pauses. Detailed feedback from students with a wide range of physical, sensory and cognitive disabilities, and interviews with lecturers [6], showed that both students and teachers felt this approach improved teaching and learning as long as the text was reasonably accurate (e.g. >85%).
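The paper does not specify ViaScribe's internal formatting rules, but the underlying idea of breaking recognised text at pauses can be sketched. The following Python fragment is an illustrative assumption, not ViaScribe's actual API or thresholds: it segments a stream of timestamped words into caption lines wherever the silence between words exceeds a threshold.

```python
# Illustrative sketch: segment timestamped ASR words into caption lines
# at pauses, in the spirit of ViaScribe's pause-based formatting. The
# input format and threshold are assumptions, not ViaScribe's actual API.

PAUSE_THRESHOLD = 0.8  # seconds of silence treated as a break (assumed)

def segment_captions(words):
    """words: iterable of (text, start_time, end_time) tuples in time order."""
    lines, current = [], []
    prev_end = None
    for text, start, end in words:
        if prev_end is not None and start - prev_end > PAUSE_THRESHOLD and current:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = end
    if current:
        lines.append(" ".join(current))
    return lines

# A 1.2 s gap before "Real" starts a new caption line.
stream = [("Text", 0.0, 0.4), ("transcriptions", 0.5, 1.1), ("help", 1.2, 1.5),
          ("deaf", 1.6, 1.9), ("people", 2.0, 2.4),
          ("Real", 3.6, 3.9), ("time", 4.0, 4.2), ("captioning", 4.3, 4.9)]
print("\n".join(segment_captions(stream)))
```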
3 Personalised and Customisable Display
While projecting the text onto a large screen has been used successfully in LL classrooms, it is clear that in many situations an individual personalised and customisable display (e.g. font size, formatting, colour, etc.) would be preferable or essential. A personalised display client (PDC) and server were therefore developed to enable users to customise their displays on their own networked computers [7].
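The paper does not describe the client's implementation. As a rough illustration of the personalisation idea, the sketch below (all names are hypothetical) applies per-user display preferences to a shared caption line, rendered here as HTML purely for brevity: every client receives the same text but styles it locally.

```python
# Minimal sketch of the personalised-display idea: one shared caption
# stream, each client applying its own display preferences locally.
# Class and field names are illustrative assumptions, not the actual
# protocol used by the Liberated Learning personal display client.

from dataclasses import dataclass

@dataclass
class DisplayPrefs:
    font_family: str = "Arial"
    font_size: int = 18
    foreground: str = "white"
    background: str = "black"

def render_html(caption: str, prefs: DisplayPrefs) -> str:
    """Wrap one caption line in per-user styling."""
    style = (f"font-family:{prefs.font_family};"
             f"font-size:{prefs.font_size}px;"
             f"color:{prefs.foreground};"
             f"background:{prefs.background}")
    return f'<div style="{style}">{caption}</div>'

# Two clients display the same caption differently.
line = "Speech recognition can caption lectures in real time."
print(render_html(line, DisplayPrefs()))
print(render_html(line, DisplayPrefs(font_size=36, foreground="yellow")))
```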
4 Real-Time Editing
SR accuracy may be reduced where the original speech is not of sufficient volume or quality (e.g. poor microphone position, telephone, internet, television, an indistinct speaker) or when the system has not been trained (e.g. multiple speakers, meetings, panels, audience questions). An experienced, trained 're-voicer' repeating what has been said can sometimes improve SR readability in these situations, by correcting ASR errors if the accuracy is high and the speaking rate low, and by summarising what is being said if the speaking rates are fast [8] [9]. Summarisation, however, requires the re-voicer to actually understand and 'interpret' what is being said and therefore to have a good knowledge of the subject.
To improve the accuracy of verbatim captions created directly from the voice of the original speaker, the application RealTimeEdit (RTE) was developed to enable corrections to ASR captions to be made in real time [10]. One editor can find and correct errors, or the task of finding and correcting errors can be shared between two editors, one using the mouse and the other the keyboard. It is also possible to use multiple editors sequentially, allowing a second operator to correct errors that a first operator did not have time to correct. The editor can also annotate where required (e.g. to describe sounds as <<LAUGHING>>, or to mark mumbled and clearly incorrectly recognised words that they cannot identify as <<INAUDIBLE>>). In this way a real-time editor can be used in situations where high-accuracy captions are required and a real-time stenographer is not available.
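RTE itself is an interactive GUI application and its internals are not described here. The sketch below is only a rough model of the correction and annotation step, with an assumed operation format: an editor maps word positions to replacement text, an annotation marker, or a deletion, and the corrected caption is then released to viewers.

```python
# Rough model of real-time correction and annotation of an ASR caption
# before it is released to viewers. The operation format is assumed for
# illustration; RealTimeEdit's actual interface is a GUI, not this API.

def apply_edits(words, edits):
    """words: recognised words of one caption; edits: dict mapping a word
    index to replacement text, an annotation marker such as
    '<<INAUDIBLE>>' or '<<LAUGHING>>', or None to delete the word."""
    out = []
    for i, word in enumerate(words):
        if i in edits:
            if edits[i] is not None:
                out.append(edits[i])
        else:
            out.append(word)
    return " ".join(out)

# One editor fixes a misrecognised word; another marks a word that
# cannot be identified.
caption = ["please", "turn", "to", "page", "torty", "mmph"]
edits = {4: "forty", 5: "<<INAUDIBLE>>"}
print(apply_edits(caption, edits))  # please turn to page forty <<INAUDIBLE>>
```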
Untrained users of an initial prototype of RTE achieved up to eleven corrections per minute, and a theoretical analysis suggested that experienced touch typists could be trained to achieve over 15 corrections per minute. Analysis of an ASR transcript with a 22% error rate also suggested that correcting fewer than 20% of the errors, those 'critical' to understanding, may be all that is required to understand the meaning of all the captions [11]. Somebody talking at 150 words per minute with a 22% error rate produces an average of 33 errors per minute; if only 20% of these errors were 'critical' to understanding, the editor would have to correct on average only about 7 errors per minute. This suggests that even if 100% accuracy is not achievable, 100% understanding might be. Judging which words are critical to understanding might be an easier task than the summarisation faced by the re-voicer.
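The workload estimate above can be restated as a short calculation; the 150 wpm, 22% and 20% figures come directly from the text.

```python
# The editing-workload arithmetic from the paragraph above, as a check.

words_per_minute = 150
error_rate = 0.22          # ASR word error rate
critical_fraction = 0.20   # fraction of errors 'critical' to understanding

errors_per_minute = words_per_minute * error_rate            # 33 errors/min
critical_per_minute = errors_per_minute * critical_fraction  # about 7/min

print(f"{errors_per_minute:.0f} errors per minute, of which about "
      f"{critical_per_minute:.0f} need correcting for understanding")
```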
5 Multiple Speaker Transcriptions
In situations where there is more than one person speaking, using multiple instances of ViaScribe creates captions in multiple windows, making it difficult to follow the sequence of the utterances. To produce a transcript of the session with speakers identified, the application RealTimeMerge (RTM) was developed to add each speaker's name to the text captions and merge the streams from the instances of ViaScribe. Each speaker's instance of ViaScribe can have a separate editor and the edited outputs can then be merged, or alternatively the merged but unedited output can be edited. The combination of ViaScribe, the ViaScribe server, PDC, RTE and RTM enables a very flexible approach to be adopted that can provide solutions to many requirements. Figures 1-7 show how the recognised text from four speakers using four instances of ViaScribe can be output via the server, merged with the speakers' names added, and then edited for errors before the corrected transcript is displayed on one or more clients. Trials of the system in a variety of settings are being conducted to investigate in practice the effect of error rates, number of speakers, editing operator skill requirements, etc.
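As a rough model of what RTM does, the sketch below interleaves timestamped caption fragments from several speakers into one transcript, prefixing each line with the speaker's name. The event format and merge strategy are assumptions; the real system reads network streams from each ViaScribe server, as set up in Figs. 3 and 4.

```python
# Illustrative merge of per-speaker caption streams into one transcript,
# in the spirit of RealTimeMerge. Data format is an assumption.

import heapq

def merge_streams(streams):
    """streams: dict mapping a speaker's name to a time-ordered list of
    (timestamp, text) caption fragments. Yields one merged transcript
    line per fragment, tagged with the speaker's name."""
    heap = []
    for name, fragments in streams.items():
        it = iter(fragments)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], name, first[1], it))
    while heap:
        ts, name, text, it = heapq.heappop(heap)
        yield f"{name}: {text}"
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], name, nxt[1], it))

streams = {
    "Speaker 1": [(0.0, "Welcome everyone."), (9.5, "Any questions?")],
    "Speaker 2": [(4.2, "Thanks."), (11.0, "Yes, about the editor.")],
}
print("\n".join(merge_streams(streams)))
```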

Fig. 1. Four instances of ViaScribe showing the ASR text captions, with errors, for four separate speakers

Fig. 2. ViaScribe server started

Fig. 3. RealTimeMerge input setup window

Fig. 4. The four speakers' names and server IP addresses and port numbers have been added to RealTimeMerge

Fig. 5. RealTimeEdit displaying the merged captions and names of the four speakers, as output by RealTimeMerge

References

Speech recognition in university classrooms: liberated learning project. Addresses the intriguing questions and explores the underlying complex relationship between speech recognition technology, university educational environments, and disability issues.

Speech-Based Real-Time Subtitling Services. Describes how the "SpeakTitle" project met the challenges of real-time speech recognition and live subtitling through the development of a customisable speaker interface and the use of 'Topics' for specific subject domains.

Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Describes the development of a system that enables editors to correct errors in the captions as they are created by Automatic Speech Recognition.

An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

Using Automatic Speech Recognition to Assist Communication and Learning (Mike Wald et al.). Describes the achievements and planned developments of the Liberated Learning Consortium to support preferred learning and teaching styles and assist those who, for cognitive, physical or sensory reasons, find notetaking difficult.