Enhancing the Usability of Real-Time Speech
Recognition Captioning through Personalised Displays
and Real-Time Multiple Speaker Editing and Annotation
Mike Wald¹, Keith Bain²

¹ Learning Technologies Group, School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, United Kingdom, M.Wald@soton.ac.uk
² Liberated Learning, Saint Mary's University, Halifax, NS B3H 3C3, Canada, Keithbain@stmarys.ca
Abstract. Text transcriptions of the spoken word can benefit deaf people and also anyone who needs to review what has been said (e.g. at lectures, presentations, meetings, etc.). Real-time captioning (i.e. creating a live verbatim transcript of what is being spoken) using phonetic keyboards can provide an accurate live transcription for deaf people, but is often not available because of the cost and shortage of highly skilled and trained stenographers. This paper describes the development of a system that can provide an automatic text transcription of multiple speakers using speech recognition (SR), with the names of speakers identified in the transcription and corrections of SR errors made in real time by a human 'editor'.
Keywords: Real time, captioning, speech recognition, editing, multiple
speakers, transcription
1 Introduction
Text transcriptions of the spoken word can benefit deaf people and also anyone who needs to review what has been said (e.g. at lectures, presentations, meetings, etc.). Real-time captioning (i.e. creating a live verbatim transcript of what is being spoken) using phonetic keyboards can provide a live transcription for deaf people and can cope accurately (e.g. >98%) with people talking at up to 240 words per minute, but is often not available because of the cost and shortage of highly skilled and trained stenographers [1] [2]. This paper describes the development of applications that use speech recognition to provide automatic text transcriptions.

2 Visual Indication of Pauses
Standard speech recognition (SR) software (e.g. Dragon, ViaVoice [3]) was found to be unsuitable for live transcription of speech because, without the dictation of punctuation, it produced a continuous unbroken stream of text that was very difficult to read and comprehend. IBM and Liberated Learning (LL) therefore developed ViaScribe [4] [5], an SR application that automatically formats real-time text captions from live speech with a visual indication of pauses. Detailed feedback from students with a wide range of physical, sensory and cognitive disabilities, and interviews with lecturers [6], showed that both students and teachers felt this approach improved teaching and learning as long as the text was reasonably accurate (e.g. >85%).
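The paper does not specify ViaScribe's internal formatting rules, but the underlying idea of breaking recognised text at pauses can be sketched. The following Python fragment is an illustrative assumption, not ViaScribe's actual API or thresholds: it segments a stream of timestamped words into caption lines wherever the silence between words exceeds a threshold.

```python
# Illustrative sketch: segment timestamped ASR words into caption lines
# at pauses, in the spirit of ViaScribe's pause-based formatting. The
# input format and threshold are assumptions, not ViaScribe's actual API.

PAUSE_THRESHOLD = 0.8  # seconds of silence treated as a break (assumed)

def segment_captions(words):
    """words: iterable of (text, start_time, end_time) tuples in time order."""
    lines, current = [], []
    prev_end = None
    for text, start, end in words:
        if prev_end is not None and start - prev_end > PAUSE_THRESHOLD and current:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = end
    if current:
        lines.append(" ".join(current))
    return lines

# A 1.2 s gap before "Real" starts a new caption line.
stream = [("Text", 0.0, 0.4), ("transcriptions", 0.5, 1.1), ("help", 1.2, 1.5),
          ("deaf", 1.6, 1.9), ("people", 2.0, 2.4),
          ("Real", 3.6, 3.9), ("time", 4.0, 4.2), ("captioning", 4.3, 4.9)]
print("\n".join(segment_captions(stream)))
```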
3 Personalised and Customisable Display
While projecting the text onto a large screen has been used successfully in LL classrooms, it is clear that in many situations an individual personalised and customisable display (e.g. font size, formatting, colour, etc.) would be preferable or essential. A personalised display client (PDC) and server were therefore developed to enable users to customise their displays on their own networked computers [7].
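The paper does not describe the client's implementation. As a rough illustration of the personalisation idea, the sketch below (all names are hypothetical) applies per-user display preferences to a shared caption line, rendered here as HTML purely for brevity: every client receives the same text but styles it locally.

```python
# Minimal sketch of the personalised-display idea: one shared caption
# stream, each client applying its own display preferences locally.
# Class and field names are illustrative assumptions, not the actual
# protocol used by the Liberated Learning personal display client.

from dataclasses import dataclass

@dataclass
class DisplayPrefs:
    font_family: str = "Arial"
    font_size: int = 18
    foreground: str = "white"
    background: str = "black"

def render_html(caption: str, prefs: DisplayPrefs) -> str:
    """Wrap one caption line in per-user styling."""
    style = (f"font-family:{prefs.font_family};"
             f"font-size:{prefs.font_size}px;"
             f"color:{prefs.foreground};"
             f"background:{prefs.background}")
    return f'<div style="{style}">{caption}</div>'

# Two clients display the same caption differently.
line = "Speech recognition can caption lectures in real time."
print(render_html(line, DisplayPrefs()))
print(render_html(line, DisplayPrefs(font_size=36, foreground="yellow")))
```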
4 Real-Time Editing
SR accuracy may be reduced where the original speech is not of sufficient volume or quality (e.g. poor microphone position, telephone, internet, television, an indistinct speaker) or when the system has not been trained (e.g. multiple speakers, meetings, panels, audience questions). An experienced, trained 're-voicer' repeating what has been said can sometimes improve SR readability in these situations, by correcting ASR errors if the accuracy is high and the speaking rate low, and by summarising what is being said if the speaking rates are fast [8] [9]. Summarisation, however, requires the re-voicer to actually understand and 'interpret' what is being said and therefore to have a good knowledge of the subject.
To improve the accuracy of verbatim captions created directly from the voice of the original speaker, the application RealTimeEdit (RTE) was developed to enable corrections to ASR captions to be made in real time [10]. One editor can find and correct errors, or the task of finding and correcting errors can be shared between two editors, one using the mouse and the other the keyboard. It is also possible to use multiple editors sequentially, allowing a second operator to correct errors that a first operator did not have time to correct. The editor can also annotate where required (e.g. to describe sounds as <<LAUGHING>>, or to mark mumbled and clearly incorrectly recognised words that they cannot identify as <<INAUDIBLE>>). In this way a real-time editor can be used in situations where high-accuracy captions are required and a real-time stenographer is not available.
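RTE itself is an interactive GUI application and its internals are not described here. The sketch below is only a rough model of the correction and annotation step, with an assumed operation format: an editor maps word positions to replacement text, an annotation marker, or a deletion, and the corrected caption is then released to viewers.

```python
# Rough model of real-time correction and annotation of an ASR caption
# before it is released to viewers. The operation format is assumed for
# illustration; RealTimeEdit's actual interface is a GUI, not this API.

def apply_edits(words, edits):
    """words: recognised words of one caption; edits: dict mapping a word
    index to replacement text, an annotation marker such as
    '<<INAUDIBLE>>' or '<<LAUGHING>>', or None to delete the word."""
    out = []
    for i, word in enumerate(words):
        if i in edits:
            if edits[i] is not None:
                out.append(edits[i])
        else:
            out.append(word)
    return " ".join(out)

# One editor fixes a misrecognised word; another marks a word that
# cannot be identified.
caption = ["please", "turn", "to", "page", "torty", "mmph"]
edits = {4: "forty", 5: "<<INAUDIBLE>>"}
print(apply_edits(caption, edits))  # please turn to page forty <<INAUDIBLE>>
```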
Untrained users of an initial prototype of RTE achieved up to eleven corrections per minute, and a theoretical analysis suggested that experienced touch typists could be trained to achieve over 15 corrections per minute. Analysis of an ASR transcript with a 22% error rate also suggested that correcting fewer than 20% of the errors, those 'critical' to understanding, may be all that is required to understand the meaning of all the captions [11]. Somebody talking at 150 words per minute with a 22% error rate produces an average of 33 errors per minute; if only 20% of these errors were 'critical' to understanding, the editor would have to correct on average only about 7 errors per minute. This suggests that even if 100% accuracy is not achievable, 100% understanding might be. Judging which words are critical to understanding might be an easier task than the summarisation faced by the re-voicer.
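The workload estimate above can be restated as a short calculation; the 150 wpm, 22% and 20% figures come directly from the text.

```python
# The editing-workload arithmetic from the paragraph above, as a check.

words_per_minute = 150
error_rate = 0.22          # ASR word error rate
critical_fraction = 0.20   # fraction of errors 'critical' to understanding

errors_per_minute = words_per_minute * error_rate            # 33 errors/min
critical_per_minute = errors_per_minute * critical_fraction  # about 7/min

print(f"{errors_per_minute:.0f} errors per minute, of which about "
      f"{critical_per_minute:.0f} need correcting for understanding")
```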
5 Multiple Speaker Transcriptions
In situations where there is more than one person speaking, using multiple instances of ViaScribe creates captions in multiple windows, making it difficult to follow the sequence of the utterances. To produce a transcript of the session with speakers identified, the application RealTimeMerge (RTM) was developed to add each speaker's name to the text captions and merge the streams from the instances of ViaScribe. Each speaker's instance of ViaScribe can have a separate editor and the edited outputs can then be merged, or alternatively the merged but unedited output can be edited. The combination of ViaScribe, the ViaScribe server, PDC, RTE and RTM enables a very flexible approach to be adopted that can provide solutions to many requirements. Figures 1-7 show how the recognised text from four speakers using four instances of ViaScribe can be output via the server, merged with the speakers' names added, and then edited for errors before the corrected transcript is displayed on one or more clients. Trials of the system in a variety of settings are being conducted to investigate in practice the effect of error rates, number of speakers, editing operator skill requirements, etc.
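As a rough model of what RTM does, the sketch below interleaves timestamped caption fragments from several speakers into one transcript, prefixing each line with the speaker's name. The event format and merge strategy are assumptions; the real system reads network streams from each ViaScribe server, as set up in Figs. 3 and 4.

```python
# Illustrative merge of per-speaker caption streams into one transcript,
# in the spirit of RealTimeMerge. Data format is an assumption.

import heapq

def merge_streams(streams):
    """streams: dict mapping a speaker's name to a time-ordered list of
    (timestamp, text) caption fragments. Yields one merged transcript
    line per fragment, tagged with the speaker's name."""
    heap = []
    for name, fragments in streams.items():
        it = iter(fragments)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], name, first[1], it))
    while heap:
        ts, name, text, it = heapq.heappop(heap)
        yield f"{name}: {text}"
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], name, nxt[1], it))

streams = {
    "Speaker 1": [(0.0, "Welcome everyone."), (9.5, "Any questions?")],
    "Speaker 2": [(4.2, "Thanks."), (11.0, "Yes, about the editor.")],
}
print("\n".join(merge_streams(streams)))
```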

Fig. 1. Four instances of ViaScribe showing the ASR text captions, with errors, for four separate speakers

Fig. 2. ViaScribe server started

Fig. 3. RealTimeMerge input setup window

Fig. 4. The four speakers' names and server IP addresses and port numbers have been added to RealTimeMerge

Fig. 5. RealTimeEdit displaying the merged captions and names of the four speakers, as output by RealTimeMerge

References

Speech recognition in university classrooms: liberated learning project. Addresses the intriguing questions and explores the underlying complex relationship between speech recognition technology, university educational environments, and disability issues.

Speech-Based Real-Time Subtitling Services. Describes how the "SpeakTitle" project met the challenges of real-time speech recognition and live subtitling through the development of a customisable speaker interface and the use of 'Topics' for specific subject domains.

Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Describes the development of a system that enables editors to correct errors in the captions as they are created by Automatic Speech Recognition.

An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

Using Automatic Speech Recognition to Assist Communication and Learning (Mike Wald et al.). Describes the achievements and planned developments of the Liberated Learning Consortium to support preferred learning and teaching styles and assist those who, for cognitive, physical or sensory reasons, find notetaking difficult.