
Linguistics Vanguard 2018; 20170041
Francis F. Steen, Anders Hougaard, Jungseock Joo, Inés Olza, Cristóbal Pagán
Cánovas, Anna Pleshakova, Soumya Ray, Peter Uhrig, Javier Valenzuela, Jacek Woźny
and Mark Turner*
Toward an infrastructure for data-driven
multimodal communication research
https://doi.org/10.1515/lingvan-2017-0041
Received September 7, 2017; accepted January 31, 2018
Abstract: Research into the multimodal dimensions of human communication faces a set of distinctive
methodological challenges. Collecting the datasets is resource-intensive, analysis often lacks peer validation,
and the absence of shared datasets makes it difficult to develop standards. External validity is hampered by
small datasets, yet large datasets are intractable. Red Hen Lab spearheads an international infrastructure
for data-driven multimodal communication research, facilitating an integrated cross-disciplinary workflow.
Linguists, communication scholars, statisticians, and computer scientists work together to develop research
questions, annotate training sets, and develop pattern discovery and machine learning tools that handle vast
collections of multimodal data, beyond the dreams of previous researchers. This infrastructure makes it pos-
sible for researchers at multiple sites to work in real-time in transdisciplinary teams. We review the vision,
progress, and prospects of this research consortium.
Keywords: multimodality; machine learning; automated parsing; corpora; research consortia.
1 Introduction
Human face-to-face communication has always taken place across multiple modalities: through gesture,
facial expression, posture, tone of voice, pacing, gaze direction, touch, and words. Elaborate multimodal
communication is a central and constantly active part of human cognition, in science, technology, engineer-
ing, mathematics, art, religion, crafts, social interaction, learning, innovation, memory, attention, travel, and
all other activities, whether goal-based or not. Cultures invest heavily to support this aspect of human life:
classical cultures emphasized the importance of rhetorical training, and today’s world is crowded with novel
technologies of multimodal communication, from television to social media, creating an unprecedented
trove of digital records. Communication skills involve higher-order cognition, precisely timed movements,
delicately modulated sounds, conceiving of the mental states of others from moment to moment, dynami-
cally coordinating with other agents, and a high level of contextual awareness (Duranti and Goodwin 1992;
Clark 1996).
From Pāṇini (Sharma 1987–2003) to Chomsky and McGilvray (2012), the systematic study of human com-
munication has been largely focused on the written representation of language: understandably so, as it is
highly structured and can be shared in the process of describing and explaining it. The full multimodal dimen-
sions of communication have a very brief history of scholarship and present new methodological challenges.
*Corresponding author: Mark Turner, Case Western Reserve University, Cleveland, OH, USA, E-mail: mark.turner@case.edu.
http://orcid.org/0000-0002-2089-5978
Francis F. Steen and Jungseock Joo: University of California Los Angeles, Los Angeles, CA, USA. http://orcid.org/0000-0003-2963-4077 (F.F. Steen)
Anders Hougaard: University of Southern Denmark, Odense, Denmark
Inés Olza and Cristóbal Pagán Cánovas: University of Navarra, Pamplona, Spain
Anna Pleshakova: University of Oxford, Oxford, UK
Soumya Ray: Case Western Reserve University, Cleveland, OH, USA
Peter Uhrig: FAU Erlangen-Nürnberg, Erlangen, Germany
Javier Valenzuela: University of Murcia, Murcia, Spain
Jacek Woźny: University of Wrocław, Wroclaw, Poland. http://orcid.org/0000-0003-0691-7090

Communicative behavior must be recorded with resource-intensive audiovisual technologies. Since the range
of available expressions is so wide, individual researchers need to specialize in specific modalities and con-
structions. Naturalistic data are typically not readily available; boutique collections from lab recordings take
their place. Large-scale datasets are required for systematic study, yet no single researcher has the required
time, resources, or motivation to create them. Worse, any group of researchers that succeeds in generating a
massive dataset of multimodal communication will quickly be overwhelmed, since linguistics lacks the tools
for mechanically searching and characterizing the material.
For the study of digitized written language, a wide range of technologies and tools has been
developed over the past decades in the contexts of corpus linguistics, computational linguistics, and artifi-
cial intelligence. In this context, carefully sampled corpora such as the British National Corpus (BNC) as well
as larger, less carefully sampled corpora have emerged along with corpus retrieval software such as BNCweb
(Hoffmann and Evert 2006) and its generalized and extended version, CQPweb (Hardie 2012), or the com-
mercial Sketch Engine (www.sketchengine.co.uk). While these are outstanding examples of research at the
intersection of computer science and linguistics, they have not yet embraced the full multimodal spectrum
of human communication, creating a well-defined disciplinary and interdisciplinary challenge.
At the same time, the social landscape of communication has exploded with new multimodal technolo-
gies, from television to social media, intruding on our most personal as well as our most public commu-
nicative functions. Since these exchanges are taking place in digital form, the age-old problem of how to
capture multimodal communication in a naturalistic setting is now fully tractable. This drastically reduces
the costs incurred by capturing and transcribing naturalistic spoken language such as the audio recordings
collected for the spoken demographic section of the BNC (see Crowdy 1995 for details). So far, only modest
attempts have been made to add audiovisual data to existing corpora; for instance, the Russian National
Corpus (www.ruscorpora.ru) began in 2010 to include a selection of recordings and movies from 1930 to
2007, with 4.6 million words of transcripts. Only recently has the construction of massive multimodal corpora
become feasible.
Such large-scale datasets present both an opportunity and a challenge for linguists. On the one hand, we
can now attest the presence, context, and frequency of known constructions in ecologically valid datasets,
extending, correcting, and validating decades of laboratory research. On the other hand, these datasets are
so large that they quickly swamp manual analysis. The challenge must be met with a new level of inter-
disciplinary collaboration between linguists and computer scientists. Both fields have much to gain. Com-
putational researchers gain insights into natural modes of communication, useful for designing good user
interfaces and natural interactions with robotic systems; linguists gain knowledge of tools and methods
from computer vision, audio signal processing and machine learning to analyze large amounts of data. We
see an opportunity to create a collaborative and distributed social and physical infrastructure for data-driven
multimodal communication research.
We can draw inspiration from other sciences in which cooperatives of researchers with diverse back-
grounds were established to share the gathering of data and the development of tools and analysis in real-
time. Faced with a similar mix of massive new datasets and a demand for radically new methodologies,
astronomy and genetics have undergone a comparable transformation in their disciplinary practices and
thrived. Genomics dramatically sped up its advances by creating worldwide consortia of researchers using
collaborative web platforms; see for instance the Mission Statement and Framework of the Global Alliance
for Genomics and Health (genomicsandhealth.org). In neuroscience, there are analogous initiatives, such as
the Brainhack project (Craddock et al. 2016).
Such cooperative frameworks exhibit novel social dynamics and facilitate rapid disciplinary progress. On
the one hand, information begins to flow across disciplinary boundaries, giving computational researchers
access to novel real-world problems and researchers in the target domain exposure to new methods and
skills. See, for example, the recent applications of multimodal computational methods in political science
(Joo et al. 2015), psychology (Martinez 2017), or cognitive film studies (Suchan and Bhatt 2016). Just as impor-
tant, the research results of one researcher (the actual data selection, annotation, and analysis) can be
provided immediately and substantively to the whole community, and replicated and built on in a meaningful
way.

In this paper, we describe the Distributed Little Red Hen Lab, a global laboratory and consortium
designed to facilitate large-scale collaborative research into multimodal communication. As part of this
project, we collect data on multimodal communication on a large scale, provide computational and stor-
age tools to manage data and aid in knowledge discovery, and provide means of iterative improvement by
integrating the results and feedback of researchers into the project.
Red Hen’s vision and program arise naturally from considerations that are common across all the sciences,
concerning how to improve the way we do science: by building an extensive, constantly-networked cooperative,
by developing sociological patterns of extensive real-time collaboration across that cooperative, and by
aggregating big data and developing new methods and tools that are deployed across it. Such considerations
have become inescapable for several disciplines, from biology to
materials science, linguistics to archeology, genomics to neuroscience, astronomy to computer science. Red
Hen brings these impulses to the science of human multimodal communication. As an example, for the past
four years, Red Hen has partnered with Google Summer of Code to connect computer science students from
around the world with expert mentors, generating a suite of new tools to analyze human communication.
Red Hen is not designed to be a service. Instead, it provides a framework for collaboration, pooling
expertise and resources. Access to the Red Hen tools and data is provided through the project website
(redhenlab.org), where researchers can both access data and contribute or provide feedback to the Red Hen
project.
A core activity of Red Hen Lab is an international effort to create the physical and social infrastructure
needed for the systematic study of multimodal communication. Key elements are data collection, data mining
tool development, and search engines.
2 Generating massive multimodal datasets
A shared dataset is an essential aspect of the infrastructure required for data-driven multimodal communica-
tion research. Red Hen is open to datasets in any area in which there are records of human communication.
This includes text, speech and audio recordings in any language, infant vocalization, art and sculpture, writ-
ing and notation systems, audiovisual records, architecture, signage, and of course, modern digital media.
Records and methods related to non-human communication or communication between species (e.g., border
collies responding to pointing gestures) are also naturally of interest to Red Hen. In principle, any recording in
any format of any human communication is suitable for inclusion in the archive, which consists of networked
data across the Red Hen cooperative, either natively digital or converted to digital form.
The most efficient way to acquire a massive multimodal and multilingual dataset is to record television,
a task that can be fully automated. Fortunately, section 108 of the U.S. Copyright Act authorizes libraries and
archives to record and store any broadcast of any audiovisual news program and to loan those data, within
some limits of due diligence, for the purpose of research. The NewsScape Archive of International Television
News (newsscape.library.ucla.edu) is Red Hen’s largest collection; as of November 2017, it included broadcasts from
51 networks, totaling 350,000 hours and occupying 120 terabytes. The collection dates back to 2005 and is
growing at around 5,000 shows a month. It is an official archive of the University of California, Los Angeles
(UCLA) Library, the digital continuation of UCLA’s Communication Studies Archive, initiated by Paul Rosen-
thal in 1972. The analog collection is in the process of being digitized, promising to add further years of
historical depth to the collection. Under Red Hen, it has been expanded to record television news in multiple
countries around the world, curated by local linguists participating in the Red Hen project. The NewsScape
Archive now includes, in rough order of representation, broadcasts in English, Spanish, German, French, Nor-
wegian, Swedish, Danish, Continental Portuguese, Brazilian Portuguese, Russian, Polish, Czech, Flemish,
Persian, Italian, Arabic, and Chinese. The system is fully automated and scales easily, using credit-card-sized
Raspberry Pi capture stations running custom open-source software.
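The capture software itself is not reproduced in this paper, but the general shape of such a station is easy to convey. The following minimal sketch, written in Python around an ffmpeg call, records a broadcast stream for a fixed duration and names the output by date, time, and source; the stream address, duration, and naming pattern are illustrative assumptions, not Red Hen’s actual configuration.

# capture_show.py -- minimal sketch of an automated broadcast capture job
# (illustrative only; the stream URL, duration, and file naming are assumptions,
# not Red Hen's capture software)
import subprocess
from datetime import datetime, timezone

STREAM_URL = "udp://239.255.0.1:1234"   # hypothetical DVB multicast address
SOURCE = "US_ExampleNews"               # hypothetical network label
DURATION_SECONDS = 3600                 # record one hour

def record_show():
    start = datetime.now(timezone.utc)
    outfile = start.strftime("%Y-%m-%d_%H%M_") + SOURCE + ".mp4"
    # Copy the broadcast stream to disk without re-encoding.
    subprocess.run(
        ["ffmpeg", "-y", "-i", STREAM_URL, "-t", str(DURATION_SECONDS),
         "-c", "copy", outfile],
        check=True)
    return outfile

if __name__ == "__main__":
    record_show()

A job of this kind can be scheduled, for example with cron, at each show’s start time, which is what makes the acquisition fully automatic.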
This television news dataset includes hard and soft news, such as talk shows and comedy, along with
B-roll of surveillance video, crowd-sourced videos, recordings of public events where participants do not even
know they are being recorded, etc. These genres contain a range of registers that include banter, unscripted
conversations, and playful interviews. The studio components of the television news shows in NewsScape
typically also contain a great number of unscripted or partially improvised communicative events. The most
constrained register, in which a speaker reads a text or recites a pre-prepared speech more or less verbatim, is
no longer the standard way to communicate on television. This makes NewsScape a rich resource for study-
ing largely spontaneous or unconscious aspects of multimodal communication, along with communicative
behaviors associated with a range of formal registers.
Red Hen’s infrastructure and tools also permit the incorporation of existing datasets, such as hand-
crafted collections of experimental data. The news material constitutes the bulk of the current collection, as
the recording of this content is clearly authorized by section 108 of the US Copyright Act. Smaller datasets generated by individual researchers
and teams, including student projects, are being added and described, and will be the subject of future
publications.
Red Hen proposes two complementary strategies to deal with recordings that are protected by confiden-
tiality laws. First, although lab recordings are typically protected by Institutional Review Board (IRB) regulations,
it seems plausible that an IRB might approve machine analysis of such recordings. Results of
such analysis may be shareable, provided the data is anonymized. Second, video recordings can be submit-
ted to a sketch filter, which removes textures critical to personal identification, yet retains structural elements
of multimodal communication (Diemer et al. 2016).
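To illustrate the kind of filter this refers to, the sketch below applies OpenCV’s pencil-sketch transform to every frame of a video, discarding photographic texture while keeping the outlines of the face, hands, and body. The particular transform and its parameters are an illustrative assumption, not the published filter of Diemer et al. (2016).

# sketch_filter.py -- minimal sketch of a de-identifying video filter
# (illustrative only; not the filter described in Diemer et al. 2016)
import cv2

def sketch_video(infile, outfile):
    cap = cv2.VideoCapture(infile)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(outfile, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # pencilSketch strips texture detail (skin, clothing, background)
        # but keeps structural outlines such as face, arm, and hand contours.
        gray, _color = cv2.pencilSketch(frame, sigma_s=60, sigma_r=0.07,
                                        shade_factor=0.05)
        writer.write(cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR))
    cap.release()
    writer.release()

if __name__ == "__main__":
    sketch_video("lab_recording.mp4", "lab_recording_sketch.mp4")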
3 Creating and searching metadata and annotations
Vast multimodal datasets are a boon and a curse. Linguists need them to validate existing constructions in
ecologically valid datasets, and can revel in the prospect of testing an entire generation of new hypotheses,
asking questions we simply lacked the data to answer. However, to effectively convert such data to knowl-
edge, we need automated search capabilities, and to search, we need machine-readable transcripts, ideally
enriched with metadata and annotations. Red Hen’s annotation process relies on multi-level feedback
between linguists and computer scientists, aimed at training computers to perform tasks that generate
annotations according to the linguists’ specifications.
The video stream is compressed to a 640 × 480 or similar picture size at 450 kb/s; the audio stream is
a stereo signal with a sampling rate of 44.1 kHz compressed to a bitrate of 96 kb/s. Red Hen expects that
the 44.1 kHz sampling rate and the 96 kb/s bitrate will be sufficient to make most of the audible frequencies
usable for spectrograms, but detailed tests are yet to be conducted.
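The paper does not spell out the encoder invocation behind these figures; a transcoding command matching them could look like the following sketch, which wraps ffmpeg in Python (the settings shown are simply the stated targets, not Red Hen’s actual encoding pipeline).

# transcode.py -- sketch of an ffmpeg call matching the stated targets:
# 640 x 480 video at ~450 kb/s, stereo audio at 44.1 kHz and 96 kb/s
# (illustrative; not Red Hen's actual encoding pipeline)
import subprocess

def transcode(infile, outfile):
    subprocess.run(
        ["ffmpeg", "-y", "-i", infile,
         "-vf", "scale=640:480",   # target picture size
         "-b:v", "450k",           # target video bitrate
         "-ac", "2",               # stereo audio
         "-ar", "44100",           # 44.1 kHz sampling rate
         "-b:a", "96k",            # 96 kb/s audio bitrate
         outfile],
        check=True)

if __name__ == "__main__":
    transcode("capture.ts", "capture_640x480.mp4")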
Red Hen textual data is encoded in UTF-8, stored in the standard comma-separated values format, and
named to identify the time, source, and type of the recording. The data is stored on UCLA Library servers
and elsewhere within the Red Hen network as needed. This provides the input to a variety of custom search
engines.
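As a hypothetical illustration of this convention, the sketch below parses a file name that encodes time, source, and type, and reads its UTF-8 comma-separated content; the specific naming pattern is assumed for the example and is not Red Hen’s documented scheme.

# parse_datafile.py -- hypothetical illustration of a file name encoding
# time, source, and type, with UTF-8 comma-separated content
# (the naming pattern is an assumption, not Red Hen's documented scheme)
import csv
from datetime import datetime

def parse_name(filename):
    # e.g. "2017-11-01_1830_US_ExampleNews_ocr.csv"  (hypothetical layout)
    stem = filename.rsplit(".", 1)[0]
    date, time, country, network, filetype = stem.split("_", 4)
    return {
        "time": datetime.strptime(date + " " + time, "%Y-%m-%d %H%M"),
        "source": country + "_" + network,
        "type": filetype,
    }

def read_rows(filename):
    with open(filename, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))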
A series of pipelines process these data, using customized open-source software. Some tools require rel-
atively little customization and can be deployed without deep modifications. For example, transcripts are
automatically extracted from television video in the form of subtitles. For US broadcasts, commercials are
automatically detected and annotated. Additional text that is written on the television screen is also extracted,
using tesseract-ocr (github.com/tesseract-ocr) with significant customizations for eight different languages,
examining frames at one-second intervals and retaining screen placement information.
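A stripped-down version of that kind of on-screen text extraction, using pytesseract instead of Red Hen’s customized pipeline, might look like the sketch below: frames are sampled at one-second intervals and each recognized word is stored together with its screen coordinates.

# ocr_frames.py -- sketch of on-screen text extraction at one-second intervals,
# retaining screen placement (uses pytesseract; Red Hen's customized pipeline differs)
import cv2
import pytesseract

def ocr_video(infile, lang="eng"):
    cap = cv2.VideoCapture(infile)
    results = []
    second = 0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, second * 1000)  # jump to the next sampling point
        ok, frame = cap.read()
        if not ok:
            break
        data = pytesseract.image_to_data(frame, lang=lang,
                                         output_type=pytesseract.Output.DICT)
        for i, word in enumerate(data["text"]):
            if word.strip():
                results.append({
                    "second": second,
                    "text": word,
                    # screen placement of the recognized word
                    "box": (data["left"][i], data["top"][i],
                            data["width"][i], data["height"][i]),
                })
        second += 1
    cap.release()
    return results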
In multimodal data analysis, timing is of the essence. Centisecond timestamps in UTC permit precise cor-
relations of data extracted from different modalities. To validate a multimodal construction, we need reliable
timestamps at the word level, so that individual words can be shown to co-occur with a gesture of the eyes, the
face, the shoulders, the arms, and the hands. To achieve this, the caption text is first parsed into sentences,
using custom software developed by manual inspection of abbreviations and conventions characteristic of
the medium. These sentences are fed into Stanford CoreNLP (stanfordnlp.github.io/CoreNLP), a set of nat-
ural language processing utilities providing parts of speech, lemmas, and named entities in half a dozen languages.
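A minimal way to obtain such annotations programmatically is to run the CoreNLP server and post each caption sentence to it. The sketch below assumes a server listening on localhost:9000, CoreNLP’s default, and is an illustration rather than Red Hen’s production pipeline.

# corenlp_annotate.py -- sketch of annotating caption sentences with Stanford CoreNLP
# (assumes a CoreNLP server on localhost:9000; not Red Hen's production pipeline)
import json
import requests

CORENLP_URL = "http://localhost:9000"
PROPERTIES = {"annotators": "tokenize,ssplit,pos,lemma,ner",
              "outputFormat": "json"}

def annotate(sentence):
    response = requests.post(CORENLP_URL,
                             params={"properties": json.dumps(PROPERTIES)},
                             data=sentence.encode("utf-8"))
    response.raise_for_status()
    doc = response.json()
    # One tuple per token: surface form, part of speech, lemma, named-entity tag.
    return [(tok["word"], tok["pos"], tok["lemma"], tok["ner"])
            for sent in doc["sentences"] for tok in sent["tokens"]]

if __name__ == "__main__":
    print(annotate("The NewsScape Archive is curated at UCLA."))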

