
Linguistics Vanguard 2018; 20170041
Francis F. Steen, Anders Hougaard, Jungseock Joo, Inés Olza, Cristóbal Pagán
Cánovas, Anna Pleshakova, Soumya Ray, Peter Uhrig, Javier Valenzuela, Jacek Woźny
and Mark Turner*
Toward an infrastructure for data-driven
multimodal communication research
https://doi.org/10.1515/lingvan-2017-0041
Received September 7, 2017; accepted January 31, 2018
Abstract: Research into the multimodal dimensions of human communication faces a set of distinctive
methodological challenges. Collecting the datasets is resource-intensive, analysis often lacks peer validation,
and the absence of shared datasets makes it difficult to develop standards. External validity is hampered by
small datasets, yet large datasets are intractable. Red Hen Lab spearheads an international infrastructure
for data-driven multimodal communication research, facilitating an integrated cross-disciplinary workflow.
Linguists, communication scholars, statisticians, and computer scientists work together to develop research
questions, annotate training sets, and develop pattern discovery and machine learning tools that handle vast
collections of multimodal data, beyond the dreams of previous researchers. This infrastructure makes it pos-
sible for researchers at multiple sites to work in real-time in transdisciplinary teams. We review the vision,
progress, and prospects of this research consortium.
Keywords: multimodality; machine learning; automated parsing; corpora; research consortia.
1 Introduction
Human face-to-face communication has always taken place across multiple modalities: through gesture,
facial expression, posture, tone of voice, pacing, gaze direction, touch, and words. Elaborate multimodal
communication is a central and constantly active part of human cognition, in science, technology, engineer-
ing, mathematics, art, religion, crafts, social interaction, learning, innovation, memory, attention, travel, and
all other activities, whether goal-based or not. Cultures invest heavily to support this aspect of human life:
classical cultures emphasized the importance of rhetorical training, and today’s world is crowded with novel
technologies of multimodal communication, from television to social media, creating an unprecedented
trove of digital records. Communication skills involve higher-order cognition, precisely timed movements,
delicately modulated sounds, conceiving of the mental states of others from moment to moment, dynami-
cally coordinating with other agents, and a high level of contextual awareness (Duranti and Goodwin 1992;
Clark 1996).
From Pāṇini (Sharma 1987–2003) to Chomsky and McGilvray (2012), the systematic study of human com-
munication has been largely focused on the written representation of language: understandably so, as it is
highly structured and can be shared in the process of describing and explaining it. The full multimodal dimen-
sions of communication have a very brief history of scholarship and present new methodological challenges.
*Corresponding author: Mark Turner, Case Western Reserve University, Cleveland, OH, USA, E-mail: mark.turner@case.edu.
http://orcid.org/0000-0002-2089-5978
Francis F. Steen and Jungseock Joo: University of California Los Angeles, Los Angeles, CA, USA. http://orcid.org/0000-0003-2963-4077 (F.F. Steen)
Anders Hougaard: University of Southern Denmark, Odense, Denmark
Inés Olza and Cristóbal Pagán Cánovas: University of Navarra, Pamplona, Spain
Anna Pleshakova: University of Oxford, Oxford, UK
Soumya Ray: Case Western Reserve University, Cleveland, OH, USA
Peter Uhrig: FAU Erlangen-Nürnberg, Erlangen, Germany
Javier Valenzuela: University of Murcia, Murcia, Spain
Jacek Woźny: University of Wrocław, Wroclaw, Poland. http://orcid.org/0000-0003-0691-7090

Communicative behavior must be recorded with resource-intensive audiovisual technologies. Since the range
of available expressions is so wide, individual researchers need to specialize in specific modalities and con-
structions. Naturalistic data are typically not readily available; boutique collections from lab recordings take
their place. Large-scale datasets are required for systematic study, yet no single researcher has the required
time, resources, or motivation to create them. Worse, any group of researchers that succeeds in generating a
massive dataset of multimodal communication will quickly be overwhelmed, since linguistics lacks the tools
for mechanically searching and characterizing the material.
For the study of digitized written language, a wide range of technologies and tools has been
developed over the past decades in the contexts of corpus linguistics, computational linguistics, and artifi-
cial intelligence. In this context, carefully sampled corpora such as the British National Corpus (BNC) as well
as larger, less carefully sampled corpora have emerged along with corpus retrieval software such as BNCweb
(Hoffmann and Evert 2006) and its generalized and extended version, CQPweb (Hardie 2012), or the com-
mercial Sketch Engine (www.sketchengine.co.uk). While these are outstanding examples of research at the
intersection of computer science and linguistics, they have not yet embraced the full multimodal spectrum
of human communication, creating a well-defined disciplinary and interdisciplinary challenge.
At the same time, the social landscape of communication has exploded with new multimodal technolo-
gies, from television to social media, intruding on our most personal as well as our most public commu-
nicative functions. Since these exchanges are taking place in digital form, the age-old problem of how to
capture multimodal communication in a naturalistic setting is now fully tractable. This drastically reduces
the costs incurred by capturing and transcribing naturalistic spoken language such as the audio recordings
collected for the spoken demographic section of the BNC (see Crowdy 1995 for details). So far, only modest
attempts have been made to add audiovisual data to existing corpora; for instance, the Russian National
Corpus (www.ruscorpora.ru) began in 2010 to include a selection of recordings and movies from 1930 to
2007, with 4.6 million words of transcripts. Only recently has the construction of massive multimodal corpora
become feasible.
Such large-scale datasets present both an opportunity and a challenge for linguists. On the one hand, we
can now attest the presence, context, and frequency of known constructions in ecologically valid datasets,
extending, correcting, and validating decades of laboratory research. On the other hand, these datasets are
so large that they quickly swamp manual analysis. The challenge must be met with a new level of inter-
disciplinary collaboration between linguists and computer scientists. Both fields have much to gain. Com-
putational researchers gain insights into natural modes of communication, useful for designing good user
interfaces and natural interactions with robotic systems; linguists gain knowledge of tools and methods
from computer vision, audio signal processing and machine learning to analyze large amounts of data. We
see an opportunity to create a collaborative and distributed social and physical infrastructure for data-driven
multimodal communication research.
We can draw inspiration from other sciences in which cooperatives of researchers with diverse back-
grounds were established to share the gathering of data and the development of tools and analysis in real-
time. Faced with a similar mix of massive new datasets and a demand for radically new methodologies,
astronomy and genetics have undergone a comparable transformation in their disciplinary practices and
thrived. Genomics dramatically sped up its advances by creating worldwide consortia of researchers using
collaborative web platforms; see for instance the Mission Statement and Framework of the Global Alliance
for Genomics and Health (genomicsandhealth.org). In neuroscience, there are analogous initiatives, such as
the Brainhack project (Craddock et al. 2016).
Such cooperative frameworks exhibit novel social dynamics and facilitate rapid disciplinary progress. On
the one hand, information begins to flow across disciplinary boundaries, giving computational researchers
access to novel real-world problems and researchers in the target domain exposure to new methods and
skills. See, for example, the recent applications of multimodal computational methods in political science
(Joo et al. 2015), psychology (Martinez 2017), or cognitive film studies (Suchan and Bhatt 2016). Just as impor-
tant, the research results of one researcher (the actual data selection, annotation, and analysis) can be
provided immediately and substantively to the whole community, and replicated and built on in a meaningful
way.

In this paper, we describe the Distributed Little Red Hen Lab, a global laboratory and consortium
designed to facilitate large-scale collaborative research into multimodal communication. As part of this
project, we collect data on multimodal communication on a large scale, provide computational and stor-
age tools to manage data and aid in knowledge discovery, and provide means of iterative improvement by
integrating the results and feedback of researchers into the project.
Red Hen’s vision and program arise naturally from considerations that are common across all the sciences,
concerning how to improve the way we do science: by building an extensive, constantly-networked cooperative,
by developing sociological patterns of extensive real-time collaboration across that cooperative, and by
aggregating big data and developing new methods and tools that are deployed across it. Such considerations
have become inescapable for several disciplines, from biology to
materials science, linguistics to archeology, genomics to neuroscience, astronomy to computer science. Red
Hen brings these impulses to the science of human multimodal communication. As an example, for the past
four years, Red Hen has partnered with Google Summer of Code to connect computer science students from
around the world with expert mentors, generating a suite of new tools to analyze human communication.
Red Hen is not designed to be a service. Instead, it provides a framework for collaboration, pooling
expertise and resources. Access to the Red Hen tools and data is provided through the project website
(redhenlab.org), where researchers can both access data and contribute or provide feedback to the Red Hen
project.
A core activity of Red Hen Lab is an international effort to create the physical and social infrastructure
needed for the systematic study of multimodal communication. Key elements are data collection, data mining
tool development, and search engines.
2 Generating massive multimodal datasets
A shared dataset is an essential aspect of the infrastructure required for data-driven multimodal communica-
tion research. Red Hen is open to datasets in any area in which there are records of human communication.
This includes text, speech and audio recordings in any language, infant vocalization, art and sculpture, writ-
ing and notation systems, audiovisual records, architecture, signage, and of course, modern digital media.
Records and methods related to non-human communication or communication between species (e.g., border
collies responding to pointing gestures) are also naturally of interest to Red Hen. In principle, any recording in
any format of any human communication is suitable for inclusion in the archive, which consists of networked
data across the Red Hen cooperative, either natively digital or converted to digital form.
The most efficient way to acquire a massive multimodal and multilingual dataset is to record television,
a task that can be fully automated. Fortunately, section 108 of the U.S. Copyright Act authorizes libraries and
archives to record and store any broadcast of any audiovisual news program and to loan those data, within
some limits of due diligence, for the purpose of research. The NewsScape Archive of International Television
News (newsscape.library.ucla.edu) is Red Hen’s largest collection; as of November 2017, it included broadcasts from
51 networks, totaling 350,000 hours and occupying 120 terabytes. The collection dates back to 2005 and is
growing at around 5,000 shows a month. It is an official archive of the University of California, Los Angeles
(UCLA) Library, the digital continuation of UCLA’s Communication Studies Archive, initiated by Paul Rosen-
thal in 1972. The analog collection is in the process of being digitized, promising to add further years of
historical depth to the collection. Under Red Hen, it has been expanded to record television news in multiple
countries around the world, curated by local linguists participating in the Red Hen project. The NewsScape
Archive now includes, in rough order of representation, broadcasts in English, Spanish, German, French, Nor-
wegian, Swedish, Danish, Continental Portuguese, Brazilian Portuguese, Russian, Polish, Czech, Flemish,
Persian, Italian, Arabic, and Chinese. The system is fully automated and scales easily, using credit-card-sized
Raspberry Pi capture stations running custom open-source software.
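The capture software itself is not reproduced in this paper, but the general shape of such a station is easy to convey. The following minimal sketch, written in Python around an ffmpeg call, records a broadcast stream for a fixed duration and names the output by date, time, and source; the stream address, duration, and naming pattern are illustrative assumptions, not Red Hen’s actual configuration.

# capture_show.py -- minimal sketch of an automated broadcast capture job
# (illustrative only; the stream URL, duration, and file naming are assumptions,
# not Red Hen's capture software)
import subprocess
from datetime import datetime, timezone

STREAM_URL = "udp://239.255.0.1:1234"   # hypothetical DVB multicast address
SOURCE = "US_ExampleNews"               # hypothetical network label
DURATION_SECONDS = 3600                 # record one hour

def record_show():
    start = datetime.now(timezone.utc)
    outfile = start.strftime("%Y-%m-%d_%H%M_") + SOURCE + ".mp4"
    # Copy the broadcast stream to disk without re-encoding.
    subprocess.run(
        ["ffmpeg", "-y", "-i", STREAM_URL, "-t", str(DURATION_SECONDS),
         "-c", "copy", outfile],
        check=True)
    return outfile

if __name__ == "__main__":
    record_show()

A job of this kind can be scheduled, for example with cron, at each show’s start time, which is what makes the acquisition fully automatic.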
This television news dataset includes hard and soft news, such as talk shows and comedy, along with
B-roll of surveillance video, crowd-sourced videos, recordings of public events where participants do not even
know they are being recorded, etc. These genres contain a range of registers that include banter, unscripted
conversations, and playful interviews. The studio components of the television news shows in NewsScape
typically also contain a great number of unscripted or partially improvised communicative events. The most
constrained register, in which a speaker reads a text or recites a pre-prepared speech more or less verbatim, is
no longer the standard way to communicate on television. This makes NewsScape a rich resource for study-
ing largely spontaneous or unconscious aspects of multimodal communication, along with communicative
behaviors associated with a range of formal registers.
Red Hen’s infrastructure and tools also permit the incorporation of existing datasets, such as hand-
crafted collections of experimental data. The news material constitutes the bulk of the current collection, as
the recording of this content is clearly authorized by section 108 of the US Copyright Act. Smaller datasets generated by individual researchers
and teams, including student projects, are being added and described, and will be the subject of future
publications.
Red Hen proposes two complementary strategies to deal with recordings that are protected by confiden-
tiality laws. First, although lab recordings are typically protected by Institutional Review Board (IRB) regulations,
it seems plausible that an IRB might approve machine analysis of such recordings. Results of
such analysis may be shareable, provided the data is anonymized. Second, video recordings can be submit-
ted to a sketch filter, which removes textures critical to personal identification, yet retains structural elements
of multimodal communication (Diemer et al. 2016).
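To illustrate the kind of filter this refers to, the sketch below applies OpenCV’s pencil-sketch transform to every frame of a video, discarding photographic texture while keeping the outlines of the face, hands, and body. The particular transform and its parameters are an illustrative assumption, not the published filter of Diemer et al. (2016).

# sketch_filter.py -- minimal sketch of a de-identifying video filter
# (illustrative only; not the filter described in Diemer et al. 2016)
import cv2

def sketch_video(infile, outfile):
    cap = cv2.VideoCapture(infile)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(outfile, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # pencilSketch strips texture detail (skin, clothing, background)
        # but keeps structural outlines such as face, arm, and hand contours.
        gray, _color = cv2.pencilSketch(frame, sigma_s=60, sigma_r=0.07,
                                        shade_factor=0.05)
        writer.write(cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR))
    cap.release()
    writer.release()

if __name__ == "__main__":
    sketch_video("lab_recording.mp4", "lab_recording_sketch.mp4")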
3 Creating and searching metadata and annotations
Vast multimodal datasets are a boon and a curse. Linguists need them to validate existing constructions in
ecologically valid datasets, and can revel in the prospect of testing an entire generation of new hypotheses,
asking questions we simply lacked the data to answer. However, to effectively convert such data to knowl-
edge, we need automated search capabilities, and to search, we need machine-readable transcripts, ideally
enriched with metadata and annotations. Red Hen’s annotation process relies on multi-level feedback
between linguists and computer scientists, aimed at training computers to perform tasks that generate
annotations according to the linguists’ specifications.
The video stream is compressed to a 640 × 480 or similar picture size at 450 kb/s; the audio stream is
a stereo signal with a sampling rate of 44.1 kHz compressed to a bitrate of 96 kb/s. Red Hen expects that
the 44.1 kHz sampling rate and the 96 kb/s bitrate will be sufficient to make most of the audible frequencies
usable for spectrograms, but detailed tests are yet to be conducted.
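The paper does not spell out the encoder invocation behind these figures; a transcoding command matching them could look like the following sketch, which wraps ffmpeg in Python (the settings shown are simply the stated targets, not Red Hen’s actual encoding pipeline).

# transcode.py -- sketch of an ffmpeg call matching the stated targets:
# 640 x 480 video at ~450 kb/s, stereo audio at 44.1 kHz and 96 kb/s
# (illustrative; not Red Hen's actual encoding pipeline)
import subprocess

def transcode(infile, outfile):
    subprocess.run(
        ["ffmpeg", "-y", "-i", infile,
         "-vf", "scale=640:480",   # target picture size
         "-b:v", "450k",           # target video bitrate
         "-ac", "2",               # stereo audio
         "-ar", "44100",           # 44.1 kHz sampling rate
         "-b:a", "96k",            # 96 kb/s audio bitrate
         outfile],
        check=True)

if __name__ == "__main__":
    transcode("capture.ts", "capture_640x480.mp4")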
Red Hen textual data is encoded in UTF-8, stored in the standard comma-separated values format, and
named to identify the time, source, and type of the recording. The data is stored on UCLA Library servers
and elsewhere within the Red Hen network as needed. This provides the input to a variety of custom search
engines.
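As a hypothetical illustration of this convention, the sketch below parses a file name that encodes time, source, and type, and reads its UTF-8 comma-separated content; the specific naming pattern is assumed for the example and is not Red Hen’s documented scheme.

# parse_datafile.py -- hypothetical illustration of a file name encoding
# time, source, and type, with UTF-8 comma-separated content
# (the naming pattern is an assumption, not Red Hen's documented scheme)
import csv
from datetime import datetime

def parse_name(filename):
    # e.g. "2017-11-01_1830_US_ExampleNews_ocr.csv"  (hypothetical layout)
    stem = filename.rsplit(".", 1)[0]
    date, time, country, network, filetype = stem.split("_", 4)
    return {
        "time": datetime.strptime(date + " " + time, "%Y-%m-%d %H%M"),
        "source": country + "_" + network,
        "type": filetype,
    }

def read_rows(filename):
    with open(filename, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))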
A series of pipelines process these data, using customized open-source software. Some tools require rel-
atively little customization and can be deployed without deep modifications. For example, transcripts are
automatically extracted from television video in the form of subtitles. For US broadcasts, commercials are
automatically detected and annotated. Additional text that is written on the television screen is also extracted,
using tesseract-ocr (github.com/tesseract-ocr) with significant customizations for eight different languages,
examining frames at one-second intervals and retaining screen placement information.
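A stripped-down version of that kind of on-screen text extraction, using pytesseract instead of Red Hen’s customized pipeline, might look like the sketch below: frames are sampled at one-second intervals and each recognized word is stored together with its screen coordinates.

# ocr_frames.py -- sketch of on-screen text extraction at one-second intervals,
# retaining screen placement (uses pytesseract; Red Hen's customized pipeline differs)
import cv2
import pytesseract

def ocr_video(infile, lang="eng"):
    cap = cv2.VideoCapture(infile)
    results = []
    second = 0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, second * 1000)  # jump to the next sampling point
        ok, frame = cap.read()
        if not ok:
            break
        data = pytesseract.image_to_data(frame, lang=lang,
                                         output_type=pytesseract.Output.DICT)
        for i, word in enumerate(data["text"]):
            if word.strip():
                results.append({
                    "second": second,
                    "text": word,
                    # screen placement of the recognized word
                    "box": (data["left"][i], data["top"][i],
                            data["width"][i], data["height"][i]),
                })
        second += 1
    cap.release()
    return results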
In multimodal data analysis, timing is of the essence. Centisecond timestamps in UTC permit precise cor-
relations of data extracted from different modalities. To validate a multimodal construction, we need reliable
timestamps at the word level, so that individual words can be shown to co-occur with a gesture of the eyes, the
face, the shoulders, the arms, and the hands. To achieve this, the caption text is first parsed into sentences,
using custom software developed by manual inspection of abbreviations and conventions characteristic of
the medium. These sentences are fed into Stanford CoreNLP (stanfordnlp.github.io/CoreNLP), a set of nat-
ural language processing utilities providing parts of speech, lemmas, and named entities in half a dozen languages.
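A minimal way to obtain such annotations programmatically is to run the CoreNLP server and post each caption sentence to it. The sketch below assumes a server listening on localhost:9000, CoreNLP’s default, and is an illustration rather than Red Hen’s production pipeline.

# corenlp_annotate.py -- sketch of annotating caption sentences with Stanford CoreNLP
# (assumes a CoreNLP server on localhost:9000; not Red Hen's production pipeline)
import json
import requests

CORENLP_URL = "http://localhost:9000"
PROPERTIES = {"annotators": "tokenize,ssplit,pos,lemma,ner",
              "outputFormat": "json"}

def annotate(sentence):
    response = requests.post(CORENLP_URL,
                             params={"properties": json.dumps(PROPERTIES)},
                             data=sentence.encode("utf-8"))
    response.raise_for_status()
    doc = response.json()
    # One tuple per token: surface form, part of speech, lemma, named-entity tag.
    return [(tok["word"], tok["pos"], tok["lemma"], tok["ner"])
            for sent in doc["sentences"] for tok in sent["tokens"]]

if __name__ == "__main__":
    print(annotate("The NewsScape Archive is curated at UCLA."))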

