
Inferring colocation and conversation networks from privacy-sensitive audio with implications for computational social science

TL;DR: New methods for inferring colocation and conversation networks from privacy-sensitive audio are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year, and show that networks derived from colocation and conversation inferences are quite different.
Abstract: New technologies have made it possible to collect information about social networks as they are acted and observed in the wild, instead of as they are reported in retrospective surveys. These technologies offer opportunities to address many new research questions: How can meaningful information about social interaction be extracted from automatically recorded raw data on human behavior? What can we learn about social networks from such fine-grained behavioral data? And how can all of this be done while protecting privacy? With the goal of addressing these questions, this article presents new methods for inferring colocation and conversation networks from privacy-sensitive audio. These methods are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year. The resulting analysis shows that networks derived from colocation and conversation inferences are quite different. This distinction can inform future research in computational social science, especially work that only measures colocation or employs colocation data as a proxy for conversation networks.

Summary

1. INTRODUCTION

  • The automated recording of real-world speech is crucial because, despite the rise in on-line interactions, face-to-face communication is still people’s primary mode of social interaction [Baym et al. 2004].
  • This requirement gives rise to two problems.
  • Ideally, a privacy-sensitive recording technique will process incoming audio in order to discard any information deemed too invasive while still preserving data useful for sociological inquiry.

2. PRIVACY-SENSITIVE CONVERSATION MODELING

  • When collecting situated conversation data it is necessary to protect the privacy of not just people who willingly consent to wear a recording device, but also of those who may come within range of the microphones.
  • For this purpose, destructive processing of the audio should yield a feature set that prevents us from reconstructing intelligible speech or inferring the identities of anyone not wearing a device.
  • At the same time, the features must still contain enough information to allow conversations to be found and meaningful inferences made about those conversations.
  • Energy can reveal a person or group’s interest in the conversation [Gatica-Perez et al. 2005].
  • The method proposed by Corman and Scott [1994] computes normalized cross-correlation between raw audio signals and concludes that two people are in a conversation if their correlation coefficients are above a threshold estimated from labeled data.
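The thresholded cross-correlation criterion just described can be sketched as follows. This is an illustrative reimplementation, not Corman and Scott's code, and the default threshold of 0.5 is a placeholder: the original method estimates its threshold from labeled data.

```python
import numpy as np

def in_conversation_xcorr(sig_a, sig_b, threshold=0.5):
    """Declare two separately recorded streams 'in conversation' when their
    normalized zero-lag cross-correlation exceeds a threshold.
    The 0.5 default is a placeholder, not the estimated value."""
    a = np.asarray(sig_a, dtype=float)
    b = np.asarray(sig_b, dtype=float)
    a = a - a.mean()  # remove DC offset so correlation reflects co-variation
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    if denom == 0:
        return False  # a constant (silent) stream correlates with nothing
    return bool(a.dot(b) / denom > threshold)
```

Note that, as the summary observes, applying this to raw audio signals does not protect privacy; the same rule could in principle be run over privacy-sensitive feature streams instead.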

2.1 Privacy-Sensitive Features

  • Following Basu [2002], their approach to extracting non-linguistic speech information builds on methods for detecting voiced human speech.
  • In a spectrogram, time runs along the x-axis and frequencies increase along the y-axis; color indicates energy at a given frequency.
  • The harmonicity in the spectrogram shows that voiced speech has a low spectral entropy, compared to non-voiced regions.
  • Narrow spectrum noise can also create strong autocorrelation peaks.
  • The article then gives the precise procedure for computing these features.

2.2 Extracting Conversation Data

  • To gather data about face-to-face conversations, the authors ask multiple people to wear recording devices each of which saves separate streams of the privacy-sensitive features described above.
  • Finally, once colocated groups and speakers have been identified, the authors can conclude that people who are colocated and speaking are in conversation together and then extract further features of their conversation (Section 2.3).
  • For each recorded stream, the authors use the forward-backward algorithm [Rabiner 1989] to infer p(V ta|xa): the posterior probability of voiced speech in each frame, given the entire recorded stream.
  • Successful colocation detection requires clustering together segments of data from miked individuals when they are in a conversation.
  • There is a sharp peak of high mutual information values.
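The per-frame voicing posterior p(V_t | x) mentioned above can be computed with the standard forward-backward recursions. The sketch below is a generic two-state (unvoiced/voiced) HMM smoother with assumed, untrained parameters; it illustrates the algorithm, not the authors' fitted model.

```python
import numpy as np

def voicing_posterior(likelihoods, trans, prior):
    """Forward-backward smoothing for a 2-state HMM (0=unvoiced, 1=voiced).
    likelihoods: (T, 2) per-frame observation likelihoods p(x_t | V_t)
    trans:       (2, 2) transition matrix, trans[i, j] = p(V_t=j | V_{t-1}=i)
    prior:       (2,)   initial state distribution
    Returns a length-T array of posteriors p(V_t = voiced | x_1..x_T)."""
    T = len(likelihoods)
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    # Forward pass, normalized at each step for numerical stability.
    alpha[0] = prior * likelihoods[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * likelihoods[t]
        alpha[t] /= alpha[t].sum()
    # Backward pass.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (likelihoods[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # Combine and renormalize per frame.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[:, 1]
```

Because the backward pass conditions on the entire stream, the posterior is "smooth" over time: isolated noisy frames are pulled toward the surrounding state.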

2.3 Conversation Data

  • The steps described so far provide ways of determining who is physically colocated with whom and who is speaking when, but they do not provide a method for determining who is in conversation with whom.
  • Thus, an evaluation comparing the resulting inferred conversations to the “in conversation with” ground truth label yields exactly the same results as comparing their voicing-based colocation inference to the “in conversation with” ground truth.
  • The table of those results is identical to Table II, and thus the authors omit it for space.
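The combination rule stated above (colocated and speaking implies in conversation together) can be written down directly. This is a hypothetical sketch with names of our own choosing; whether silent listeners in a colocated group should also be counted as conversation members is a modeling choice the summary leaves open, and this sketch keeps only the speakers.

```python
def conversation_groups(colocated_groups, spoke):
    """colocated_groups: iterable of sets of person ids inferred colocated
    in a time window; spoke: set of person ids inferred to have spoken
    in that window. Returns the groups deemed to be in conversation:
    within each colocated group, the members who spoke (2+ people)."""
    out = []
    for group in colocated_groups:
        talkers = set(group) & spoke
        if len(talkers) >= 2:  # a conversation needs at least two speakers
            out.append(talkers)
    return out
```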

3. THE SPOKEN NETWORKS CORPUS

  • Using the conversation detection methods from the previous section, the authors collected a corpus of real-world face-to-face conversations among 24 research subjects over an extended period of time.
  • This section first contrasts their effort with earlier data collection projects (Section 3.1), then explains the procedure used to gather the data (Section 3.2), provides summary statistics about the data (Section 3.3), and shows novel measures of social behavior that can be easily extracted from the data (Section 3.4).

3.2 Data Collection Method

  • The data collection effort presented in this work descends from the original sociometer study, but differs in the research context and design.
  • The subjects recorded data everywhere they went, indoors and outdoors: class, lunch, study groups, meetings, spontaneous social gatherings, etc. Data was saved to a 2-GB Secure Digital (SD) flash memory card on the PDA.
  • Research subjects completed questionnaires before beginning the school year, at the end of each data collection week, and following the end of the school year.
  • All conversation data discussed here was collected using the same platform: an HP iPAQ hx4700 PDA with an attached multi-sensor board containing eight different sensors.
  • All of the PDA’s software and settings were stored in volatile RAM and were completely lost if the battery fully discharged, which led to many Monday mornings of lost recording time while PDAs were reconfigured.

3.3 Collected Data

  • Figure 8 shows beanplots [Kampstra 2008] of the average number of hours collected per day for each collection week.
  • The first three weeks (representing the first three months of the academic year) show an increase in the amount of data collected as the subjects initially learned how to use the devices and the authors resolved battery and software problems as previously described.
  • Recording hours diminished slightly in the later weeks, due in part to technical problems with the cables and perhaps because the participants became fatigued or the study became less novel to them.
  • While there is no moment when all subjects are recording (the maximum number of simultaneous recordings is 21), there is enough overlap in the data for it to contain many interactions among their subjects.

3.4 Basic Behavioral Inferences

  • Data processing follows the three steps described in Section 2: colocation detection, speaker segmentation, and conversation extraction.
  • During this academic term, most subjects attended a class that met from 10:30 am to 12:00 pm on these days, so many students arrived at school and began recording before that class.
  • Figure 11 shows the inferences for colocation using both energy and voicing mutual information, as well as the conversation grouping.
  • Because of that, when few people are recording, even a small group of interacting subjects will appear as a larger proportion in the plot.

3.5 Basic Network Analyses

  • Constructing networks from survey data is usually simple: they are often just the union of self-reported ties for each actor in the network.
  • As with the edge value distributions, the values for the colocation degrees are much higher than those for conversation degrees, and the two kinds of networks seem to be very different with regard to degree.
  • As with the clustering coefficient, the triangle count does not generalize as easily to weighted networks as degree and density, but, following Saramäki et al. [2007], the authors can define a weighted triangle value as T_ijk = (Y_ij · Y_ik · Y_jk)^(1/3) (Eq. 19).
  • That difference may explain the bimodal colocation degree distributions of weeks 4 through 7, where there seems to be a distinction between pairs who spend much time together and pairs that only come together in passing.
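The weighted triangle value T_ijk = (Y_ij · Y_ik · Y_jk)^(1/3), the geometric mean of the three edge weights, can be computed over all closed triples of a weighted network. This is an illustrative sketch; the edge-weight representation (a dict keyed by unordered pairs) is our own choice, not the authors'.

```python
import itertools

def weighted_triangle_values(Y):
    """Y: dict mapping frozenset({i, j}) -> edge weight Y_ij of an
    undirected weighted network. Returns {frozenset({i, j, k}): T_ijk}
    for every closed triangle, with T_ijk the geometric mean of its
    three edge weights; triples with a missing edge are skipped."""
    nodes = sorted({n for edge in Y for n in edge})
    out = {}
    for i, j, k in itertools.combinations(nodes, 3):
        e1, e2, e3 = frozenset({i, j}), frozenset({i, k}), frozenset({j, k})
        if e1 in Y and e2 in Y and e3 in Y:
            out[frozenset({i, j, k})] = (Y[e1] * Y[e2] * Y[e3]) ** (1.0 / 3.0)
    return out
```

Using the geometric mean means a triangle is only as strong as its weakest tie: one near-zero edge pulls T_ijk toward zero even if the other two edges are heavy.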

4. CONCLUSION

  • The authors have outlined a set of privacy-sensitive features that can be computed from incoming audio data in real-time.
  • The authors have shown how to use those features to determine who was physically colocated with whom, both at the granularity of a room in a building and at the more elastic “acoustic proximity” needed to have a face-to-face conversation.
  • The authors have used the privacy-sensitive features to infer who was speaking when, and combined those inferences with colocation inference to determine who was in conversation with whom.
  • This conversation detection can handle conversations with any number of participants, extending beyond previous methods that were limited to dyadic conversations only.
  • The authors constructed weighted networks of social behavior and examined basic descriptive statistics in order to compare social networks defined by colocation events to networks defined by conversation events.


Inferring Colocation and Conversation
Networks from Privacy-Sensitive Audio
with Implications for Computational
Social Science
DANNY WYATT
University of Washington
TANZEEM CHOUDHURY
Dartmouth College
JEFF BILMES
University of Washington
and
JAMES A. KITTS
Columbia University
This work was supported by NSF grants IIS-0433637 and IIS-0845683.
Authors’ addresses: D. Wyatt, University of Washington, Department of Computer Science and
Engineering, Box 352350, Seattle, WA 98195-2350; email: danny@cs.washington.edu; T. Choud-
hury, Dartmouth College, 6211 Sudikoff Lab, Hanover, NH 03755; email: tanzeem.choudhury@
dartmouth.edu; J. A. Bilmes, University of Washington, Department of Electrical Engi-
neering, Box 352500, Seattle, WA 98195-2500; email: bilmes@u.washington.edu; James A. Kitts,
Columbia University, Graduate School of Business, 704 Uris Hall, 3022 Broadway, New York, NY
10027; email: jak2190@columbia.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2011 ACM 2157-6904/2011/01-ART7 $10.00
DOI 10.1145/1889681.1889688 http://doi.acm.org/10.1145/1889681.1889688
ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 1, Article 7, Pub. date: January 2011.

7:2
D. Wyatt et al.
Categories and Subject Descriptors: G.3 [Probability and Statistics]: Probabilistic Algorithms;
H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing
General Terms: Algorithms, Experimentation, Human Factors, Security
Additional Key Words and Phrases: Social networks, mobile sensing
ACM Reference Format:
Wyatt, D., Choudhury, T., Bilmes, J., and Kitts, J. A. 2011. Inferring colocation and conversation
networks from privacy-sensitive audio with implications for computational social science. ACM
Trans. Intell. Syst. Technol. 2, 1, Article 7 (January 2011), 41 pages.
DOI = 10.1145/1889681.1889688 http://doi.acm.org/10.1145/1889681.1889688
1. INTRODUCTION
Much social network research has relied on data collected via surveys that ask
subjects to report their social ties (e.g., Goodreau et al. [2009]) or recall their pre-
vious social interactions (e.g., Lazega and van Duijn [1997]). When self-reports
of recalled interactions have been compared to independent observations, how-
ever, the reliability of subjects’ answers has been shockingly poor [Killworth
and Bernard 1976; Bernard and Killworth 1977, 1979; Bernard et al. 1980,
1982]. An early study came to the dire conclusion that “people do not know, with
any accuracy, those with whom they communicate” [Bernard and Killworth
1977]. Later studies found that durable, long-term patterns of communica-
tion are reliably reported, but moment-to-moment social interactions are not
[Freeman et al. 1987]. More troubling for research into network structure, in-
dividuals tend to “fill in” non-existent interactions if they would increase the
transitivity of the network [Freeman 1992]. Faced with these challenges, some
researchers lamented that “unfortunately, most naturally occurring interactive
behavior (the stuff of which networks are built) is neither observable nor con-
veniently recorded in some automated fashion” [Killworth and Bernard 1979].
That statement is no longer true. New technologies have made it possible
to collect information about social behavior as it is enacted, instead of as it is
recalled after-the-fact. Phone calls, text messages, emails, instant messages,
on-line chat sessions, social media posts, and any other kind of electronically
mediated communication can be automatically recorded for large groups of
people, over long periods of time. Portable audio recording devices have grown
in capacity while becoming smaller, cheaper, and more powerful, making it
easier to record face-to-face conversations. In fact, wearable sensors now allow
us to automatically record natural and spontaneous speech for an entire group
of people over a long period of time.
The automated recording of real-world speech is crucial because, despite the
rise in on-line interactions, face-to-face communication is still people’s primary
mode of social interaction [Baym et al. 2004]. Unlike methods previously em-
ployed for speech data derived from laboratory contexts, our proposed method
would capture truly spontaneous speech that arises in situ as people enact their
actual, lived relationships. For that reason, we refer to such data as situated
speech data—data gathered in the wild—to contrast it with other speech data
recorded in constrained or contrived settings.
Of course, obstacles to gathering situated spontaneous speech still remain,
especially concerns about privacy. To capture truly natural interactions while
providing a full picture of a social network, we must record people as they
freely go about their lives. This requirement gives rise to two problems. First,
uninvolved parties could be recorded without their consent—a scenario that,
if raw audio is involved, is always unethical and often illegal. Second, people
may change their behavior if they know they are being recorded. For both
of those reasons, a level of privacy must be maintained. Ideally, a privacy-
sensitive recording technique will process incoming audio in order to discard
any information deemed too invasive while still preserving data useful for
sociological inquiry.
This dilemma illuminates what is perhaps a fundamental trade-off between
privacy and quality when automatically recording social behavior. Subjects are
unlikely to consent to large-scale, unrestricted recordings of their behavior,
so some sociologically useful information must almost surely be destroyed.
Therefore, a set of features that allows us to balance this trade-off between privacy
and quality is needed.
In this article, we present exactly such a set of privacy-sensitive features,
together with a method for using this feature set to find colocation and conver-
sation events in separately recorded streams of audio data. In evaluation using
non-privacy-preserving test data—where access to ground truth is possible—
our method performs better than earlier methods.
We then use our proposed method to derive networks of colocation and face-
to-face conversation among 24 graduate students over the course of an academic
year. Networks created from colocations and conversations appear to be quite
different—a result that can impact and inform future research in computa-
tional social science.
The remainder of this article is divided into two broad sections. Section 2
presents the methods for discovering physically colocated and conversing peo-
ple from privacy-sensitive audio data, then assesses the performance of these
methods. Section 3 covers the Spoken Networks project, a data collection effort
that employed the proposed methods to study a real-world network over an
extended period of time, demonstrating new insights that may be available
through these lenses.
2. PRIVACY-SENSITIVE CONVERSATION MODELING
When collecting situated conversation data it is necessary to protect the privacy
of not just people who willingly consent to wear a recording device, but also
of those who may come within range of the microphones. For this purpose,
destructive processing of the audio should yield a feature set that prevents us
from reconstructing intelligible speech or inferring the identities of anyone not
wearing a device. A further constraint on the feature set is that all features
must be computed in real-time within the limited computational resources of
a wearable device—no raw audio should ever be stored, even temporarily.
At the same time, the features must still contain enough information to
allow conversations to be found and meaningful inferences made about those
Fig. 1. Conceptual schematic of the source-filter model: a glottal source spectrum, shaped by the vocal-tract filter, yields the speech spectrum (0–3000 Hz shown).
conversations. Fortunately, the nonlinguistic aspects of a conversation—who
speaks when and for how long, how loud, and at what pitch—still allow for
many useful analyses. Interruptions and speaking time reveal information
about status and dominance [Hawkins 1991]. Speaking rate reveals informa-
tion about a speaker’s level of mental activity [Hurlburt et al. 2002]. Energy
(loudness) can reveal a person or group’s interest in the conversation [Gatica-
Perez et al. 2005]. Pitch may be used for inferring emotion [Dellaert et al. 1996],
and energy and duration of voiced and unvoiced regions are also informative
emotional features [Schuller et al. 2004].
Here we present a set of privacy-sensitive features that can be extracted
from an audio stream in real-time (Section 2.1), along with methods for using
those features to automatically determine who is in conversation with whom
(Section 2.2) and how people are speaking (Section 2.3).
Related Work. To the best of our knowledge, prior to this research, there
were only two existing methods for finding conversations in separately recorded
streams of audio. The method proposed by Corman and Scott [1994] computes
normalized cross-correlation between raw audio signals and concludes that
two people are in a conversation if their correlation coefficients are above a
threshold estimated from labeled data. Obviously, using raw audio does not
protect privacy, but a privacy-sensitive variant of their method is considered
below. Similarly, the method proposed by Basu [2002] computes the mutual
information between binary signals that represent voiced/unvoiced speech and
places two people in a conversation if their mutual information is above a
pre-specified threshold. Our work extends Basu’s method in three important
ways: (i) to detect multiperson conversations (not just dyadic), (ii) to operate at
a finer time granularity while still producing a “smooth” inference over time,
and (iii) to learn its threshold in an unsupervised manner.
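Basu's dyadic criterion can be illustrated with a direct mutual-information computation over two aligned binary voiced/unvoiced indicator streams. This is our sketch of the quantity being thresholded; the pre-specified threshold itself is not reproduced here.

```python
import numpy as np

def binary_mutual_information(u, v):
    """Mutual information (in bits) between two aligned binary
    voiced/unvoiced indicator sequences u and v, estimated from
    their empirical joint distribution."""
    u = np.asarray(u, dtype=int)
    v = np.asarray(v, dtype=int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((u == a) & (v == b))  # joint probability
            p_a = np.mean(u == a)                # marginals
            p_b = np.mean(v == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi
```

The intuition: people in the same conversation take turns, so their voicing streams are strongly (anti-)correlated and the mutual information is high, whereas voicing streams of people in separate conversations are close to independent and the mutual information is near zero.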
2.1 Privacy-Sensitive Features
Following Basu [2002], our approach to extracting non-linguistic speech infor-
mation builds on methods for detecting voiced human speech. A basic model for
the production of human speech is the standard source-filter model [Quatieri
2001] shown in Figure 1. As its name suggests, the source-filter model posits
two semi-independent components of speech production: (1) a source sound
that is generated by the glottis and passed through (2) the filter, realized by
the vocal tract, that shapes the spectrum of the source.
The source can be voiced or unvoiced. If it is voiced, the vocal cords are
vibrating at what is called the fundamental frequency, or F0, which constitutes
the pitch at which the person is speaking. A sequence of speech will alternate
rapidly between voiced and unvoiced segments. Prosodic features of speech—
intonation, stress, duration—are described by how the fundamental frequency
and energy (volume) change during speech.
The source sound is shaped into words by changing the shape of the vocal
tract. It is the frequency response of the vocal tract, particularly the resonant
peaks known as formants, that contains information about the phonemes that
are constituent parts of spoken words. Any processing of the audio that removes
information about the formants will ensure that intelligible speech cannot be
synthesized from the signal that remains.
Thus, to find conversations and retain information about how people are
speaking, we save information about the source while discarding (almost) all
information about the filter. We argue below that this preserves sociologically
useful information.
The first step in that process is finding voiced speech. Figure 2(a) shows the
spectrogram for a male voice saying the phrase “University of Washington Spo-
ken Networks.” In a spectrogram, time runs along the x-axis and frequencies
increase along the y-axis; color indicates energy at a given frequency. In this ex-
ample, all of the phonemes are voiced except those for “s,” “t,” “sh,” “p,” and “k.”
The strong harmonics are indicators of voiced speech and we take advantage
of that harmonicity to find segments of voiced speech.
Three features that have been shown to be useful for robustly detecting
voiced speech under varying noise conditions are: (i) noninitial maximum auto-
correlation peak, (ii) the total number of autocorrelation peaks, and (iii) relative
spectral entropy [Basu 2002]. To provide an intuition for the first two features,
Figure 2(b) shows the autocorrelogram for the example phrase. As in the spec-
trogram, time runs along the x-axis. The y-axis shows increasing lags at which
the autocorrelation is computed, and colors show the value of the autocorrela-
tion. The voiced segments show fewer, stronger peaks. All three features are
shown in Figure 2(c). During voiced segments, the number of autocorrelation
peaks drops, while the maximum autocorrelation value and relative spectral
entropy rise.
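The two autocorrelation-based features can be sketched for a single frame as follows. This is an illustrative computation: frame length, windowing, normalization, and peak-picking details are our assumptions, not the authors' exact settings.

```python
import numpy as np

def autocorr_features(frame):
    """For one audio frame, return (maximum noninitial autocorrelation
    peak, number of autocorrelation peaks). Voiced frames tend to show
    a high maximum peak (strong periodicity) and few peaks overall."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # One-sided autocorrelation, normalized so ac[0] == 1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    denom = ac[0] if ac[0] != 0 else 1.0
    ac = ac / denom
    # Local maxima strictly inside the lag range, excluding lag 0.
    peaks = [t for t in range(1, len(ac) - 1)
             if ac[t] > ac[t - 1] and ac[t] > ac[t + 1]]
    max_peak = max((ac[t] for t in peaks), default=0.0)
    return max_peak, len(peaks)
```

A periodic (voiced-like) frame yields a large peak at the lag of its pitch period, while white noise yields only small, scattered peaks.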
The harmonicity in the spectrogram shows that voiced speech has a low
spectral entropy, compared to non-voiced regions. However, in many environ-
ments there can be noise centered strongly at a specific frequency. Figure 2(a)
shows two possible examples of such noise: a low frequency hum (from 300
to 500 Hz) that may be an air conditioner, and a sharp high frequency noise
(around 6400 Hz) that is probably a computer fan or hard drive. Such narrow
spectrum noise will lower the general environmental spectral entropy. The rel-
ative spectral entropy is the relative entropy (also known as Kullback-Leibler
divergence, see Eq. (2)) between an instantaneous normalized spectrum and
the mean normalized spectrum for a much longer window of time. Relative
spectral entropy captures the quick change in entropy caused by short seg-
ments of voiced speech while smoothing away any environmental reductions in
entropy. Narrow spectrum noise can also create strong autocorrelation peaks.
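The relative spectral entropy just described can be written directly as the KL divergence between the instantaneous normalized spectrum and a long-window mean normalized spectrum. This sketch follows the description above; the spectra are assumed to be precomputed magnitude spectra, and the smoothing constant is ours.

```python
import numpy as np

def relative_spectral_entropy(frame_spectrum, mean_spectrum, eps=1e-12):
    """KL divergence D(p || q) between the current frame's normalized
    magnitude spectrum p and the long-window mean normalized spectrum q.
    Stationary narrow-band noise appears in both p and q and so is
    smoothed away; short bursts of voiced speech raise the divergence."""
    p = np.asarray(frame_spectrum, dtype=float)
    q = np.asarray(mean_spectrum, dtype=float)
    p = p / (p.sum() + eps)  # normalize to probability distributions
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

When the frame looks like the long-term environment (e.g., steady air-conditioner hum), p is close to q and the divergence is near zero; a harmonic burst of voiced speech makes p deviate sharply from q.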

References
Journal ArticleDOI
04 Jun 1998-Nature
TL;DR: Simple models of networks that can be tuned through the middle ground between regular and random are explored: regular networks 'rewired' to introduce increasing amounts of disorder, which prove to be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs.
Abstract: Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self-organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes. Here we explore simple models of networks that can be tuned through this middle ground: regular networks 'rewired' to introduce increasing amounts of disorder. We find that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. We call them 'small-world' networks, by analogy with the small-world phenomenon (popularly known as six degrees of separation). The neural network of the worm Caenorhabditis elegans, the power grid of the western United States, and the collaboration graph of film actors are shown to be small-world networks. Models of dynamical systems with small-world coupling display enhanced signal-propagation speed, computational power, and synchronizability. In particular, infectious diseases spread more easily in small-world networks than in regular lattices.

39,297 citations
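The 'rewiring' procedure described in the abstract can be sketched in a minimal, illustrative implementation; the parameter values in the usage line are arbitrary, not taken from the paper:

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Ring lattice of n nodes, each tied to its k nearest neighbors,
    with every edge rewired to a random new endpoint with probability p."""
    rng = random.Random(seed)
    lattice = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            lattice.add((i, (i + j) % n))  # each undirected edge appears once
    edges = set()
    for u, v in lattice:
        if rng.random() < p:
            w = rng.randrange(n)
            # avoid self-loops and duplicate rewired edges (this sketch does
            # not guard against collisions with not-yet-processed lattice edges)
            while w == u or (u, w) in edges or (w, u) in edges:
                w = rng.randrange(n)
            edges.add((u, w))
        else:
            edges.add((u, v))
    return edges

# p = 0 keeps the regular lattice; p = 1 approaches a random graph
graph = watts_strogatz(20, 4, 0.1)
```

Sweeping p between 0 and 1 and measuring clustering and characteristic path length reproduces the paper's interpolation between regular and random topologies.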

Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the author provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations
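The first of the tutorial's three fundamental problems (scoring an observation sequence against a model) is classically solved with the forward algorithm, which can be sketched as follows; the two-coin parameters are invented for illustration, echoing the tutorial's coin-tossing example:

```python
def forward(pi, A, B, obs):
    """Likelihood of an observation sequence under an HMM.

    pi: initial state distribution, A: state-transition matrix,
    B: emission matrix with B[i][o] = P(observe o | state i)."""
    n = len(pi)
    # initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # induction: alpha_{t+1}(i) = (sum_j alpha_t(j) * a_{ji}) * b_i(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(n)) * B[i][o]
                 for i in range(n)]
    # termination: P(O | model) = sum_i alpha_T(i)
    return sum(alpha)

# Hypothetical two-coin model (0 = heads, 1 = tails): a fair coin and a
# coin biased toward heads, with "sticky" transitions between them.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.5, 0.5], [0.9, 0.1]]
likelihood = forward(pi, A, B, [0, 0, 1])
```

Production implementations work in log space or rescale alpha at each step to avoid underflow on long sequences; this sketch omits that for clarity.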

Journal ArticleDOI
27 Mar 2006
TL;DR: The ability to use standard Bluetooth-enabled mobile telephones to measure information access and use in different contexts, recognize social patterns in daily user activity, infer relationships, identify socially significant locations, and model organizational rhythms is demonstrated.
Abstract: We introduce a system for sensing complex social systems with data collected from 100 mobile phones over the course of 9 months. We demonstrate the ability to use standard Bluetooth-enabled mobile telephones to measure information access and use in different contexts, recognize social patterns in daily user activity, infer relationships, identify socially significant locations, and model organizational rhythms.

2,959 citations

Journal ArticleDOI
06 Jan 2006-Science
TL;DR: This work analyzed a dynamic social network comprising 43,553 students, faculty, and staff at a large university, in which interactions between individuals are inferred from time-stamped e-mail headers recorded over one academic year and are matched with affiliations and attributes.
Abstract: Social networks evolve over time, driven by the shared activities and affiliations of their members, by similarity of individuals' attributes, and by the closure of short network cycles. We analyzed a dynamic social network comprising 43,553 students, faculty, and staff at a large university, in which interactions between individuals are inferred from time-stamped e-mail headers recorded over one academic year and are matched with affiliations and attributes. We found that network evolution is dominated by a combination of effects arising from network topology itself and the organizational structure in which the network is embedded. In the absence of global perturbations, average network properties appear to approach an equilibrium state, whereas individual properties are unstable.

1,713 citations

Journal ArticleDOI
05 Apr 2007-Nature
TL;DR: The focus is on networks capturing the collaboration between scientists and the calls between mobile phone users, and it is found that large groups persist for longer if they are capable of dynamically altering their membership, suggesting that an ability to change the group composition results in better adaptability.
Abstract: The rich set of interactions between individuals in society results in complex community structure, capturing highly connected circles of friends, families or professional cliques in a social network. Thanks to frequent changes in the activity and communication patterns of individuals, the associated social and communication network is subject to constant evolution. Our knowledge of the mechanisms governing the underlying community dynamics is limited, but is essential for a deeper understanding of the development and self-optimization of society as a whole. We have developed an algorithm based on clique percolation that allows us to investigate the time dependence of overlapping communities on a large scale, and thus uncover basic relationships characterizing community evolution. Our focus is on networks capturing the collaboration between scientists and the calls between mobile phone users. We find that large groups persist for longer if they are capable of dynamically altering their membership, suggesting that an ability to change the group composition results in better adaptability. The behaviour of small groups displays the opposite tendency-the condition for stability is that their composition remains unchanged. We also show that knowledge of the time commitment of members to a given community can be used for estimating the community's lifetime. These findings offer insight into the fundamental differences between the dynamics of small groups and large institutions.

1,676 citations
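The clique-percolation idea the abstract builds on (k-cliques that share k-1 nodes belong to the same overlapping community) can be sketched with a brute-force toy implementation; this enumerates all k-subsets, so it is illustrative only, not the paper's scalable algorithm:

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Merge k-cliques sharing k-1 nodes into overlapping communities."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    # brute force: every k-subset whose members are pairwise adjacent
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # union-find over cliques that overlap in at least k-1 nodes
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())

# two triangles sharing an edge percolate into one community;
# an isolated triangle stays separate
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
comms = k_clique_communities(edges, 3)
```

Tracking how these communities gain and lose members across time-stamped snapshots is the dynamic analysis the paper performs at scale.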

Frequently Asked Questions (1)
Q1. What are the contributions in "Inferring colocation and conversation networks from privacy-sensitive audio with implications for computational social science" ?

New technologies have made it possible to collect information about social networks as they are acted and observed in the wild, instead of as they are reported in retrospective surveys. These technologies offer opportunities to address many new research questions. With the goal of addressing these questions, this article presents new methods for inferring colocation and conversation networks from privacy-sensitive audio. These methods are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year. The resulting analysis shows that networks derived from colocation and conversation inferences are quite different. This distinction can inform future research in computational social science, especially work that only measures colocation or employs colocation data as a proxy for conversation networks.