Co-existence of prediction and error signals in electrophysiological responses to natural speech

Michael P. Broderick¹ and Edmund C. Lalor¹,²

¹ School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland.
² Department of Neuroscience and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA.

Correspondence: Michael Broderick: brodermi@tcd.ie; Edmund Lalor: edmund_lalor@urmc.rochester.edu
Abstract
Prior knowledge facilitates perception and allows us to interpret our sensory environment. However,
the neural mechanisms underlying this process remain unclear. Theories of predictive coding propose
that feedback connections between cortical levels carry predictions about upcoming sensory events
whereas feedforward connections carry the error between the prediction and the sensory input. Although
predictive coding has gained much ground as a viable mechanism for perception, in the context of spoken language comprehension it lacks empirical support from studies using more naturalistic stimuli. In this study, we
investigated theories of predictive coding using continuous, everyday speech. EEG recordings from
human participants listening to an audiobook were analysed using a 2-stage regression framework. This
tested the effect of top-down linguistic information, estimated using computational language models,
on the bottom-up encoding of acoustic and phonetic speech features. Our results show enhanced
encoding of both semantic predictions and surprising words, based on preceding context. This suggests
that signals pertaining to prediction and error units can be observed in the same electrophysiological
responses to natural speech. In addition, temporal analysis of these signals reveals support for theories of predictive coding that propose that perception is first biased towards what is expected and subsequently towards what is informative.
Significance Statement
Over the past two decades, predictive coding has grown in popularity as an explanatory mechanism for
perception. However, there has been a lack of empirical support for this theory in research studying
natural speech comprehension. We address this issue by developing an analysis framework that tests
the effects of top-down linguistic information on the auditory encoding of continuous speech. Our
results provide evidence for the co-existence of prediction and error signals and support theories of
predictive coding using more naturalistic stimuli.

1. Introduction
A key question in neuroscience centers on how bottom-up sensory inputs combine with top-down prior
knowledge to subserve perception (Miller et al., 1951; Liberman et al., 1967). A popular idea is that
this is achieved through hierarchical Bayesian inference, whereby the brain infers the causes of its
sensory inputs by using an internal model of the world to generate predictions and then comparing those
predictions to incoming sensory input (Knill and Pouget, 2004; Aitchison and Lengyel, 2017). One
prominent mechanistic account known as predictive coding proposes that higher levels in the cortical
hierarchy predict lower-level bottom-up signals via feedback connections, whereas feedforward connections convey only the error between the top-down prediction and the bottom-up sensory signal (Rao and Ballard, 1999; Friston, 2005; Clark, 2013). A core tenet of this theory is the involvement of
two distinct populations of neurons at each cortical level: representation (prediction) units, which
encode predictions based on prior information, and error units, which encode the error (Rao and Ballard,
1999). Given this assumption, one should expect the co-presence of neural signals reflecting
computations from these distinct populations during perception. Indeed, while some evidence exists
(Egner et al., 2010), there is an overall lack of neuroimaging studies providing direct empirical support
for the simultaneous computation of prediction and prediction error. Thus, there is still ongoing debate
as to how prior information informs perception (Egner and Summerfield, 2013; Heilbron and Chait,
2017; Friston, 2018).
Studying speech processing in the brain offers a particularly powerful way to contribute to this debate.
This is because speech perception likely involves predictions across many levels of processing
(Kuperberg and Jaeger, 2016), and because the field of computational linguistics has given us methods
for quantifying how upcoming words may be predicted from their context (Bengio et al., 2003; Mikolov
et al., 2013; Buck et al., 2014; Pennington et al., 2014). In addition, pattern analysis techniques (Haxby,
2001; Kriegeskorte, 2008; Crosse et al., 2016) have helped adjudicate between predictive coding and
competing accounts of Bayesian inference by directly testing the stimulus features that are encoded in
the neural signal. Studies employing such approaches have produced mixed results, in some cases showing evidence of enhanced prediction signals (Kok et al., 2012; Leonard et al., 2016; Broderick et al., 2019) and in other cases showing enhanced error signals (Blank and Davis, 2016; Sohoglu and Davis,
2020). The notion of a two-process model where perception is first biased towards prior knowledge and
later upweights surprising events (Press et al., 2020) has the potential to reconcile these seemingly
contrasting findings and also supports the co-existence of representation and error units.
Our current work aims to provide evidence for both representational and prediction error effects in the
same responses to natural speech. Specifically, we quantify the predictability of words using measures
from language models that have been previously linked to the computation of error and prediction. The
first of these measures, known as surprisal, quantifies the Kullback-Leibler divergence between prior
and posterior probability distributions and thus reflects the degree of belief updating a comprehender
undergoes when processing new incoming words in context (Levy, 2008; Kuperberg and Jaeger, 2016).
It has been linked formally with prediction error in theories of predictive coding (Friston, 2005). The
second measure, known as semantic similarity, is derived by comparing words to their context based
on word embedding models. It has been previously shown to relate to the predictive preactivation of
the semantic features of words, something that is thought to underlie the N400 response (Federmeier
and Kutas, 1999; Ettinger et al., 2016; Broderick et al., 2020). We assessed how these measures of
surprisal and semantic similarity affected the encoding of bottom-up acoustic features using a recently
developed analysis framework (Broderick et al., 2019). Importantly, the measures of semantic similarity
and surprisal assign higher values to more expected and less expected words, respectively, and are only
weakly (and negatively) correlated. However, our results reveal that both measures independently affect
the encoding of acoustic-phonetic information. Additionally, we see that representation and error are
dissociable based on the timing of their effects, with the representational effect preceding the surprisal (i.e., error) effect. This is, again, something that has been hypothesized in the literature
(Press et al., 2020).
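For concreteness, the two measures can be stated formally. The following is a minimal formulation in our own notation; the identity between surprisal and Kullback-Leibler divergence is due to Levy (2008):

```latex
% Surprisal of a word given its context, and its KL-divergence reading
% (Levy, 2008): the divergence is taken over probability distributions
% on possible interpretations of the unfolding sentence.
\[
  S(w_t) = -\log P(w_t \mid w_{1:t-1})
         = D_{\mathrm{KL}}\!\big( P(\cdot \mid w_{1:t}) \,\|\, P(\cdot \mid w_{1:t-1}) \big)
\]
% Semantic similarity, by contrast, is a correlation in embedding space
% between a word's vector and the average vector of the preceding words
% in the sentence:
\[
  \mathrm{sim}(w_t) = \mathrm{corr}\!\Big( \mathbf{v}_{w_t},\;
      \tfrac{1}{t-1}\textstyle\sum_{i=1}^{t-1} \mathbf{v}_{w_i} \Big)
\]
% where v_w denotes the 300-dimensional GloVe embedding of word w.
```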
2. Materials and Methods
2.1 Participants
Data from 19 native English speakers (6 female; aged 19-38 years) who had participated in a previous study (Di Liberto et al., 2015; Broderick et al., 2018) were reanalysed for the present study. The study was undertaken
in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the School
of Psychology at Trinity College Dublin. Each subject provided written informed consent. Subjects
reported no history of hearing impairment or neurological disorder.
2.2 Stimuli and experimental procedure
Subjects undertook 20 trials, each just under 180 seconds in length, where they were presented with an
audio-book version of a popular mid-20th century American work of fiction (Hemingway, 1952), read
by a single male American speaker. The average speech rate was 210 words/minute. The mean length
of each content word was 340 ms, with a standard deviation of 127 ms. Trials were presented in the chronological order of the story. All stimuli were presented diotically at a sampling rate of 44.1 kHz using
Sennheiser HD650 headphones and Presentation software from Neurobehavioural Systems. Testing
was carried out in a dark, sound-attenuated room and subjects were instructed to maintain visual fixation
on a crosshair centred on the screen for the duration of each trial, and to minimise eye blinking and all
other motor activities.

2.3 EEG acquisition and preprocessing
128-channel EEG data were acquired at a rate of 512 Hz using an ActiveTwo system (BioSemi).
Triggers indicating the start of each trial were sent by the stimulus presentation computer and included
in the EEG recordings to ensure synchronization. Offline, the data were bandpass filtered between 1
and 8 Hz using a Chebyshev Type II filter (order 54; cutoffs of 0.5 Hz for high-pass filtering and 8.5 Hz for low-pass filtering). Passband attenuation was set to 1 dB, and stopband attenuation was set to 60 dB (high-pass) and 80 dB (low-pass). After filtering, data were downsampled to 64 Hz (backward modelling) or
128 Hz (forward modelling; see below). To identify channels with excessive noise, the standard
deviation of the time series of each channel was compared with that of the surrounding channels. For
each trial, a channel was identified as noisy if its standard deviation was more than 2.5 times the mean
standard deviation of all other channels or less than the mean standard deviation of all other channels
divided by 2.5. Channels contaminated by noise were reconstructed by spline-interpolating the
surrounding clean channels in EEGLAB (Delorme and Makeig, 2004). The data were then re-referenced
to the global average of all channels.
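As a rough illustration of this pipeline, the sketch below implements the band-pass filtering and noisy-channel detection in Python with SciPy. The zero-phase (forward-backward) filter application, the design route via cheb2ord, and all function names are our assumptions rather than the authors' code; spline interpolation of flagged channels and average re-referencing are handled in EEGLAB, as described above.

```python
# Minimal preprocessing sketch, assuming eeg is a NumPy array of shape
# (n_channels, n_samples) acquired at 512 Hz. Names are illustrative.
import numpy as np
from scipy.signal import cheb2ord, cheby2, sosfiltfilt

FS = 512  # acquisition rate (Hz)

def bandpass_1_to_8(eeg, fs=FS):
    """High-pass then low-pass Chebyshev Type II filtering (zero-phase
    application is an assumption; the text gives only the filter specs)."""
    # High-pass: passband edge 1 Hz, stopband edge 0.5 Hz, 1 dB / 60 dB.
    n_hp, wn_hp = cheb2ord(wp=1.0, ws=0.5, gpass=1.0, gstop=60.0, fs=fs)
    sos_hp = cheby2(n_hp, 60.0, wn_hp, btype='highpass', output='sos', fs=fs)
    # Low-pass: passband edge 8 Hz, stopband edge 8.5 Hz, 1 dB / 80 dB.
    n_lp, wn_lp = cheb2ord(wp=8.0, ws=8.5, gpass=1.0, gstop=80.0, fs=fs)
    sos_lp = cheby2(n_lp, 80.0, wn_lp, btype='lowpass', output='sos', fs=fs)
    return sosfiltfilt(sos_lp, sosfiltfilt(sos_hp, eeg, axis=-1), axis=-1)

def flag_noisy_channels(eeg, ratio=2.5):
    """Flag a channel if its std exceeds ratio times (or falls below
    1/ratio of) the mean std of all other channels, per trial."""
    stds = eeg.std(axis=-1)
    flags = np.zeros(eeg.shape[0], dtype=bool)
    for ch in range(eeg.shape[0]):
        mean_others = np.delete(stds, ch).mean()
        flags[ch] = bool(stds[ch] > ratio * mean_others
                         or stds[ch] < mean_others / ratio)
    return flags
```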
2.4 Stimulus characterisation
Several acoustic and linguistic features were extracted from the speech signal and used as input at
various stages in a 2-stage regression analysis. These features can be categorised into three main groups based on the stage of the analysis in which they are used.
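As orientation for the first regression stage referenced in group 1 below, here is a minimal forward temporal response function (TRF) fit by ridge regression over time lags, in the spirit of the mTRF toolbox (Crosse et al., 2016). The lag window, ridge parameter, and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a forward TRF: a regularised linear mapping from a stimulus
# feature to EEG across a window of time lags. Assumes stim has shape
# (n_samples,) and eeg has shape (n_samples, n_channels).
import numpy as np

def lag_matrix(stim, min_lag, max_lag):
    """Stack time-lagged copies of the stimulus: (n_samples, n_lags)."""
    n = len(stim)
    lags = range(min_lag, max_lag + 1)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:n - lag]
        else:
            X[:n + lag, j] = stim[-lag:]
    return X

def fit_trf(stim, eeg, fs, tmin=-0.1, tmax=0.4, ridge=100.0):
    """Solve (X'X + ridge*I) w = X'y for the TRF weights w."""
    X = lag_matrix(stim, int(tmin * fs), int(tmax * fs))
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)  # (n_lags, n_channels)
```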
1. The first group of features was regressed against the recorded EEG signal using the temporal response function (see below), which learns a linear mapping between stimulus and neural response, or vice versa. Each feature was used to assess the encoding of speech at a different hierarchical level (a code sketch for the acoustic features follows item 1c).
1a. Envelope. The broadband amplitude envelope of the speech signal was calculated using the
absolute value of the Hilbert transform.
1b. Spectrogram. The speech signal was filtered into 16 different frequency bands between 250 Hz and 8 kHz, spaced according to Greenwood's equation (Greenwood, 1961). After filtering, the amplitude envelope was calculated for each band using the absolute value of the Hilbert transform.
1c. Phonetic Features. To create the phonetic feature stimulus, Prosodylab-Aligner software (Gorman et al., 2011; http://prosodylab.cs.mcgill.ca/tools/aligner/) was used. This automatically partitions each word in the story into phonemes from the American English International Phonetic Alphabet (IPA) and performs forced alignment, returning the starting and ending time points for each phoneme. Each phoneme was then mapped to a corresponding set of 19 phonetic features, based on the University of Iowa's phonetics project.
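A sketch of the acoustic features in items 1a and 1b, assuming the audio is a NumPy array at 44.1 kHz. The Greenwood frequency-position constants are the standard human values; the 4th-order Butterworth band filters are our assumption, as the text specifies only the Greenwood band spacing.

```python
# Envelope (1a) and 16-band spectrogram (1b); names are illustrative.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def greenwood_edges(f_lo=250.0, f_hi=8000.0, n_bands=16,
                    A=165.4, a=2.1, k=0.88):
    """Band edges equally spaced along the cochlea via Greenwood's
    frequency-position function F = A * (10**(a * x) - k)."""
    to_pos = lambda f: np.log10(f / A + k) / a     # frequency -> place
    to_freq = lambda x: A * (10.0 ** (a * x) - k)  # place -> frequency
    return to_freq(np.linspace(to_pos(f_lo), to_pos(f_hi), n_bands + 1))

def broadband_envelope(audio):
    """1a: absolute value of the analytic (Hilbert) signal."""
    return np.abs(hilbert(audio))

def spectrogram_16band(audio, fs=44100):
    """1b: Hilbert envelopes of 16 Greenwood-spaced bands, 250 Hz-8 kHz."""
    edges = greenwood_edges()
    bands = [np.abs(hilbert(sosfiltfilt(
                 butter(4, [lo, hi], btype='bandpass', output='sos', fs=fs),
                 audio)))
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.stack(bands)  # shape (16, n_samples)
```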

2. The second group of features estimates higher-level linguistic properties of words and was used in the second stage of the regression analysis (a code sketch follows item 2b).
2a. Semantic Similarity was calculated for each content word in the narrative. It is estimated from word embeddings derived using GloVe (Pennington et al., 2014), which was trained on a large corpus (Common Crawl, https://commoncrawl.org/). Each word is represented as a 300-dimensional vector, where each dimension can be thought of as reflecting some latent linguistic context. A word's similarity index is estimated as the Pearson correlation between the word's vector and the averaged vector of all the preceding words in the sentence (Broderick et al., 2018). Thus, small similarity values signify out-of-context words that, by extension, are unexpected.
2b. Surprisal was calculated using a Markov model trained on the same corpus as GloVe
(Common Crawl). These models, commonly referred to as n-grams, estimate the conditional
probability of the next word in a sequence given the previous n-1 words. We applied a 5-gram
model that was produced using interpolated modified Kneser-Ney smoothing (Chen and
Goodman, 1996; Buck et al., 2014). Unlike semantic similarity, surprisal assigns higher values
to more unexpected words.
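A sketch of the two word-level measures in items 2a and 2b, assuming the 300-dimensional GloVe vectors and an n-gram model returning conditional probabilities are already loaded. Function names are illustrative, and the log base for surprisal is an assumption (it only rescales the measure).

```python
import numpy as np

def semantic_similarity(word_vec, preceding_vecs):
    """2a: Pearson correlation between a content word's GloVe vector and
    the average vector of the preceding words in the sentence."""
    context = np.mean(preceding_vecs, axis=0)
    return np.corrcoef(word_vec, context)[0, 1]

def surprisal(p_word_given_context):
    """2b: negative log of the 5-gram conditional probability
    P(w_t | w_{t-4}, ..., w_{t-1})."""
    return -np.log(p_word_given_context)
```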
3. The third group of stimulus characterisations was also used in the second stage of the regression analysis, acting as nuisance regressors (a code sketch follows item 3c). Their purpose was to absorb any variance in word-level auditory encoding relating to acoustic changes in the speaker's voice.
3a. Envelope Variability was calculated by taking the standard deviation of the speech envelope
over each spoken word. Here, we wished to control for rapid changes in envelope amplitude, as it has been shown that cortical responses increase monotonically with steeper acoustic edges
(Oganian and Chang, 2019).
3b. Relative Pitch was recently shown to be encoded in EEG (Teoh et al., 2019). It quantifies
pitch normalised according to the vocal range of the speaker. Praat software (Boersma and
Weenink, 2000) was used to extract a continuous measure of pitch (absolute pitch). The measure
was then normalized to zero mean and unit standard deviation (z-units) to obtain relative pitch.
3c. Resolvability measures whether the harmonics of a sound can be processed within distinct
filters of the cochlea (resolved) or if they interact within the same filter (unresolved). It has
previously been shown using fMRI that pitch responses in auditory cortex are predominantly
driven by resolved frequency components (Norman-Haignere et al., 2013). Custom-written scripts from an acoustic statistics toolbox accompanying the same study were used to extract a continuous
measure of harmonic resolvability.
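A sketch of the first two nuisance regressors, assuming the broadband envelope, the aligner's word boundaries (in seconds), and a Praat pitch track are already in hand. Resolvability (3c) is omitted because it depends on the acoustic statistics toolbox from Norman-Haignere et al. (2013); names are illustrative.

```python
import numpy as np

def envelope_variability(envelope, onsets, offsets, fs):
    """3a: std of the broadband envelope over each spoken word."""
    return np.array([envelope[int(on * fs):int(off * fs)].std()
                     for on, off in zip(onsets, offsets)])

def relative_pitch(absolute_pitch):
    """3b: z-score the pitch contour against the speaker's own range;
    unvoiced frames (NaN in a Praat pitch track) are excluded from the
    normalisation statistics."""
    voiced = absolute_pitch[~np.isnan(absolute_pitch)]
    return (absolute_pitch - voiced.mean()) / voiced.std()
```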

References

Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137-1155.

Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 134:9-21.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013).

Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532-1543.

Teoh ES, Cappelloni MS, Lalor EC (2019) Prosodic pitch processing is represented in delta-band EEG and is dissociable from the cortical tracking of other acoustic and phonetic features. Eur J Neurosci 50:3831-3842.