Co-existence of prediction and error signals in electrophysiological responses to natural speech

Michael P. Broderick¹ and Edmund C. Lalor¹,²

¹ School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland.
² Department of Neuroscience and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA.

Correspondence: Michael Broderick: brodermi@tcd.ie; Edmund Lalor: edmund_lalor@urmc.rochester.edu
Abstract
Prior knowledge facilitates perception and allows us to interpret our sensory environment. However,
the neural mechanisms underlying this process remain unclear. Theories of predictive coding propose
that feedback connections between cortical levels carry predictions about upcoming sensory events
whereas feedforward connections carry the error between the prediction and the sensory input. Although
predictive coding has gained much ground as a viable mechanism for perception, in the context of spoken language comprehension it lacks empirical support from studies using more naturalistic stimuli. In this study, we
investigated theories of predictive coding using continuous, everyday speech. EEG recordings from
human participants listening to an audiobook were analysed using a 2-stage regression framework. This
tested the effect of top-down linguistic information, estimated using computational language models,
on the bottom-up encoding of acoustic and phonetic speech features. Our results show enhanced
encoding of both semantic predictions and surprising words, based on preceding context. This suggests
that signals pertaining to prediction and error units can be observed in the same electrophysiological
responses to natural speech. In addition, temporal analysis of these signals reveals support for theories of predictive coding that propose that perception is first biased towards what is expected and subsequently towards what is informative.
Significance Statement
Over the past two decades, predictive coding has grown in popularity as an explanatory mechanism for
perception. However, there has been a lack of empirical support for this theory in research studying
natural speech comprehension. We address this issue by developing an analysis framework that tests
the effects of top-down linguistic information on the auditory encoding of continuous speech. Our
results provide evidence for the co-existence of prediction and error signals and support theories of
predictive coding using more naturalistic stimuli.

1. Introduction
A key question in neuroscience centers on how bottom-up sensory inputs combine with top-down prior
knowledge to subserve perception (Miller et al., 1951; Liberman et al., 1967). A popular idea is that
this is achieved through hierarchical Bayesian inference, whereby the brain infers the causes of its
sensory inputs by using an internal model of the world to generate predictions and then comparing those
predictions to incoming sensory input (Knill and Pouget, 2004; Aitchison and Lengyel, 2017). One
prominent mechanistic account known as predictive coding proposes that higher levels in the cortical
hierarchy predict lower-level bottom-up signals via feedback connections, whereas feedforward connections convey only the error between the top-down prediction and the bottom-up sensory signal (Rao and Ballard, 1999; Friston, 2005; Clark, 2013). A core tenet of this theory is the involvement of
two distinct populations of neurons at each cortical level: representation (prediction) units, which
encode predictions based on prior information, and error units, which encode the error (Rao and Ballard,
1999). Given this assumption, one should expect the co-presence of neural signals reflecting
computations from these distinct populations during perception. Indeed, while some evidence exists
(Egner et al., 2010), there is an overall lack of neuroimaging studies providing direct empirical support
for the simultaneous computation of prediction and prediction error. Thus, there is still ongoing debate
as to how prior information informs perception (Egner and Summerfield, 2013; Heilbron and Chait,
2017; Friston, 2018).
Studying speech processing in the brain offers a particularly powerful way to contribute to this debate.
This is because speech perception likely involves predictions across many levels of processing
(Kuperberg and Jaeger, 2016), and because the field of computational linguistics has given us methods
for quantifying how upcoming words may be predicted from their context (Bengio et al., 2003; Mikolov
et al., 2013; Buck et al., 2014; Pennington et al., 2014). In addition, pattern analysis techniques (Haxby,
2001; Kriegeskorte, 2008; Crosse et al., 2016) have helped adjudicate between predictive coding and
competing accounts of Bayesian inference by directly testing the stimulus features that are encoded in
the neural signal. Studies employing such approaches have produced mixed results, in some cases showing evidence of enhanced prediction signals (Kok et al., 2012; Leonard et al., 2016; Broderick et al., 2019) and in other cases showing enhanced error signals (Blank and Davis, 2016; Sohoglu and Davis,
2020). The notion of a two-process model where perception is first biased towards prior knowledge and
later upweights surprising events (Press et al., 2020) has the potential to reconcile these seemingly
contrasting findings and also supports the co-existence of representation and error units.
Our current work aims to provide evidence for both representational and prediction error effects in the
same responses to natural speech. Specifically, we quantify the predictability of words using measures
from language models that have been previously linked to the computation of error and prediction. The
first of these measures, known as surprisal, quantifies the Kullback-Leibler divergence between prior
and posterior probability distributions and thus reflects the degree of belief updating a comprehender
undergoes when processing new incoming words in context (Levy, 2008; Kuperberg and Jaeger, 2016).
It has been linked formally with prediction error in theories of predictive coding (Friston, 2005). The
second measure, known as semantic similarity, is derived by comparing words to their context based
on word embedding models. It has been previously shown to relate to the predictive preactivation of
the semantic features of words, something that is thought to underlie the N400 response (Federmeier
and Kutas, 1999; Ettinger et al., 2016; Broderick et al., 2020). We assessed how these measures of
surprisal and semantic similarity affected the encoding of bottom-up acoustic features using a recently
developed analysis framework (Broderick et al., 2019). Importantly, the measures of semantic similarity
and surprisal assign higher values to more expected and less expected words, respectively, and are only
weakly (and negatively) correlated. However, our results reveal that both measures independently affect
the encoding of acoustic-phonetic information. Additionally, we see that representation and error are
dissociable based on the timing of their effects, with the representational effect preceding the surprisal (i.e., error) effect. This is, again, something that has been hypothesized in the literature
(Press et al., 2020).
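For concreteness, the two measures can be stated formally. The following is a minimal formulation in our own notation; the identity between surprisal and Kullback-Leibler divergence is due to Levy (2008):

```latex
% Surprisal of a word given its context, and its KL-divergence reading
% (Levy, 2008): the divergence is taken over probability distributions
% on possible interpretations of the unfolding sentence.
\[
  S(w_t) = -\log P(w_t \mid w_{1:t-1})
         = D_{\mathrm{KL}}\!\big( P(\cdot \mid w_{1:t}) \,\|\, P(\cdot \mid w_{1:t-1}) \big)
\]
% Semantic similarity, by contrast, is a correlation in embedding space
% between a word's vector and the average vector of the preceding words
% in the sentence:
\[
  \mathrm{sim}(w_t) = \mathrm{corr}\!\Big( \mathbf{v}_{w_t},\;
      \tfrac{1}{t-1}\textstyle\sum_{i=1}^{t-1} \mathbf{v}_{w_i} \Big)
\]
% where v_w denotes the 300-dimensional GloVe embedding of word w.
```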
2. Materials and Methods
2.1 Participants
Data from 19 native English speakers (6 female; aged 19-38 years) who had participated in a previous study (Di Liberto et al., 2015; Broderick et al., 2018) were reanalysed for the present study. The study was undertaken
in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the School
of Psychology at Trinity College Dublin. Each subject provided written informed consent. Subjects
reported no history of hearing impairment or neurological disorder.
2.2 Stimuli and experimental procedure
Subjects undertook 20 trials, each just under 180 seconds in length, where they were presented with an
audio-book version of a popular mid-20th century American work of fiction (Hemingway, 1952), read
by a single male American speaker. The average speech rate was 210 words/minute. The mean length
of each content word was 340 ms, with a standard deviation of 127 ms. Trials were presented in the chronological order of the story. All stimuli were presented diotically at a sampling rate of 44.1 kHz using
Sennheiser HD650 headphones and Presentation software from Neurobehavioural Systems. Testing
was carried out in a dark, sound-attenuated room and subjects were instructed to maintain visual fixation
on a crosshair centred on the screen for the duration of each trial, and to minimise eye blinking and all
other motor activities.

2.3 EEG acquisition and preprocessing
128-channel EEG data were acquired at a rate of 512 Hz using an ActiveTwo system (BioSemi).
Triggers indicating the start of each trial were sent by the stimulus presentation computer and included
in the EEG recordings to ensure synchronization. Offline, the data were bandpass filtered between 1
and 8 Hz using a Chebyshev Type II filter (order 54; cutoffs of 0.5 Hz for high-pass filtering and 8.5 Hz for low-pass filtering). Passband attenuation was set to 1 dB, and stopband attenuation was set to 60 dB (high-pass) and 80 dB (low-pass). After filtering, data were downsampled to 64 Hz (backward modelling) or
128 Hz (forward modelling; see below). To identify channels with excessive noise, the standard
deviation of the time series of each channel was compared with that of the surrounding channels. For
each trial, a channel was identified as noisy if its standard deviation was more than 2.5 times the mean
standard deviation of all other channels or less than the mean standard deviation of all other channels
divided by 2.5. Channels contaminated by noise were reconstructed by spline-interpolating the
surrounding clean channels in EEGLAB (Delorme and Makeig, 2004). The data were then re-referenced
to the global average of all channels.
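As a rough illustration of this pipeline, the sketch below implements the band-pass filtering and noisy-channel detection in Python with SciPy. The zero-phase (forward-backward) filter application, the design route via cheb2ord, and all function names are our assumptions rather than the authors' code; spline interpolation of flagged channels and average re-referencing are handled in EEGLAB, as described above.

```python
# Minimal preprocessing sketch, assuming eeg is a NumPy array of shape
# (n_channels, n_samples) acquired at 512 Hz. Names are illustrative.
import numpy as np
from scipy.signal import cheb2ord, cheby2, sosfiltfilt

FS = 512  # acquisition rate (Hz)

def bandpass_1_to_8(eeg, fs=FS):
    """High-pass then low-pass Chebyshev Type II filtering (zero-phase
    application is an assumption; the text gives only the filter specs)."""
    # High-pass: passband edge 1 Hz, stopband edge 0.5 Hz, 1 dB / 60 dB.
    n_hp, wn_hp = cheb2ord(wp=1.0, ws=0.5, gpass=1.0, gstop=60.0, fs=fs)
    sos_hp = cheby2(n_hp, 60.0, wn_hp, btype='highpass', output='sos', fs=fs)
    # Low-pass: passband edge 8 Hz, stopband edge 8.5 Hz, 1 dB / 80 dB.
    n_lp, wn_lp = cheb2ord(wp=8.0, ws=8.5, gpass=1.0, gstop=80.0, fs=fs)
    sos_lp = cheby2(n_lp, 80.0, wn_lp, btype='lowpass', output='sos', fs=fs)
    return sosfiltfilt(sos_lp, sosfiltfilt(sos_hp, eeg, axis=-1), axis=-1)

def flag_noisy_channels(eeg, ratio=2.5):
    """Flag a channel if its std exceeds ratio times (or falls below
    1/ratio of) the mean std of all other channels, per trial."""
    stds = eeg.std(axis=-1)
    flags = np.zeros(eeg.shape[0], dtype=bool)
    for ch in range(eeg.shape[0]):
        mean_others = np.delete(stds, ch).mean()
        flags[ch] = bool(stds[ch] > ratio * mean_others
                         or stds[ch] < mean_others / ratio)
    return flags
```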
2.4 Stimulus characterisation
Several acoustic and linguistic features were extracted from the speech signal and used as input at
various stages in a 2-stage regression analysis. These features can be categorised into three main groups based on the stage of the analysis in which they are used.
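As orientation for the first regression stage referenced in group 1 below, here is a minimal forward temporal response function (TRF) fit by ridge regression over time lags, in the spirit of the mTRF toolbox (Crosse et al., 2016). The lag window, ridge parameter, and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a forward TRF: a regularised linear mapping from a stimulus
# feature to EEG across a window of time lags. Assumes stim has shape
# (n_samples,) and eeg has shape (n_samples, n_channels).
import numpy as np

def lag_matrix(stim, min_lag, max_lag):
    """Stack time-lagged copies of the stimulus: (n_samples, n_lags)."""
    n = len(stim)
    lags = range(min_lag, max_lag + 1)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:n - lag]
        else:
            X[:n + lag, j] = stim[-lag:]
    return X

def fit_trf(stim, eeg, fs, tmin=-0.1, tmax=0.4, ridge=100.0):
    """Solve (X'X + ridge*I) w = X'y for the TRF weights w."""
    X = lag_matrix(stim, int(tmin * fs), int(tmax * fs))
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)  # (n_lags, n_channels)
```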
1. The first group of features was regressed against the recorded EEG signal using the temporal response function (see below), which learns a linear mapping between stimulus and neural response, or vice versa. Each feature was used to assess the encoding of speech at a different hierarchical level (a code sketch for the acoustic features follows item 1c).
1a. Envelope. The broadband amplitude envelope of the speech signal was calculated using the
absolute value of the Hilbert transform.
1b. Spectrogram. The speech signal was filtered into 16 different frequency bands between 250 Hz and 8 kHz, spaced according to Greenwood's equation (Greenwood, 1961). After filtering, the amplitude envelope was calculated for each band using the absolute value of the Hilbert transform.
1c. Phonetic Features. To create the phonetic feature stimulus, Prosodylab-Aligner software (Gorman et al., 2011; http://prosodylab.cs.mcgill.ca/tools/aligner/) was used. This automatically partitions each word in the story into phonemes from the American English International Phonetic Alphabet (IPA) and performs forced alignment, returning the starting and ending time points for each phoneme. Each phoneme was then mapped to a corresponding set of 19 phonetic features, based on the University of Iowa's phonetics project.
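A sketch of the acoustic features in items 1a and 1b, assuming the audio is a NumPy array at 44.1 kHz. The Greenwood frequency-position constants are the standard human values; the 4th-order Butterworth band filters are our assumption, as the text specifies only the Greenwood band spacing.

```python
# Envelope (1a) and 16-band spectrogram (1b); names are illustrative.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def greenwood_edges(f_lo=250.0, f_hi=8000.0, n_bands=16,
                    A=165.4, a=2.1, k=0.88):
    """Band edges equally spaced along the cochlea via Greenwood's
    frequency-position function F = A * (10**(a * x) - k)."""
    to_pos = lambda f: np.log10(f / A + k) / a     # frequency -> place
    to_freq = lambda x: A * (10.0 ** (a * x) - k)  # place -> frequency
    return to_freq(np.linspace(to_pos(f_lo), to_pos(f_hi), n_bands + 1))

def broadband_envelope(audio):
    """1a: absolute value of the analytic (Hilbert) signal."""
    return np.abs(hilbert(audio))

def spectrogram_16band(audio, fs=44100):
    """1b: Hilbert envelopes of 16 Greenwood-spaced bands, 250 Hz-8 kHz."""
    edges = greenwood_edges()
    bands = [np.abs(hilbert(sosfiltfilt(
                 butter(4, [lo, hi], btype='bandpass', output='sos', fs=fs),
                 audio)))
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.stack(bands)  # shape (16, n_samples)
```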

2. The second group of features estimates higher-level linguistic properties of words and was used in the second stage of the regression analysis (a code sketch follows item 2b).
2a. Semantic Similarity was calculated for each content word in the narrative. It is estimated from word embeddings derived using GloVe (Pennington et al., 2014), which was trained on a large corpus (Common Crawl, https://commoncrawl.org/). Each word is represented as a 300-dimensional vector, where each dimension can be thought of as reflecting some latent linguistic context. A word's similarity index is estimated as the Pearson correlation between the word's vector and the averaged vector of all the preceding words in the sentence (Broderick et al., 2018). Thus, small similarity values signify out-of-context words that, by extension, are unexpected.
2b. Surprisal was calculated using a Markov model trained on the same corpus as GloVe
(Common Crawl). These models, commonly referred to as n-grams, estimate the conditional
probability of the next word in a sequence given the previous n-1 words. We applied a 5-gram
model that was produced using interpolated modified Kneser-Ney smoothing (Chen and
Goodman, 1996; Buck et al., 2014). Unlike semantic similarity, surprisal assigns higher values
to more unexpected words.
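A sketch of the two word-level measures in items 2a and 2b, assuming the 300-dimensional GloVe vectors and an n-gram model returning conditional probabilities are already loaded. Function names are illustrative, and the log base for surprisal is an assumption (it only rescales the measure).

```python
import numpy as np

def semantic_similarity(word_vec, preceding_vecs):
    """2a: Pearson correlation between a content word's GloVe vector and
    the average vector of the preceding words in the sentence."""
    context = np.mean(preceding_vecs, axis=0)
    return np.corrcoef(word_vec, context)[0, 1]

def surprisal(p_word_given_context):
    """2b: negative log of the 5-gram conditional probability
    P(w_t | w_{t-4}, ..., w_{t-1})."""
    return -np.log(p_word_given_context)
```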
3. The third group of stimulus characterisations was also used in the second stage of the regression analysis, acting as nuisance regressors (a code sketch follows item 3c). Their purpose was to absorb any variance in word-level auditory encoding relating to acoustic changes in the speaker's voice.
3a. Envelope Variability was calculated by taking the standard deviation of the speech envelope
over each spoken word. Here, we wished to control for rapid changes in envelope amplitude, as it has been shown that cortical responses increase monotonically with steeper acoustic edges
(Oganian and Chang, 2019).
3b. Relative Pitch was recently shown to be encoded in EEG (Teoh et al., 2019). It quantifies
pitch normalised according to the vocal range of the speaker. Praat software (Boersma and
Weenink, 2000) was used to extract a continuous measure of pitch (absolute pitch). The measure
was then normalized to zero mean and unit standard deviation (z-units) to obtain relative pitch.
3c. Resolvability measures whether the harmonics of a sound can be processed within distinct
filters of the cochlea (resolved) or if they interact within the same filter (unresolved). It has
previously been shown using fMRI that pitch responses in auditory cortex are predominantly
driven by resolved frequency components (Norman-Haignere et al., 2013). Custom-written scripts from an acoustic statistics toolbox accompanying the same study were used to extract a continuous
measure of harmonic resolvability.
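A sketch of the first two nuisance regressors, assuming the broadband envelope, the aligner's word boundaries (in seconds), and a Praat pitch track are already in hand. Resolvability (3c) is omitted because it depends on the acoustic statistics toolbox from Norman-Haignere et al. (2013); names are illustrative.

```python
import numpy as np

def envelope_variability(envelope, onsets, offsets, fs):
    """3a: std of the broadband envelope over each spoken word."""
    return np.array([envelope[int(on * fs):int(off * fs)].std()
                     for on, off in zip(onsets, offsets)])

def relative_pitch(absolute_pitch):
    """3b: z-score the pitch contour against the speaker's own range;
    unvoiced frames (NaN in a Praat pitch track) are excluded from the
    normalisation statistics."""
    voiced = absolute_pitch[~np.isnan(absolute_pitch)]
    return (absolute_pitch - voiced.mean()) / voiced.std()
```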

References

Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137-1155.

Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 134:9-21.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013).

Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532-1543.

Teoh ES, Cappelloni MS, Lalor EC (2019) Prosodic pitch processing is represented in delta-band EEG and is dissociable from the cortical tracking of other acoustic and phonetic features. Eur J Neurosci 50:3831-3842.