
HAL Id: hal-01923278
https://hal.archives-ouvertes.fr/hal-01923278
Submitted on 15 Nov 2018
Transfer learning: a Riemannian geometry framework
with applications to Brain-Computer Interfaces
Paolo Zanini, Marco Congedo, Christian Jutten, Salem Said, Yannick
Berthoumieu
To cite this version:
Paolo Zanini, Marco Congedo, Christian Jutten, Salem Said, Yannick Berthoumieu. Transfer learning: a Riemannian geometry framework with applications to Brain-Computer Interfaces. IEEE Transactions on Biomedical Engineering, Institute of Electrical and Electronics Engineers, 2018, 65 (5), pp. 1107-1116. doi:10.1109/TBME.2017.2742541. hal-01923278

Transfer learning: a Riemannian geometry
framework with applications to Brain-Computer
Interfaces
Paolo Zanini, Marco Congedo, Christian Jutten, Salem Said, and Yannick Berthoumieu
Abstract—Objective: This paper tackles the problem of transfer learning in the context of EEG-based Brain-Computer Interface (BCI) classification. In particular, the problems of cross-session and cross-subject classification are considered. These problems concern the ability to use data from previous sessions or from a database of past users to calibrate and initialize the classifier, allowing a calibration-less BCI mode of operation. Methods: Data are represented using spatial covariance matrices of the EEG signals, exploiting the recent successful techniques based on the Riemannian geometry of the manifold of Symmetric Positive Definite (SPD) matrices. Cross-session and cross-subject classification can be difficult, due to the many changes intervening between sessions and between subjects, including physiological, environmental, as well as instrumental changes. Here we propose to affine transform the covariance matrices of every session/subject in order to center them with respect to a reference covariance matrix, making data from different sessions/subjects comparable. Then, classification is performed both using a standard Minimum Distance to Mean (MDM) classifier and through a probabilistic classifier recently developed in the literature, based on a density function (mixture of Riemannian Gaussian distributions) defined on the SPD manifold. Results: The improvements in classification performance achieved by introducing the affine transformation are documented with the analysis of two BCI datasets. Conclusion and significance: Through the proposed affine transformation, we make data from different sessions and subjects comparable, providing a significant improvement in the BCI transfer learning problem.

Index Terms—Brain-Computer Interface, electroencephalography, covariance matrices, Riemannian geometry, mixtures of Gaussians.
P. Zanini is at Gipsa-Lab, Université Grenoble Alpes, France, and IMS, Université de Bordeaux, France (e-mail: paolo.zanini@gipsa-lab.fr).
M. Congedo and C. Jutten are at Gipsa-Lab, Université Grenoble Alpes, France (e-mail: marco.congedo@gipsa-lab.fr; christian.jutten@gipsa-lab.fr).
S. Said and Y. Berthoumieu are at Univ. Bordeaux, Bordeaux INP, CNRS, IMS, UMR 5218, 33400 Talence, France (e-mail: salem.said@ims-bordeaux.fr; yannick.berthoumieu@ims-bordeaux.fr).
Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

I. INTRODUCTION

A Brain-Computer Interface (BCI) is a system capable of predicting or classifying cognitive states and intentions of the user through the analysis of neurophysiological signals [24], [32]. Historically, BCIs have been developed to allow severely paralyzed people to communicate or interact with their environment without relying on the normal muscular or peripheral nerve outputs [8]. More recently, BCIs have also been proposed for healthy people, for instance in driving, forensics, or gaming applications [11], [20], [29]. Several neurophysiological signals can be used for a BCI, acquired either invasively or semi-invasively, e.g., through electrodes implanted into the grey matter or sub-durally. Most BCIs, however, make use of non-invasive neuroimaging modalities, such as near-infrared spectroscopy and, especially, electroencephalography (EEG), which suit both clinical and healthy populations. In this paper we focus on EEG-based BCIs.
The standard classification technique consists of two operational stages [9], [18]. First, EEG signals of a training set are transformed through frequency and/or spatial filters in order to extract discriminant features [8], [16]. A very popular filter in this stage is the Common Spatial Pattern (CSP) [18], [19]. Second, the features enter a machine learning algorithm in order to compute a decision function for performing classification on the test set. This is done by supervised techniques like, for instance, Linear Discriminant Analysis (LDA) [9].

A different approach was presented in [2], where classification is performed using the signal covariance matrices as the features of interest. Covariance matrices do not belong to a Euclidean space; instead, they belong to the smooth Riemannian manifold of Symmetric Positive Definite (SPD) matrices [5]. Hence, in [2], the properties of the SPD manifold are used to perform BCI classification directly on the manifold, as illustrated in subsection II-D. In this paper we consider two separate improvements with respect to the method described in [2]. The first improvement relates to the classification techniques. In [2] the authors used a basic classifier, named Minimum Distance to Mean (MDM), which takes into account distances on the manifold between the observations and some reference points of the classes, known as centers of mass, means, or barycenters. Here we introduce a probabilistic classifier, modeling the class probability distributions and exploiting the Riemannian Gaussian and mixture of Gaussian distributions introduced in [34] and applied to EEG classification in [37].

The second improvement relates to the problem of transfer learning [30]. In the machine learning field, transfer learning is defined as the ability to use previous knowledge as features in a new task/domain related to the previous one. Some examples of transfer learning applied to the BCI problem can be found in [15], [21], [27] and [36]. In this paper we focus specifically on the problem of cross-session and cross-subject BCI learning.
A classical BCI requires a calibration stage at each run, even for a known user. The calibration stage, however short, is inconvenient both for patients, because it wastes part of their limited attention, and for the general public, which is usually unwilling to undergo repeated calibration sessions. As proposed in [12], a BCI should be able to calibrate on-line while it is being used. The problem is then to provide a workable initialization, that is, one that allows the operation of the BCI at the very beginning of the session, even if suboptimal. For a new user, a database of past users can be considered to initialize the classifier. This form of learning is referred to as cross-subject learning. From the second usage on, past data from previous sessions of the user can be employed. This is referred to as cross-session learning. Cross-session learning is known to be a difficult task due to several changes intervening in between the sessions, including physiological, environmental, as well as instrumental changes (e.g., electrode positioning and impedance). Even more difficult is cross-subject learning, because the spatial and temporal configuration of brain dipolar sources is subject to substantial individual variability. In the Riemannian framework the cross-session and cross-subject changes can be understood as geometric transformations of the covariance matrices. In this work we will refer to this geometric transformation as a “shift”, although we should keep in mind that a transformation may entail more than a simple displacement on the manifold.
A first attempt to solve the shift problem is described in [33]; however, this work does not consider the structure of the covariance matrix manifold. In [3], instead, the authors introduce a way to solve the shift problem in a Riemannian framework for the cross-session situation; however, this approach depends on the order of the tasks performed during an experiment and on the (unknown) structure of the classes in the classification problem. In this paper we develop an idea similar to the one presented in [33], but in a Riemannian framework. Our approach does not depend on the (unknown) label sequence of the observations obtained during the experiment. We assume that different source configurations and electrode positions induce shifts of covariance matrices with respect to a reference (resting) state, but that when the brain is engaged in a specific task, covariance matrices move over the SPD manifold in the same direction. This assumption allows a workable model and a simple solution thanks to the congruence invariance property of SPD matrices (which we will describe in subsection II-A). We will center the covariance matrices of every session/subject with respect to a reference covariance matrix, so that what we observe is only the displacement with respect to the reference state due to the task. We estimate a reference matrix for every session, different between sessions and between subjects. Then, we perform a congruent transformation of our data using this reference matrix. In this way, observations belonging to the same session and subject do not change their relative distances and geometric structure. However, since the reference matrix varies among sessions and among subjects, these data are moved in the manifold in different directions and, if the reference matrix is chosen accurately, data from different sessions/subjects become comparable. As we will show with the analysis of two BCI data sets, this procedure provides an efficient initialization for cross-session and cross-subject classification problems.
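As an illustration, a minimal sketch of this centering step, assuming the congruent map takes the usual form $P \mapsto R^{-1/2} P R^{-1/2}$, which sends the reference matrix $R$ to the identity (the estimation of $R$ is part of the method described in Section IV); the function and variable names below are ours:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def center_covariances(covs, reference):
    """Map each SPD matrix P to R^{-1/2} P R^{-1/2}, with R the
    session/subject reference matrix. R itself is sent to the identity,
    and relative distances within a session are preserved by the
    congruence invariance property (see subsection II-A)."""
    r = fractional_matrix_power(reference, -0.5)  # symmetric, so r.T == r
    return np.array([r @ P @ r for P in covs])
```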
In the EEG-based BCI literature, different kinds of tasks can be used to design a BCI (see [12] for an exhaustive description). In this work we analyze two different paradigms in order to widen the scope of our analysis. The first one relates to a Motor Imagery (MI) paradigm and the second one to an Event-Related Potential (ERP) paradigm. For the first dataset we analyze nine subjects, each one performing two sessions, and we evaluate the accuracy for cross-session and cross-subject classification. We obtain significant improvements by using the proposed procedure, especially for cross-subject classification, where we can increase the performance by 30% in some cases. For the second dataset we analyze 17 subjects and we evaluate the precision for cross-subject classification. Also in this case we obtain substantial improvements by introducing our procedure. Furthermore, for both datasets, we discuss the situations where the introduction of a probabilistic classifier can result in further improvements.
The paper is organized as follows. In Section II, basic concepts of Riemannian geometry are introduced. In Section III, the two BCI paradigms are described in detail, focusing in particular on how to build, in each case, the SPD matrices to be used in a Riemannian framework. Then, in Section IV, we describe the proposed Riemannian transfer learning methods. In Section V we present the results obtained with the two datasets analyzed. Finally, we conclude our work in Section VI.
II. ELEMENTS OF RIEMANNIAN GEOMETRY

In this section we present some basic properties of the space of SPD matrices, introduce a probability distribution on this space, and define classification rules for SPD matrices.
A. Manifold of SPD matrices: basic concepts

We start by introducing $M(n)$ and $S(n)$ as the vector space of $n \times n$ square matrices and the vector space in $M(n)$ of symmetric $n \times n$ square matrices, respectively. Specifically, $M(n) = \{M \in \mathbb{R}^{n \times n}\}$, while $S(n) = \{S \in M(n) : S = S^T\}$. The set of SPD matrices $P(n) = \{P \in S(n) : u^T P u > 0 \;\forall u \in \mathbb{R}^n, u \neq 0\}$ is an open subset of $S(n)$; in particular, it is an open convex cone of dimension $n(n+1)/2$. $P(n)$ is the space of covariance matrices and it is our space of interest. When endowed with the Fisher-Rao metric [5], $P(n)$ turns out to be a smooth Riemannian manifold with non-positive curvature. This means that for every point $P \in P(n)$, in the tangent space $T_P$ (which in this case can be identified with $S(n)$), we define a scalar product which varies smoothly with $P$. The local inner product and, as a consequence, the local norm are defined as

$$\langle U, V \rangle_P = \mathrm{tr}(P^{-1} U P^{-1} V), \qquad \|U\|_P^2 = \langle U, U \rangle_P, \qquad (1)$$

respectively, where $U, V \in S(n)$. Through the natural metric (1), a distance between two points $P_1, P_2 \in P(n)$ can be defined as the length of the unique shortest curve (called geodesic) connecting $P_1$ and $P_2$ [5]:

$$\delta(P_1, P_2) = \big\|\log\big(P_1^{-1/2} P_2 P_1^{-1/2}\big)\big\|_F = \left(\sum_{i=1}^{n} \log^2 \lambda_i\right)^{1/2}, \qquad (2)$$

with $\|\cdot\|_F$ the Frobenius norm and $\lambda_1, \ldots, \lambda_n$ the eigenvalues of $P_1^{-1/2} P_2 P_1^{-1/2}$ (or, equivalently, of $P_1^{-1} P_2$; the indices 1 and 2 can be permuted since $\delta(\cdot,\cdot)$ is symmetric).
The Riemannian distance $\delta(\cdot,\cdot)$ has two important invariances:

i. $\delta(P_1^{-1}, P_2^{-1}) = \delta(P_1, P_2)$;
ii. $\delta(C^T P_1 C, C^T P_2 C) = \delta(P_1, P_2) \;\forall C \in GL(n)$,

with $GL(n) = \{C \in M(n) : C \text{ invertible}\}$ the set of invertible matrices. Property ii, called congruence invariance, means that the distance between two SPD matrices is invariant with respect to a change of reference, i.e., to any linear invertible transformation in the data (recordings) space. This property will be particularly important in the following.
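As an illustration, the distance (2) can be obtained from the generalized eigenvalues of the pair $(P_2, P_1)$; a minimal NumPy/SciPy sketch (names ours):

```python
import numpy as np
from scipy.linalg import eigvalsh

def riemannian_distance(P1, P2):
    """Affine-invariant Riemannian distance (2) between SPD matrices.

    The eigenvalues of P1^{-1} P2 are computed as the generalized
    eigenvalues of the pencil (P2, P1), avoiding an explicit inverse.
    """
    lam = eigvalsh(P2, P1)  # solves P2 u = lam P1 u
    return np.sqrt(np.sum(np.log(lam) ** 2))

# Congruence invariance (property ii): for any invertible C,
# riemannian_distance(C.T @ P1 @ C, C.T @ P2 @ C) equals
# riemannian_distance(P1, P2) up to numerical precision.
```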
B. Center of mass of a set of SPD matrices

The simplest statistical descriptor of a set of objects is the mean value, which is meant to provide a suitable representative of the set. The most familiar mean is the arithmetic mean. It has an important variational characterization: given a set $P_1, \ldots, P_N$ of SPD matrices, the arithmetic mean $A(P_1, \ldots, P_N)$ is the point $P$ which minimizes the sum of squared Euclidean distances $d_e(\cdot,\cdot)$:

$$A(P_1, \ldots, P_N) = \arg\min_{P \in P(n)} \sum_{i=1}^{N} d_e^2(P_i, P). \qquad (3)$$

Similarly, it has been shown that the Riemannian distance can be used to define a geometric mean, or center of mass, of a set of SPD matrices through a variational approach [6]. The center of mass $G(P_1, \ldots, P_N)$ is defined as the point of the manifold satisfying

$$G(P_1, \ldots, P_N) = \arg\min_{P \in P(n)} \sum_{i=1}^{N} \delta^2(P_i, P), \qquad (4)$$

with $\delta(\cdot,\cdot)$ defined in (2). In the literature, (4) is often called the Cartan/Fréchet/Karcher mean [5], [6], [22]. Since $P(n)$ is a Riemannian manifold of non-positive curvature, existence and uniqueness of the Riemannian mean can be proved [1], [28]. However, an explicit solution exists only for $N = 2$, where it coincides with the midpoint of the geodesic connecting the two SPD matrices of the set. For $N > 2$ a solution can be found iteratively, and several algorithms following different approaches have been developed in the literature [22]. Some of them seek the exact value through numerical procedures such as deterministic line search [17], [26] or simple and stochastic gradient descent [10], [31]. Other, faster and computationally lighter approaches look for a suitable approximation of the center of mass; see for instance [6], [13], [14].
An important invariance property of the center of mass is

$$G(C^T P_1 C, \ldots, C^T P_N C) = C^T G(P_1, \ldots, P_N) C \quad \forall C \in GL(n),$$

inherited from the congruence invariance of the Riemannian distance mentioned above. This result means that the center of mass is shifted through the same affine transformation as the matrices of the set.
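For concreteness, a minimal NumPy sketch (names ours) of the classical fixed-point iteration for (4), one instance of the gradient-descent approaches cited above: the points are projected to the tangent space at the current estimate, averaged there, and mapped back until the mean tangent vector vanishes.

```python
import numpy as np

def _powm(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w ** p) @ V.T

def _logm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def _expm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def karcher_mean(covs, tol=1e-8, max_iter=50):
    """Center of mass (4) of a set of SPD matrices (fixed-point iteration)."""
    G = np.mean(covs, axis=0)  # arithmetic mean (3) as initialization
    for _ in range(max_iter):
        G_s, G_is = _powm(G, 0.5), _powm(G, -0.5)
        # mean of the Riemannian logarithms of the points at G
        T = np.mean([_logm(G_is @ P @ G_is) for P in covs], axis=0)
        G = G_s @ _expm(T) @ G_s  # exponential map back to the manifold
        if np.linalg.norm(T) < tol:
            break
    return G
```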
C. Mixtures of Gaussian distributions on the manifold of SPD matrices

Distance and center of mass are geometric concepts concerning the properties of the manifold of SPD matrices, but they do not involve any probabilistic assumptions on a sample of SPD matrices. To consider a probabilistic model we introduce a class of probability distributions on the space $P(n)$, called Riemannian Gaussian distributions and defined in [34]. Such a distribution is denoted $G(\bar{P}, \sigma)$ and depends on two parameters, $\bar{P} \in P(n)$ and $\sigma > 0$. It is defined by its probability density function

$$f(P \mid \bar{P}, \sigma) = \frac{1}{\zeta(\sigma)} \exp\left(-\frac{\delta^2(P, \bar{P})}{2\sigma^2}\right), \qquad (5)$$

where $\zeta(\sigma)$ is a normalization function. In [34] it has been shown that, given $P_1, \ldots, P_N$ i.i.d. from (5), the Maximum Likelihood Estimator (MLE) of $\bar{P}$ coincides with the center of mass (4). For the MLE of $\sigma$, an efficient procedure is presented in [37]. If we consider only the Gaussian distribution, we are not able to describe a wide range of real problems. In the classical Euclidean framework, mixtures of Gaussians are considered in order to accommodate several distribution shapes [34]. In the Riemannian framework this is also possible in a straightforward way. A mixture of Riemannian Gaussian distributions is a distribution on $P(n)$ whose density function can be written as

$$f(P) = \sum_{m=1}^{M} w_m f(P \mid \bar{P}_m, \sigma_m), \qquad (6)$$

with $w_1, \ldots, w_M$ non-negative weights summing to 1. The parameters of (6) can be estimated, for instance, through an Expectation-Maximization (EM) algorithm, as described in [34]. This class of distributions will be used to build a probabilistic classifier for data in $P(n)$, as described in the next subsection.
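For illustration, evaluating (6) could look as follows (a sketch, not the paper's code); `riemannian_distance` is the sketch from subsection II-A, and `zeta` is assumed to be a user-supplied numerical evaluation of $\zeta(\sigma)$, which has no simple closed form on $P(n)$ (see [34], [37]):

```python
import numpy as np

def mixture_density(P, weights, centers, sigmas, zeta):
    """Mixture of Riemannian Gaussians, eq. (6), evaluated at the SPD
    matrix P. `zeta(sigma)` must numerically evaluate the normalization
    constant of eq. (5)."""
    return sum(
        w * np.exp(-riemannian_distance(P, Pm) ** 2 / (2.0 * s ** 2)) / zeta(s)
        for w, Pm, s in zip(weights, centers, sigmas)
    )
```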
D. Classification techniques in the manifold of SPD matrices

In [2] the authors proposed a classification procedure based on the Minimum Distance to Mean (MDM) classifier, which is defined as follows: given $K$ classes and a training phase where the centers of mass $\hat{C}(k)$ of the classes ($k = 1, \ldots, K$) are estimated, a new observation $C_i$ is assigned to class $\hat{k}$ according to the classification rule

$$\hat{k} = \arg\min_{k \in \{1,\ldots,K\}} \{d_R(C_i, \hat{C}(k))\}, \qquad (7)$$

with $d_R(\cdot,\cdot)$ the Riemannian distance (2). This rule takes into consideration the Riemannian distance of the new observation to the centers of mass, ignoring information on the dispersion of the groups, encoded by the parameter $\sigma$ in the Riemannian Gaussian distribution (5). The principle of Bayesian classification can be used by exploiting such a distribution. In this case, the classification rule based on the a posteriori distribution reads

$$\hat{k} = \arg\min_{k \in \{1,\ldots,K\}} \left\{\log \zeta(\hat{\sigma}(k)) + \frac{d_R^2(C_i, \hat{C}(k))}{2\hat{\sigma}^2(k)}\right\}, \qquad (8)$$

where $\hat{\sigma}(k)$ is the MLE of the dispersion parameter of the $k$-th class [37]. Of course, if the $\hat{\sigma}(k)$ coincide for all classes, (8) reduces to (7). In order not to be limited to a simple class of distributions, we can consider mixtures of Gaussians (6), updating the Bayesian classification rule accordingly. In this paper we consider a number of mixture components $M$ varying from 2 to 4.
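A compact sketch (ours) of rules (7) and (8), reusing `riemannian_distance` and `karcher_mean` from the sketches above; as before, `log_zeta` is assumed to be supplied as a numerical routine for $\log \zeta(\sigma)$:

```python
import numpy as np

def fit_mdm(covs, labels):
    """Training phase: one center of mass per class.
    covs: N x n x n array; labels: length-N array of class labels."""
    return {k: karcher_mean(covs[labels == k]) for k in np.unique(labels)}

def predict_mdm(centers, C):
    """MDM rule (7): assign C to the nearest class center."""
    return min(centers, key=lambda k: riemannian_distance(C, centers[k]))

def predict_bayes(centers, sigmas, log_zeta, C):
    """Bayesian rule (8), with per-class dispersions sigmas[k]."""
    return min(
        centers,
        key=lambda k: log_zeta(sigmas[k])
        + riemannian_distance(C, centers[k]) ** 2 / (2.0 * sigmas[k] ** 2),
    )
```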
III. DATA

We analyze two different EEG-based BCI datasets, related to the MI and ERP frameworks. The way to build SPD matrices differs between the two cases and is described in subsections III-A and III-B, respectively. Then, in subsection III-C, we show how cross-session and cross-subject classification can be problematic, exploiting a visualization technique for high-dimensional data named t-distributed Stochastic Neighbor Embedding (t-SNE) [35].
A. Motor Imagery: data construction

The analyzed dataset is the one from the BCI competition [25], already analyzed in [2], [18]. It contains EEG data from nine subjects performing four kinds of motor imagery (right hand, left hand, foot, and tongue imagined movements). A total of 576 trials per subject are available, each trial corresponding to a movement (balanced experiment, i.e., 144 trials per class). Half of the trials (288) are obtained during the first session, and the other half during a second session. For each trial $l$ we register the centered EEG signal $X_l \in \mathbb{R}^{n \times T}$, where $n$ is the number of electrodes and $T$ the number of sample points of the time window considered to evaluate the sample covariance, in this case from 0.5 to 2.5 seconds after the stimulus. Then we use for the analysis the empirical covariance matrix defined as

$$C_{X_l} = \frac{1}{T-1} X_l X_l^T.$$

In this experiment signals are recorded using 22 electrodes ($n = 22$), hence covariance matrices here belong to $P(22)$. As usual with motor imagery data, before computing covariance matrices, EEG signals are band-pass filtered by a 5th-order Butterworth filter in the 8–30 Hz frequency band.
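A sketch (ours) of this per-trial pipeline; the 250 Hz sampling rate is an assumption based on the standard description of this competition dataset, not stated in the text, and the zero-phase (forward-backward) filtering is an implementation choice:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def mi_trial_covariance(X, fs=250.0, band=(8.0, 30.0)):
    """Sample covariance (a point of P(22)) of one MI trial X (n x T).

    The trial is band-pass filtered with a 5th-order Butterworth filter
    (applied forward-backward, i.e. zero-phase) and channel-wise
    centered before computing C = X X^T / (T - 1).
    """
    sos = butter(5, band, btype="bandpass", fs=fs, output="sos")
    Xf = sosfiltfilt(sos, X, axis=-1)
    Xf -= Xf.mean(axis=-1, keepdims=True)
    return (Xf @ Xf.T) / (Xf.shape[-1] - 1)
```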
B. ERP: data construction

This dataset comes from a Brain Invaders experiment carried out at GIPSA-lab in Grenoble, France [11]. Subjects watch a screen with 36 aliens flashing alternately. They are requested to mentally count the number of flashes of a specific (known) target alien. Whenever the target alien flashes, this experiment generates in the EEG signals an Event-Related Potential (ERP) named P300 [11]. The main goal is to detect the target trials from the EEG signals. Thus, we have two classes in this experiment: P300 signals (target class) and normal signals (non-target class). In this framework we cannot simply consider the covariance matrices $C_{X_l}$. Indeed, if we randomly shuffle the time instants of a specific trial, the estimate of its covariance matrix does not change, and neither does the classification result. Since temporal information is essential to detect an ERP, we augment the trial by integrating a component related to the temporal profile of the ERP event considered, following the procedure described in [4] and [23]. Specifically, we consider the average ERP response

$$E = \frac{1}{|K_+|} \sum_{l \in K_+} X_l \in \mathbb{R}^{n \times T},$$

where $K_+$ is the group of target trials (ERP in this case). Then we build an augmented trial signal matrix $\tilde{X}_l$, defined as

$$\tilde{X}_l = \begin{bmatrix} E \\ X_l \end{bmatrix} \in \mathbb{R}^{2n \times T},$$

and then we consider the augmented covariance matrix $\tilde{C}_{\tilde{X}_l}$ of dimension $2n \times 2n$:

$$\tilde{C}_{\tilde{X}_l} = \begin{bmatrix} C_E & C_{EX_l} \\ C_{X_l E} & C_{X_l} \end{bmatrix}.$$

Relevant information for distinguishing a target from a non-target trial is embedded in the block $C_{EX_l}$ (and in its transpose $C_{X_l E}$). In these blocks, entries will be far from zero only for target trials, since only the time series of target trials are correlated with the average ERP $E$. Thus, on the SPD manifold, augmented covariance matrices for target trials will be far from the augmented covariance matrices for non-target trials. Notice that if we randomly shuffle the time instants of a specific trial, the augmented covariance matrix does change, which means that we have effectively embedded the temporal information into these matrices. A training phase is needed to build the average ERP response. In this experiment we consider 17 subjects, with a number of trials varying from one subject to another, ranging from 500 to 750. EEG signals are recorded at a frequency of 512 Hz using 13 electrodes (i.e., $n = 13$), hence covariance matrices here belong to $P(26)$. Every trial is recorded for a period of one second after the stimulus (the flash). Thus, augmented covariance matrices are estimated using 512 observations.
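A sketch (ours) of the augmented covariance construction, assuming the average target response $E$ has already been estimated on training data:

```python
import numpy as np

def erp_augmented_covariance(X, E):
    """Augmented covariance (a point of P(26) when n = 13) of one ERP trial.

    X and E are n x T arrays (single trial and average target response).
    The cross-blocks C_EX / C_XE of the resulting 2n x 2n covariance carry
    the correlation between the trial and the P300 template."""
    X_aug = np.vstack([E, X])                           # 2n x T
    X_aug = X_aug - X_aug.mean(axis=-1, keepdims=True)  # center each row
    return (X_aug @ X_aug.T) / (X_aug.shape[-1] - 1)
```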
C. Data visualization using t-SNE

The visualization technique called t-SNE [35] represents high-dimensional data by mapping each point to a location in a 2- or 3-dimensional space, while optimizing the pairwise distances in the reduced space with respect to those in the original manifold. In our case we aim to represent each covariance matrix as a point in a 2-dimensional space in order to appreciate the effect of the cross-session and cross-subject shift.
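One way to produce such a map (a sketch under our own assumptions, not the paper's code) is to feed t-SNE the matrix of pairwise Riemannian distances (2) as a precomputed metric, reusing `riemannian_distance` from subsection II-A:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(covs, random_state=0):
    """2-D embedding of a set of SPD matrices: pairwise Riemannian
    distances are precomputed and handed to t-SNE as the metric."""
    N = len(covs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = riemannian_distance(covs[i], covs[j])
    # 'precomputed' metric requires a random (not PCA) initialization
    return TSNE(n_components=2, metric="precomputed",
                init="random", random_state=random_state).fit_transform(D)
```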
In Figures 1 and 5 the data from the MI experiment are shown. In each plot of Figure 1, data for the two sessions are depicted (circles for session 1 and crosses for session 2), with colors identifying the classes. In Figure 5 a more detailed representation of subject 9 is depicted, with plots divided by class. We can observe that data relative to session 2 are shifted with respect to session 1, for every subject. This means that, in
