Open Access · Book Chapter · DOI


Cogito componentiter ergo sum
Lars Kai Hansen and Ling Feng
Informatics and Mathematical Modelling,
Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
lkh,lf@imm.dtu.dk, www.imm.dtu.dk
Abstract. Cognitive component analysis (COCA) is defined as the process of unsupervised grouping of data such that the ensuing group structure is well-aligned with that resulting from human cognitive activity. We present evidence that independent component analysis of abstract data such as text, social interactions, music, and speech leads to low level cognitive components.
1 Introduction
During evolution, human and animal visual, auditory, and other primary sensory systems have adapted to a broad ecological ensemble of natural stimuli. This long, on-going adaptation process has resulted in representations in human and animal perceptual systems which closely resemble the information-theoretically optimal representations obtained by independent component analysis (ICA); see e.g., [1] on visual contrast representation, [2] on visual features involved in color and stereo processing, and [3] on representations of sound features. For a general discussion consult also the textbook [4]. The human perceptual system can model complex multi-agent scenery. Human cognition uses a broad spectrum of cues for analyzing perceptual input and separating individual signal-producing agents, such as speakers, gestures, affections etc. Humans seem able to readily adapt strategies from one perceptual domain to another, and furthermore to apply these information processing strategies, such as object grouping, to environments that are both more abstract and more complex than those present during evolution. Given our present, and rather detailed, understanding of the ICA-like representations in primary sensory systems, it seems natural to pose the question: Are such information optimal representations rooted in independence also relevant for modeling higher cognitive functions? We are currently pursuing a research programme trying to understand the limitations of the ecological hypothesis for higher level cognitive processes, such as grouping abstract objects, navigating social networks, understanding multi-speaker environments, and understanding the representational differences between self and environment.
Wagensberg has pointed to the importance of independence for successful 'life forms' [5]:

    A living individual is part of the world with some identity that tends to
    become independent of the uncertainty of the rest of the world

Thus natural selection favors innovations that increase independence of the agent in the face of environmental uncertainty, while maximizing the gain from the predictable aspects of the niche. This view is a refinement of the classical Darwinian formulation that natural selection simply favors adaptation to given conditions. Wagensberg points out that recent biological innovations, such as nervous systems and brains, are means to decrease the sensitivity to unpredictable fluctuations. An important aspect of environmental analysis is the ability to recognize events induced by the self and other agents. Wagensberg also points out that by creating alliances agents can give up independence for the benefit of a group, which in turn may increase independence for the group as an entity. Both in its simple one-agent form and in the more tentative analysis of the group model, Wagensberg's theory emphasizes the crucial importance of statistical independence for the evolution of perception, semantics, and indeed cognition. While cognition may be hard to quantify, its direct consequence, human behavior, has a rich phenomenology which is becoming increasingly accessible to modeling. The digitalization of everyday life as reflected, say, in telecommunication, commerce, and media usage allows quantification and modeling of human patterns of activity, often at the level of individuals. Grouping of events or objects in categories is
[Figure 1: two scatter plots — left: feature axes x_1 vs x_2; right: LATENT COMPONENT 2 vs LATENT COMPONENT 4.]
Fig. 1. Generic feature distribution produced by a linear mixture of sparse sources (left) and a typical 'latent semantic analysis' scatter plot of principal component projections of a text database (right). The characteristic of a sparse signal is that it consists of relatively few large magnitude samples on a background of small signals. Latent semantic analysis of the so-called MED text database reveals that the semantic components are indeed very sparse and do follow the latent directions (principal components). Topics are indicated by the different markers. In [6] an ICA analysis of this data set, post-processed with a simple heuristic classifier, showed that manually defined topics were very well aligned with the independent components, hence constituting an example of cognitive component analysis: unsupervised learning leads to a label structure corresponding to that of human cognitive activity.
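The 'ray' signature of a sparse linear mixture described in the caption can be reproduced with a few lines of synthetic data. This is only a minimal sketch: the mixing matrix and source distribution are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent sparse (heavy-tailed) sources: a few large-magnitude
# samples on a background of small signals, as described in the caption.
S = rng.laplace(size=(2, 5000)) ** 3  # cubing exaggerates the sparsity

# Linear mixture X = A S, the generative model assumed by ICA.
A = np.array([[1.0, 0.4], [0.3, 1.0]])
X = A @ S

# Sparsity shows up as heavy tails: the excess kurtosis of the mixtures is
# strongly positive, and a scatter plot of X[0] vs X[1] shows 'rays' along
# the directions given by the columns of A.
def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

print(excess_kurtosis(X[0]), excess_kurtosis(X[1]))
```

Plotting `X[0]` against `X[1]` with such sources yields the star-shaped scatter of Figure 1 (left).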
fundamental to human cognition. In machine learning, classification is a rather well-understood task when based on labelled examples [7]. In this case classification belongs to the class of supervised learning problems. Clustering is a closely related unsupervised learning problem, in which we use general statistical rules to group objects without a priori providing a set of labelled examples. It is a fascinating finding in many real world data sets that the label structure discovered by unsupervised learning closely coincides with labels obtained by letting a human, or a group of humans, perform classification: labels derived from human cognition. We thus define cognitive component analysis (COCA) as unsupervised grouping of data such that the ensuing group structure is well-aligned with that resulting from human cognitive activity [8]. This presentation is based on our earlier results using ICA for abstract data such as text, dynamic text (chat), and web pages including text and images, see e.g., [9-13].
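The COCA definition can be illustrated with a deliberately simple sketch: unsupervised grouping recovering a 'human' label structure without ever seeing the labels. Everything here is invented for illustration (the data, the cluster count, the seeding), and a hand-rolled 2-means loop stands in for a real clustering method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a 'cognitive' label structure: two well-separated groups,
# as if labelled by humans.
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
human = np.array([0] * 100 + [1] * 100)

# Minimal 2-means: unsupervised grouping that never sees the labels.
centers = X[[0, 100]]  # one seed from each half, purely to keep the sketch stable
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # point-center distances
    assign = d.argmin(axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in range(2)])

# Alignment with the human labels (up to permutation of cluster ids).
agreement = max(np.mean(assign == human), np.mean(assign != human))
print(agreement)
```

When the agreement is close to 1.0, the unsupervised group structure is well-aligned with the human label structure, which is exactly the COCA criterion.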
2 Where have we found cognitive components?
Text analysis. Symbol manipulation as in text is a hallmark of human cognition. Salton proposed the so-called vector space representation for statistical modeling of text data; for a review see [14]. A term set is chosen and a document is represented by the vector of term frequencies. A document database then forms a so-called term-document matrix. The vector space representation can be used for classification and retrieval by noting that similar documents are expected to be 'close' in the vector space. A metric can be based on the simple Euclidean distance if document vectors are properly normalized; otherwise angular distance may be useful. This approach is principled, fast, and language independent. Deerwester and co-workers developed the concept of latent semantics based on principal component analysis of the term-document matrix [15]. The fundamental observation behind the latent semantic indexing (LSI) approach is that similar documents use similar vocabularies; hence, the vectors of a given topic could appear as produced by a stochastic process with highly correlated term-entries. By projecting the term-frequency vectors onto a relatively low dimensional subspace, say determined by the maximal amount of variance, one is able to filter out the inevitable 'noise'. Noise should here be thought of as individual document differences in term usage within a specific context. For well-defined topics, one could simply hope that a given context would have a stable core term set that would come out as an eigen-'direction' in the term vector space. The orthogonality constraint on co-variance matrix eigenvectors, however, often limits the interpretability of the LSI representation, and LSI is therefore more often used as a dimensionality reduction tool. The representation can be post-processed to reveal cognitive components, e.g., by interactive visualization schemes [16]. In Figure 1 (right) we show the scatter plot of a small text database. The database consists of documents with overlapping vocabulary but five different (high level cognitive) labels. The 'ray' structure signaling a sparse linear mixture is evident.
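The vector space model and the LSI projection just described can be sketched in a few lines. The four-document corpus and the choice k = 2 are invented for illustration; real analyses would use a large term-document matrix such as the MED database above.

```python
import numpy as np

# Tiny hypothetical corpus with two topics.
docs = [
    "heart attack blood pressure", "blood pressure heart disease",
    "kidney stone treatment", "kidney treatment renal stone",
]
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix of raw term frequencies (Salton's vector space model).
A = np.array([[d.split().count(t) for d in docs] for t in terms], float)

# Latent semantic indexing: project documents onto the top-k singular
# directions (the maximal-variance subspace) to filter term-usage 'noise'.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the latent subspace

# Same-topic documents end up close (high cosine), cross-topic documents far.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_coords[0], doc_coords[1]), cos(doc_coords[0], doc_coords[2]))
```

The angular distance mentioned in the text corresponds to comparing these cosine values.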

Social networks. The ability to understand social networks is critical to humans. Is it possible that the simple unsupervised scheme for identification of independent components could play a role in this human capacity? To investigate this issue we have initiated an analysis of a well-known social network of some practical importance. The so-called actor network is a quantitative representation of the co-participation of actors in movies; for a discussion of this network, see e.g., [17]. The observation model for the network is not too different from that of text. Each movie is represented by its cast, i.e., the list of actors. We have converted the table of the about T = 128,000 movies with a total of J = 382,000 individual actors into a sparse J x T matrix. For visualization we have projected the data onto principal components (LSI) of the actor-actor co-variance matrix. The eigenvectors of this matrix are called 'eigencasts' and represent characteristic communities of actors that tend to co-appear in movies. The sparsity and magnitude of the network mean that the components are dominated by communities with very small intersections; however, a closer look at such scatter plots reveals detail suggesting that a simple linear mixture model indeed provides a reasonable representation of the (small) coupling between these relatively trivial disjoint subsets, see Figure 2. Such insight may be used for computer-assisted navigation of collaborative, peer-to-peer networks, for example in the context of search and retrieval.

[Figure 2: scatter plot of EIGENCAST 3 vs EIGENCAST 5.]

Fig. 2. The so-called actor network quantifies the collaborative pattern of 382,000 actors participating in almost 128,000 movies. For visualization we have projected the data onto principal components (LSI) of the actor-actor co-variance matrix. The eigenvectors of this matrix are called 'eigencasts' and they represent characteristic communities of actors that tend to co-appear in movies. The network is extremely sparse, so the most prominent variance components are related to near-disjoint sub-communities of actors with many common movies. However, a close-up of the coupling between two latent semantic components (the region around (0,0)) reveals the ubiquitous signature of a sparse linear mixture: a pronounced 'ray' structure emanating from (0,0). The ICA components are color coded. We speculate that the cognitive machinery developed for handling of independent events can also be used to locate independent sub-communities, hence, to navigate complex social networks.
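A miniature version of the eigencast computation might look as follows. The cast lists and actor names are invented; the real input would be the sparse 382,000 x 128,000 actor-movie matrix described above.

```python
import numpy as np

# Hypothetical miniature actor network: two sub-communities ({a, b} and
# {d, e}) weakly coupled by actor_c, who appears in one movie of each.
movies = {
    "m1": ["actor_a", "actor_b"], "m2": ["actor_a", "actor_b", "actor_c"],
    "m3": ["actor_d", "actor_e"], "m4": ["actor_d", "actor_e", "actor_c"],
}
actors = sorted({a for cast in movies.values() for a in cast})
M = np.array([[1.0 if a in cast else 0.0 for cast in movies.values()]
              for a in actors])  # actor-movie incidence matrix

# 'Eigencasts': eigenvectors of the actor-actor co-variance matrix, i.e. the
# left singular vectors of the row-centered actor-movie matrix.
Mc = M - M.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
eigencasts = U.T

# The leading eigencast separates the two near-disjoint sub-communities with
# opposite signs; the bridging actor_c loads near zero on it.
print(dict(zip(actors, np.round(eigencasts[0], 2))))
```

On real data the same projections, plotted pairwise, produce the scatter plots discussed in Figure 2.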
Musical genre. The growing market for digital music and intelligent music services creates an increasing interest in modeling of music data. It is now feasible to estimate consensus musical genre by supervised learning from rather short music segments, say 5-10 seconds, see e.g., [18], thus enabling computerized handling of music requests at a high cognitive complexity level. To understand the possibilities and limitations for unsupervised modeling of music data we here visualize a small music sample using the latent semantic analysis framework. The intended use is for a music search engine function; hence, we envision that
[Figure 3: 5 x 5 scatter-plot matrix of the data in the first five latent dimensions, PC 1-PC 5.]
Fig. 3. We represent three music tunes (genre labels: heavy metal, jazz, classical) by their spectral content in overlapping small time frames (w = 30 msec, with an overlap of 10 msec; see [18] for details). To make the visualization relatively independent of 'pitch', we use the so-called mel-cepstral representation (MFCC, K = 13 coefficients per frame). To reduce noise in the visualization we have 'sparsified' the amplitudes. This was achieved simply by keeping coefficients that belonged to the upper 5% magnitude percentile. The total number of frames in the analysis was F = 10^5. Latent semantic analysis provided unsupervised subspaces with maximal variance for a given dimension. We show the scatter plots of the data for the first 1-5 latent dimensions. The scatter plots below the diagonal have been 'zoomed' to reveal more details of the ICA 'ray' structure. For interpretation we have coded the data points with signatures of the three genres involved: classical (), heavy metal (diamond), jazz (+). The ICA ray structure is striking; however, note that the situation is not one-to-one (ray to genre) as in the small text databases. A component (ray) quantifies a characteristic musical 'theme' at the temporal level of a frame (30 msec), i.e., an entity similar to the 'phoneme' in speech.
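The sparsification and LSI steps from the caption can be sketched with synthetic data standing in for real MFCC frames. The Gaussian stand-in and the frame count are invented for the sketch; the real input would be K = 13 mel-cepstral coefficients per 30 msec frame.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for MFCC data: K = 13 coefficients per frame.
K, F = 13, 1000
mfcc = rng.normal(size=(K, F))

# 'Sparsify' as in the caption: keep only coefficients whose magnitude lies
# in the upper 5% percentile, zeroing the rest.
threshold = np.quantile(np.abs(mfcc), 0.95)
sparse = np.where(np.abs(mfcc) >= threshold, mfcc, 0.0)

# Latent semantic analysis step: maximal-variance subspace via SVD of the
# row-centered data; keep the first five latent dimensions as in Figure 3.
U, s, Vt = np.linalg.svd(sparse - sparse.mean(axis=1, keepdims=True),
                         full_matrices=False)
frames_5d = Vt[:5].T * s[:5]  # frame coordinates in latent dimensions 1-5

print(np.mean(sparse != 0))  # fraction of surviving coefficients, about 0.05
```

Pairwise scatter plots of the columns of `frames_5d` correspond to the panels of Figure 3.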
a largely text-based query has resulted in a few music entries, and the algorithm is going to find the group structure inherent in the retrieval for the user. We

References (partial)

- Multimedia Image and Video Processing.
- Signal Detection using ICA: Application to Chat Room Topic Spotting.
- Independent component analysis for understanding multimedia content.
- On Independent Component Analysis for Multimedia Signals.
- Decision time horizon for music genre classification using short time features.