Archived at the Flinders Academic Commons:

http://dspace.flinders.edu.au/dspace/

This is the authors’ version of an article published in

Lecture Notes in Computer Science. The original publication

is available by subscription at:

http://link.springer.com/

doi: 10.1007/978-3-642-38786-9_17

Please cite this article as:

Powers, D.M. (2013). A computationally and cognitively plausible model of supervised and unsupervised learning. In D. Liu et al. (Eds.), Advances in Brain Inspired Cognitive Systems: Vol. 7888. 6th International Conference, BICS 2013, Beijing, China, June 9-11, 2013. Proceedings (pp. 145-156). Berlin: Springer Berlin Heidelberg.

Copyright (2013) Springer-Verlag. All rights reserved. Please note that any alterations made during the publishing process may not appear in this version. The final publication is available at link.springer.com.


A computationally and cognitively plausible model
of supervised and unsupervised learning

David M. W. Powers 1,2

1 CSEM Centre for Knowledge & Interaction Technology, Flinders University,
Adelaide, South Australia
2 Beijing Municipal Lab for Multimedia & Intelligent Software, BJUT,
Beijing, China

powers@acm.org

Abstract. Both empirical and mathematical demonstrations of the importance

of chance-corrected measures are discussed, and a new model of learning is

proposed based on empirical psychological results on association learning. Two

forms of this model are developed, the Informatron as a chance-corrected

Perceptron, and AdaBook as a chance-corrected AdaBoost procedure.

The computational results presented show that chance correction facilitates learning.

Keywords: Chance-corrected evaluation, Kappa, Perceptron, AdaBoost

1 Introduction*

The issue of chance correction has been discussed for many decades in the context of

statistics, psychology and machine learning, with multiple measures being shown to

have desirable properties, including various definitions of Kappa or Correlation, and

the psychologically validated ΔP measures. In this paper, we discuss the relationships

between these measures, showing that they form part of a single family of measures,

and that using an appropriate measure can positively impact learning.

1.1 What’s in a “word”?

In the Informatron model we present, we aim to model results in human
association and language processing. The typical task is a word association model,

but other tasks may focus on syllables or rimes or orthography. The “word” is not a

well-defined unit psychologically or linguistically, and is arguably now a concept
back-formed from modern orthography. Thus we use “word” for want of a better word, and the

scare quotes should be imagined to be there at all times, although they are frequently

omitted for readability! (Consider “into” vs “out of”, “bring around” vs “umbringen”.)

* An extended abstract based on an earlier version has been submitted for presentation to the Cognitive Science Society (in accordance with their policy of being of “limited circulation”).


1.2 What’s in a “measure”?

A primary focus of this paper is the inadequacy of currently used measures such as

Accuracy, True Positive Rate, Precision, F-score, etc. Alternate chance-corrected

measures have been advocated in multiple areas of cognitive, computational and

physical science, and in particular in Psychology in the specific context of

(unsupervised) association learning [1-3], where ΔP is considered “the normative

measure of contingency”.

In parallel, discontent with misleading measures of accuracy was building in

Statistics [4,5], Computational Linguistics [6] and Machine Learning [7] and

extended to the broader Cognitive Science community [8]. Reversions to older

methods such as Kappa and Correlation (and ROC AUC, AUK, etc.) were proposed,

and in this paper we explore learning models that directly optimize such measures.

2 Informedness, Correlation & DeltaP

The concept of chance-corrected accuracy measures has been reinvented several times

in several contexts, with some of the most important being Kappa variants [4,5].

This is an ad hoc approach that subtracts from accuracy (Ac) an estimate of the

chance-level accuracy (EAc) and renormalizes to the form of a probability

Κ=(Ac–EAc)/(1–EAc). But different forms of chance estimate, different forms of

normalization, and different generalizations to multiple classes or raters/predictors,

lead to a whole family of Kappa measures of which ΔP turns out to be one, and ΔP’

another [9]. The geometric mean of these two unidirectional measures is correlation,

which is thus a measure of mean association over both directions of an A↔B relation

between events. Perruchet and Peereman [3] focus on an A, B word sequence and

define ΔP as a chance-corrected version of TP = P(B|A), corresponding to Precision

(proportion of events A that predict B correctly), whilst ΔP’ corrects TP’ = P(A|B)

which is better known as TPR, Sensitivity or Recall, meaning the proportion of events

B that are predicted by A – on the assumption that forward prediction A→B is

normative. They argue for comparing TP with a baseline of how often event B occurs

when not preceded by A so that ΔP = P(B|A) – P(B|¬A) and ΔP’ = P(A|B) – P(A|¬B).
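As a concrete illustration of these definitions, the following sketch computes ΔP and ΔP’ from a 2×2 table of co-occurrence counts. The counts and variable names are our own invention for illustration, not data from the cited experiments.

```python
# Invented co-occurrence counts for an A, B word-sequence setting
# (purely illustrative; not data from the cited studies).
n_ab, n_a_notb = 30, 10         # A followed by B / A not followed by B
n_nota_b, n_nota_notb = 15, 45  # B without preceding A / neither

n_a, n_nota = n_ab + n_a_notb, n_nota_b + n_nota_notb
n_b, n_notb = n_ab + n_nota_b, n_a_notb + n_nota_notb

# DeltaP  = P(B|A) - P(B|not A): forward contingency
delta_p = n_ab / n_a - n_nota_b / n_nota
# DeltaP' = P(A|B) - P(A|not B): backward contingency
delta_p_prime = n_ab / n_b - n_a_notb / n_notb
print(delta_p, delta_p_prime)
```

With these counts the forward and backward contingencies differ, which is exactly the asymmetry the experiments above probe.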

Empirically ΔP’ is stronger than ΔP in these experiments, and TP and TP’ are

much weaker, with TP failing to achieve a significant result for either Children or

Adults in their experiments. Why should the reverse direction be stronger? One

reason may be that an occurrence in the past is more definite for the speaker and has

been more deeply processed for the hearer. Furthermore, often a following segment

may help disambiguate a preceding one. Thus in computational work at both word

level and phoneme/grapheme level, the preceding two units and the succeeding three

units seem to be optimal in association-based syntax and morphology learning

models [10,11], and two-sided context has also proven important in semantic models

[12]. However, Flach [7] and Powers [8] independently derived ΔP’-equivalent

measures, not ΔP, as a skew/chance independent measure for A→B predictions as the

information value relates to (and should be conditioned on) the prevalence of B not A.


In view of these Machine Learning proofs we turn there to introduce and motivate

definitions in a statistical notation that conflicts with that quoted above from the

Psychology literature. We use systematic acronyms [7,8] in upper case for counts,

lower case for rates or probabilities. In dichotomous Machine Learning [7] we

assume that we have for each instance a Real class label which is either Positive or

Negative (counts, RP or RN, rates rp=RP/N and rn=RN/N where we have N instances

labelled). We assume that our classifier, or in Association Learning the predictor,

specifies one Predicted class label as being the most likely for each instance (counts,

PP or PN, probs pp and pn). We further define True and False Positives and Negatives

based on whether the prediction P or N was accurate or not (counts, TP, TN, FP, FN;

probs tp, tn, fp, fn; rates tpr=tp/rp, tnr=tn/rn, fpr=fp/rn).

Table 1: Prob notation for dichotomous contingency matrix.

        +R    −R
  +P    tp    fp    pp
  −P    fn    tn    pn
        rp    rn     1

Whilst the above systematic notation is convenient for derivations and proofs,

these probabilities (probs) are known by a number of different names and we will use

some of these terms (and shortened forms) for clarity of equations and discussions.

The probs rp and rn are also known as Prevalence (Prev) and Inverse Prevalence

(IPrev), whilst pp and pn are Bias and Inverse Bias (IBias) resp. Also Recall and

Sensitivity are synonyms for true positive rate (tpr), whilst Inverse Recall and

Specificity correspond to true negative rate (tnr). The term rate is used when we are

talking about the rate of finding or recalling the real item or label, that is the

proportion of the real items with the label that are recalled. When we are talking

about the accuracy of a prediction in the sense of how many of our predictions are

accurate we use the term accuracy, with Precision (Prec) or true positive accuracy

being tpa=tp/pp, and Inverse Precision or true negative accuracy being tna=tn/pn, and

our (perverse) prediction accuracy for false positives being fpa=fp/pp, and

correspondingly fna=fn/pn for the perverse accuracy of predicting the wrong (false)

class. Names for other probs [13] won’t be needed.
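A minimal sketch tracing this notation on invented counts (all names follow the acronyms above; the numbers are arbitrary):

```python
# The systematic notation above, traced on invented counts: upper case
# for counts, lower case for rates/probabilities.
TP, FP, FN, TN = 45, 5, 15, 35
N = TP + FP + FN + TN
tp, fp, fn, tn = TP / N, FP / N, FN / N, TN / N

rp, rn = tp + fn, fp + tn   # Prevalence (Prev), Inverse Prevalence (IPrev)
pp, pn = tp + fp, fn + tn   # Bias, Inverse Bias (IBias)

tpr = tp / rp   # Recall / Sensitivity
tnr = tn / rn   # Inverse Recall / Specificity
tpa = tp / pp   # Precision (Prec)
tna = tn / pn   # Inverse Precision (IPrec)
print(rp, pp, tpr, tpa)
```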

The chance-corrected measure ΔP’ turns out to be the dichotomous case of

Informedness, the probability that a prediction is informed with respect to the real

variable (rather than chance). This was proven based on considerations of odds-

setting in horse-racing, and is well known as a mechanism for debiasing multiple

choice exams [8,13]. It has also been derived as skew-insensitive Weighted Relative

Accuracy (siWRAcc) based on consideration of ROC curves [7]. As previously

shown in another notation, it is given by:

ΔP’ = tpr–fpr = tpr+tnr–1 = Sensitivity + Specificity – 1 (1)


The inverse concept is Markedness, the probability that the predicting variable is

actually marked by the real variable (rather than occurring independently or randomly).

This reduces to ΔP in the dichotomous case:

ΔP = tpa – fna = tpa + tna – 1 = Prec + IPrec – 1 (2)

As noted earlier, the geometric mean of ΔP and ΔP’ is Matthews Correlation

(Perruchet & Peereman, 2004), and kappas and correlations all correspond to different

normalizations of the determinant of the contingency matrix [13]. It is noted that ΔP’ is

recall-like, based on the rate we recall or predict each class, whilst ΔP is precision-like,

based on the accuracy of our predictions of each label.
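This relationship is easy to check numerically. The following sketch, with invented cell probabilities, computes both ΔPs and the Matthews correlation directly from a Table 1-style layout:

```python
import math

# Numerical check (invented cell probabilities) that the geometric
# mean of DeltaP and DeltaP' recovers the Matthews correlation.
tp, fp, fn, tn = 0.45, 0.05, 0.15, 0.35   # joint probs, as in Table 1
rp, rn = tp + fn, fp + tn                 # Prevalence, Inverse Prevalence
pp, pn = tp + fp, fn + tn                 # Bias, Inverse Bias

delta_p_prime = tp / rp - fp / rn   # tpr - fpr = tpr + tnr - 1
delta_p = tp / pp - fn / pn         # tpa + tna - 1

# Matthews correlation as the normalized determinant of the matrix
mcc = (tp * tn - fp * fn) / math.sqrt(rp * rn * pp * pn)
gmean = math.sqrt(delta_p * delta_p_prime)  # valid when both are positive
print(mcc, gmean)
```

For a table with positive determinant the two printed values agree; for a negative determinant the geometric mean must carry the common sign of the two ΔPs.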

The Kappa interpretation of ΔP and ΔP’ in terms of correction for Prevalence and

Bias [9,13] is apparent from the following equations (noting that Prev<1 is assumed,

and Bias<1 is thus a requirement of informed prediction, and E(Acc)<1 for any

standard Kappa model):

Kappa = (Accuracy – E(Acc)) / (1 – E(Acc))

ΔP’ = (Recall – Bias) / (1 – Prevalence) (3)

ΔP = (Precision – Prevalence) / (1 – Bias) (4)

If we think only in terms of the positive class, and have an example with high natural

prevalence, such as water being a noun say 90% of the time, then it is possible to do

better by guessing noun all the time than by using a part of speech determining

algorithm that is only say 75% accurate [6]. Then if we are guessing our Precision

will follow Prevalence (90% of our noun predictions will be nouns) and Recall will

follow Bias (100% of our noun occurrences will be recalled correctly, 0% of the others).
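The arithmetic of this pitfall can be sketched as follows (the 90% Prevalence is from the text; the rest of the construction is our own):

```python
# Majority-class pitfall: with Prevalence 0.9, always guessing "noun"
# scores 90% Accuracy yet carries no information about the instance.
rp, rn = 0.9, 0.1           # "noun" 90% of the time
tp, fp = 0.9, 0.1           # always predict positive, so Bias pp = 1
pp = tp + fp

accuracy = tp               # 0.9, beating a 75%-accurate tagger
recall = tp / rp            # 1.0: Recall follows Bias
precision = tp / pp         # 0.9: Precision follows Prevalence
informedness = tp / rp - fp / rn   # tpr - fpr = 0: no information
print(accuracy, recall, precision, informedness)
```

Chance correction thus exposes the guesser: Accuracy looks strong while Informedness is zero.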

We can see that these chance levels are subtracted off in (3) and (4), but unlike the

usual kappas, a different chance level estimate is used in the denominator for

normalization to a probability – and unlike the other kappas, we actually have a

well-defined probability, namely the probability of an informed prediction or of a marked

predictor resp. The insight into the alternate denominator comes from consideration

of the amount of room for improvement. The gain due to Bias in (3) is relative to

the chance level set by Prevalence, as ΔP’ can increase only so much by dealing with

only one class – how much is missed by this blind ‘positive’ focus of tpr or Recall on

the positive class is captured by the Inverse Prevalence, (1 – Prevalence).

Informedness and Markedness in the general multiclass case, with K classes and the

corresponding one-vs-rest dichotomous statistics indexed by k, are simply

Informedness = Σ_k Bias_k ΔP’_k (5)

Markedness = Σ_k Prev_k ΔP_k (6)
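A sketch of these two sums for an invented K=3 joint-probability table, with the one-vs-rest statistics derived from the marginals (the matrix values are arbitrary):

```python
# Sketch of equations (5) and (6) for K = 3 classes, using an invented
# joint-probability contingency matrix c[p][r] (rows are predictions).
K = 3
c = [[0.20, 0.05, 0.05],
     [0.03, 0.25, 0.02],
     [0.02, 0.05, 0.33]]

bias = [sum(row) for row in c]                               # Bias_k
prev = [sum(c[p][r] for p in range(K)) for r in range(K)]    # Prev_k

def delta_p_prime(k):
    """One-vs-rest DeltaP'_k = tpr_k - fpr_k."""
    tp, fp = c[k][k], bias[k] - c[k][k]
    return tp / prev[k] - fp / (1 - prev[k])

def delta_p(k):
    """One-vs-rest DeltaP_k = tpa_k + tna_k - 1."""
    tp, fn = c[k][k], prev[k] - c[k][k]
    return tp / bias[k] - fn / (1 - bias[k])

informedness = sum(bias[k] * delta_p_prime(k) for k in range(K))  # eq. (5)
markedness = sum(prev[k] * delta_p(k) for k in range(K))          # eq. (6)
print(informedness, markedness)
```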

Informedness can also be characterized as an average cost over the contingency table

cells c_pr, where the cost of a particular prediction p versus the real class r is given by

the Bookmaker odds: what you win or lose is inversely determined by the prevalence

of the horse you predict (bet on) winning (p=r) or losing (p≠r) – using a programming

convention for Boolean expressions here, (true,false)=(1,0), define Gain G_pr to have

G_pr = 1/(Prev_p – D_pr) where D_pr = (p≠r) (7)

Informedness = Σ_p Bias_p [Σ_r c_pr G_pr] (8)
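Under the reading that the cells c_pr are joint probabilities (our interpretation of the notation above; the cell values are invented), the Bookmaker average can be checked against the dichotomous identity of equation (1):

```python
# Sketch of the Bookmaker costing of equations (7) and (8), checked
# against the dichotomous identity DeltaP' = tpr + tnr - 1.
c = [[0.45, 0.05],    # c[p][r]: joint probabilities, rows = predictions
     [0.15, 0.35]]
bias = [sum(row) for row in c]
prev = [c[0][0] + c[1][0], c[0][1] + c[1][1]]

def gain(p, r):
    # Bookmaker odds, eq. (7): D_pr = (p != r) as a 0/1 value, so a
    # correct bet wins 1/Prev_p and a wrong one loses 1/(1 - Prev_p).
    return 1.0 / (prev[p] - (p != r))

# eq. (8): Bias-weighted average of the per-prediction expected gains
informedness = sum(bias[p] * sum(c[p][r] * gain(p, r) for r in range(2))
                   for p in range(2))

tpr, tnr = c[0][0] / prev[0], c[1][1] / prev[1]
print(informedness, tpr + tnr - 1)
```

The printed values coincide: betting at Bookmaker odds recovers exactly the chance-corrected ΔP’ in the dichotomous case.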
