
A Solution to Wiehagen's Thesis

Timo Kötzing
- 01 Apr 2017
- Vol. 60, Iss. 3, pp. 498-520


A Solution to Wiehagen’s Thesis
Timo Kötzing
Friedrich-Schiller-Universität Jena, Jena, Germany
timo.koetzing@uni-jena.de
Abstract
Wiehagen’s Thesis in Inductive Inference (1991) essentially states that, for each learning criterion,
learning can be done in a normalized, enumerative way. The thesis was not a formal statement
and thus did not allow for a formal proof, but support was given by examples of a number of
different learning criteria that can be learned enumeratively.
Building on recent formalizations of learning criteria, we are now able to formalize Wiehagen’s
Thesis. We prove the thesis for a wide range of learning criteria, including many popular criteria
from the literature. We also show the limitations of the thesis by giving four learning criteria for
which the thesis does not hold (and, in two cases, was probably not meant to hold). Beyond the
original formulation of the thesis, we also prove stronger versions which allow for many corollaries
relating to strongly decisive and conservative learning.
1998 ACM Subject Classification I.2.6 Learning
Keywords and phrases Algorithmic Learning Theory, Wiehagen’s Thesis, Enumeration Learning
Digital Object Identifier 10.4230/LIPIcs.STACS.2014.494
1 Introduction
In Gold-style learning [10] (also known as inductive inference) a learner tries to learn an infinite sequence, given more and more finite information about this sequence. For example, a learner h might be presented longer and longer initial segments of the sequence g = 1, 4, 9, 16, .... After each new datum of g, h may output a description of a function (for example, a Turing machine program computing that function) as its conjecture. h might output a program for the constantly-1 function after seeing the first element of this sequence g, and then, as soon as more data is available, a program for the squaring function. Many criteria for saying whether h is successful on g have been proposed in the literature. Gold, in his seminal paper [10], gave a first, simple learning criterion, later called Ex-learning¹, where a learner is successful iff it eventually stops changing its conjectures, and its final conjecture is a correct program (computing the input sequence).

Trivially, each single, describable sequence g has a suitable constant function as an Ex-learner (this learner constantly outputs a description for g). Thus, we are interested in sets of total computable functions S for which there is a single learner h learning each member of S (those sets S are then called Ex-learnable).

Gold [10] showed an important class of sets of functions to be Ex-learnable:² each uniformly computable set of total functions is Ex-learnable; a set of functions S is uniformly computable iff there is a computable function e such that S = {ϕ_{e(n)} | n ∈ N}. The corresponding learner learns by enumeration: in every iteration, it finds the first index n such that ϕ_{e(n)} is consistent with all known data, and outputs e(n) as the conjecture.

∗ We would like to thank Sandra Zilles for bringing Wiehagen's Thesis in connection with the approach of abstractly defining learning criteria, as well as the anonymous reviewers for their friendly and helpful suggestions.
¹ "Ex" stands for explanatory.
² We let N = {0, 1, 2, ...} be the set of natural numbers and we fix a coding for programs based on Turing machines letting, for any program (code) p ∈ N, ϕ_p be the function computed by the Turing machine coded by p.
However, it is well-known that there are sets which are not uniformly computable, yet Ex-learnable. Blum and Blum [6] gave the following example. Let e be a total computable listing of programs such that the predicate ϕ_{e(n)}(x) = y is decidable in n, x and y. Crucially, some of the ϕ_{e(n)} may be undefined on some arguments; these functions are not required to be learned, but the set of all the total functions enumerated is Ex-learnable. This uses the same strategy as for uniformly computable sets of functions, but this learning already goes beyond enumeration of all and only the learned functions, as there are sets which are so learnable, but not uniformly computable. The price is that the learner may give intermediate conjectures e(n) which are programs for partial functions; this is necessarily so, as noted in [9].
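To make the identification-by-enumeration strategy concrete, here is a minimal sketch in Python. It covers only the simple case of an explicitly given, uniformly computable family (standing in for the listing e); it does not model the partial functions of the Blum and Blum example, and all names in it are illustrative rather than taken from the paper.

FAMILY = [
    lambda x: 1,              # stands in for phi_{e(0)}: the constantly-1 function
    lambda x: x,              # stands in for phi_{e(1)}: the identity
    lambda x: (x + 1) ** 2,   # stands in for phi_{e(2)}: the sequence 1, 4, 9, 16, ...
]

def enumeration_learner(segment):
    # Given the initial segment g(0), ..., g(i-1), output the least index n such
    # that phi_{e(n)} is consistent with all known data.
    for n, f in enumerate(FAMILY):
        if all(f(x) == y for x, y in enumerate(segment)):
            return n
    return None  # cannot happen if the learnee belongs to the family

g = [(i + 1) ** 2 for i in range(6)]       # the learnee g = 1, 4, 9, 16, ...
print([enumeration_learner(g[:i]) for i in range(len(g) + 1)])
# prints [0, 0, 2, 2, 2, 2, 2]: after finitely many mind changes the learner
# settles on a correct index and never changes again, which is Ex-learning.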
As already shown by Wiehagen [16], there are Ex-learnable sets of functions that cannot be learned while always having a hypothesis that is consistent with the known data. Thus, the above strategy for learning employed by Blum and Blum [6] is not applicable for all learning tasks. In [17, 18] Wiehagen asked whether there is a more general strategy which also enumerates a list of candidate conjectures and is applicable to all Ex-learnable sets. He showed that this is indeed possible, giving an insightful characterization of Ex-learning.
A main focus of the research in inductive inference defines learning criteria that are different from (but usually similar in flavor to) Ex-learning. For example, consistent learning requires that each conjecture is consistent with the known data; monotone learning requires the sequence of conjectures to be monotone with respect to inclusion of the graphs of the computed functions. Wiehagen also gives characterizations for these learning criteria and more. Other researchers give similar characterizations; recent work in this area includes, for example, [1]. For any learning criterion I we are again interested in sets of total computable functions S for which there is a single learner h which learns every function in S in the sense specified by I; we call such S I-learnable.
Wiehagen was inspired by his work to conjecture a general structure of learning, as stated in his Thesis in Inductive Inference [18], which we rephrase in the language of this paper:

    Let I be any learning criterion. Then for any I-learnable class S, an enumeration of programs e can be constructed such that S is I-learnable with respect to e by an enumerative learner.

Note that [18] called a learning criterion an "inference type" and a learner an "inference strategy". About his thesis, Wiehagen [18] wrote that "We do not exclude that one nice day a formal proof of this thesis will be presented. This would require 'only' to formalize the notions of 'inference type' and 'enumerative inference strategy' which does not seem to be hopeless. But up to this moment we prefer 'verifying' our thesis analogously as it has been done with 'verifying' Church's thesis, namely by formally proving it for 'real', reasonable, distinct inference types."
Recently, the notion of a learning criterion was formalized in [13] (see Section 2.1 for the formal notions relevant to this paper). Our first contribution in this paper is a formalization of "enumeration learner" in Definition 2. It is in the nature of the very general thesis that any formalization may be too broad in some respects and too narrow in others. For example, our formalizations exclude some learning criteria, such as finite learning, learning by non-total learners, and criteria featuring global restrictions on the learner. However, for the scope of our definitions, we already get very strong and insightful results in this paper.
In Theorem 3 we discuss four different learning criteria for which the thesis does not hold.
The first one is prediction, which attaches a totally different meaning to the “conjectures”
than Ex-learning (the thesis was probably never meant to hold for such learning criteria).
The second criterion involves mandatory oscillation between (correct) conjectures, which is in
immediate contradiction to enumerative learning. The third learning criterion is transductive
learning, where the learner has very little information in each iteration. The fourth is
learning in a non-standard hypothesis space. The last two learning criteria do not contradict
enumerative learning directly, but still demand too much for learning by enumeration.
In Section 4 we show that there is a broad core of learning criteria for which Wiehagen’s
Thesis holds. For this we introduce the notion of a pseudo-semantic restriction, where only
the semantics of conjectures and possibly the occurrence of mind changes matter, but not
other parts of their syntax. Theorem 10 shows that Wiehagen’s Thesis holds in the case of
full information learning (like in Ex-learning given above, where the learner only gets more
information in each iteration) when all restrictions are pseudo-semantic, and in Theorem 16
we see that the same holds in the case of iterative learning (a learning model in which a
learner has a restricted memory). Note that these two theorems already cover a very wide
range of learning criteria from the literature, including all given by Wiehagen [18].
Finally, going beyond the scope of Wiehagen's Thesis, we show that we can assume the enumeration e of programs to be semantically 1-1 (each e(n) codes for a different function) if we assume a little bit more about the learning criteria, namely that their restrictions allow for patching and erasing (see Definition 11). This is formally shown in Theorem 13 (for the case of full information learning) and in Theorem 17 (for the case of iterative learning). Example criteria to which these theorems apply include Ex-learning, as well as consistent and monotone learning. Wiehagen [18] already pointed out in special cases that one can get such semantically 1-1 enumerations. From these results on learning with a semantically 1-1 enumeration we can derive corollaries to conclude that the learning criteria to which the theorems apply allow for strongly decisive and conservative learning (see Definition 1); for example, for plain Ex-learning, this proves (a stronger version of) a result from [15] (which showed that Ex-learning can be done decisively). Note that all positive results are sufficient conditions for enumerative learnability; except for the (weak) condition given in Remark 9, we could not find interesting necessary conditions.
The benefits of this work are threefold. First, we address a long-open problem in its
essential parts. Second, we derive results about (strongly) decisive and conservative learning
in many different settings. Finally, we further develop general techniques to derive powerful
theorems applicable to many different learning criteria, thanks to general notions such as
“pseudo-semantic restriction”.
Note that we omit a number of nontrivial proofs due to space constraints.
2 Mathematical Preliminaries
We fix any computable 1-1 and onto pairing function ⟨·, ·⟩ : N × N → N; whenever we consider tuples of natural numbers as input to a function, it is understood that the general coding function ⟨·, ·⟩ is used to code the tuples into a single natural number. We similarly fix a coding for finite sets and sequences, so that we can use those as input as well. We use ∅ to denote the empty sequence; for every non-empty sequence σ, we let σ⁻ denote the sequence derived from σ by dropping the last listed element.
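As an aside, one standard concrete choice for such a pairing function is the Cantor pairing function; the sketch below (Python) is purely illustrative, since the paper only fixes some computable 1-1 and onto coding, not this particular one.

def pair(x, y):
    # Cantor pairing: <x, y> = (x + y)(x + y + 1)/2 + y, a 1-1 and onto map N x N -> N.
    return (x + y) * (x + y + 1) // 2 + y

def unpair(z):
    # Invert the pairing: find the diagonal w with w(w+1)/2 <= z < (w+1)(w+2)/2.
    w = int(((8 * z + 1) ** 0.5 - 1) / 2)
    while (w + 1) * (w + 2) // 2 <= z:
        w += 1
    while w * (w + 1) // 2 > z:
        w -= 1
    y = z - w * (w + 1) // 2
    return w - y, y

assert all(unpair(pair(x, y)) == (x, y) for x in range(50) for y in range(50))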

If a function f is not defined for some argument x, then we denote this fact by f(x)↑, and we say that f on x diverges; the opposite is denoted by f(x)↓, and we say that f on x converges. If f on x converges to p, then we denote this fact by f(x)↓ = p. For any total computable predicate P, we use µx P(x) to denote the minimal x such that P(x) (undefined, if no such x exists). The special symbol ? is used as a possible hypothesis (meaning "no change of hypothesis").
Unintroduced notation for computability theory follows [14]. P and R denote, respectively, the set of all partial computable and the set of all computable functions (mapping N → N). For any function f : N → N and all i, we use f[i] to denote the sequence f(0), ..., f(i - 1) (undefined, if any one of these values is undefined).
We will use a number of basic computability-theoretic results in this paper. First, we fix a padding function, a 1-1 function pad ∈ R such that ∀p, n, x : ϕ_{pad(p,n)}(x) = ϕ_p(x). Intuitively, pad generates infinitely many syntactically different copies of the semantically same program. We require that pad is monotone increasing in both arguments. The S-m-n Theorem states that there is a 1-1 function s ∈ R such that ∀p, n, x : ϕ_{s(p,n)}(x) = ϕ_p(n, x). Intuitively, s-m-n allows for "hard-coding" arguments to a program.
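To convey the intuition behind padding and s-m-n, the following analogy sketch (Python) treats "programs" as source strings that are evaluated with eval; this is only an informal illustration of the two tools, not the formal setting of the paper.

def pad(source, n):
    # Padding analogy: appending n blank lines yields syntactically different
    # program texts that all compute the same function.
    return source + "\n" * n

def s(source_p, n):
    # s-m-n analogy: from a two-argument program p, uniformly build a one-argument
    # program s(p, n) with the first argument hard-coded to n.
    return "lambda x: (" + source_p + ")(" + str(n) + ", x)"

p = "lambda n, x: n + x"
q = s(p, 5)
print(eval(q)(7))                               # 12, illustrating phi_{s(p,n)}(x) = phi_p(n, x)
print(eval(pad(p, 3))(5, 7) == eval(p)(5, 7))   # True: same semantics, different program text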
2.1 Learning Criteria
In this section we formally introduce our setting of learning in the limit and associated learning criteria. We follow [13] in its "building-blocks" approach for defining learning criteria. A learner is a partial computable function from N to N ∪ {?}. A sequence generating operator is a function β taking as arguments a function h (the learner) and a function g (the learnee) and that outputs a function p. We call p the conjecture sequence of h given g. Intuitively, β defines how a learner can interact with a given learnee to produce a sequence of conjectures.
The most important sequence generating operator is G (which stands for "Gold", who first studied it [10]), which gives the learner full information about the learning process so far; this corresponds to the examples of learning criteria given in the introduction. Formally, G is defined such that ∀h, g, i : G(h, g)(i) = h(g[i]).
We define two additional sequence generating operators It (iterative learning, [16]) and Td (transductive learning, [8]) as follows. For all learners h, learnees g and all i,

  It(h, g)(i) = h(∅)³, if i = 0;
                h(It(h, g)(i - 1), i - 1, g(i - 1)), otherwise.

  Td(h, g)(i) = h(∅), if i = 0;
                Td(h, g)(i - 1), else, if h(i - 1, g(i - 1)) = ?;
                h(i - 1, g(i - 1)), otherwise.

³ h(∅) denotes the initial conjecture (based on no data) made by h.
For both of iterative and transductive learning, the learner is presented with a new datum
each turn (argument/value pair from the learnee in complete and argument-increasing order).
Furthermore, in iterative learning, the learner has access to the previous conjecture, but not
so in transductive learning; however, in transductive learning, the learner can implicitly take
over the previous conjecture by outputting ?.
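A small simulation may help to compare the three operators. The sketch below (Python) computes finite prefixes of G(h, g), It(h, g) and Td(h, g); it simplifies the formal setting in that learners are ordinary Python functions, the initial input is None instead of the empty sequence, coded tuples are Python tuples, and ? is a string. All names are illustrative, not from the paper.

def G(h, g, n):
    # Full information: the i-th conjecture is h applied to the segment g[0..i-1].
    return [h(tuple(g[:i])) for i in range(n)]

def It(h, g, n):
    # Iterative: h sees only its previous conjecture, the position and the newest datum.
    seq = []
    for i in range(n):
        if i == 0:
            seq.append(h(None))                       # initial conjecture on no data
        else:
            seq.append(h((seq[-1], i - 1, g[i - 1])))
    return seq

def Td(h, g, n):
    # Transductive: h sees only the position and the newest datum; answering '?'
    # implicitly keeps the previous conjecture.
    seq = []
    for i in range(n):
        if i == 0:
            seq.append(h(None))
        else:
            c = h((i - 1, g[i - 1]))
            seq.append(seq[-1] if c == '?' else c)
    return seq

# Example: an iterative learner that conjectures the maximum datum seen so far.
h = lambda inp: 0 if inp is None else max(inp[0], inp[2])
print(It(h, [1, 4, 9, 16], 5))   # [0, 1, 4, 9, 16]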
Successful learning requires the learner to observe certain restrictions, for example convergence to a correct index. These restrictions are formalized in our next definition. A sequence acceptance criterion is a predicate δ on a learning sequence and a learnee. The most important sequence acceptance criterion is denoted Ex (which stands for "Explanatory"), already studied by Gold [10]. The requirement is that the conjecture sequence converges (in the limit) to a correct hypothesis for the learnee (we met this requirement already in the introduction). Formally, for any programming system⁴ ψ, we define Ex_ψ as a predicate such that

  Ex_ψ = {(p, g) ∈ R² | ∃n₀, q ∀n ≥ n₀ : p(n) = q ∧ ψ_q = g}.
Standardly we use Ex = Ex_ϕ. We will meet many more sequence acceptance criteria below. We combine any two sequence acceptance criteria δ and δ′ by intersecting them; we denote this by juxtaposition (for example, the sequence acceptance criteria given below are meant to be always used together with Ex).
For any set C ⊆ P of possible learners, any sequence generating operator β and any sequence acceptance criterion δ, (C, β, δ) (or, for short, Cβδ) is a learning criterion. A learner h ∈ C Cβδ-learns the set Cβδ(h) = {g ∈ R | δ(β(h, g), g)}. A set S ⊆ R of possible learnees is called Cβδ-learnable iff there is a function h ∈ C which Cβδ-learns all elements of S (possibly more). Abusing notation, we also use Cβδ to denote the set of all Cβδ-learnable sets (learnable by some learner).
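The following sketch (Python, illustrative names only) shows this building-blocks composition on a finite prefix: a sequence generating operator produces the conjecture sequence, and a sequence acceptance predicate judges it against the learnee. The formal definitions quantify over the whole infinite sequence, and conjectures here are Python functions rather than programs, so this is only a finite analogue.

def G(h, g, n):
    # Prefix version of the full-information operator from the sketch above.
    return [h(tuple(g[:i])) for i in range(n)]

def ex_like(p, g_prefix):
    # Toy acceptance predicate in the spirit of Ex, restricted to a prefix: the final
    # conjecture is correct on all data seen so far.  (The formal Ex condition instead
    # requires syntactic convergence on the infinite conjecture sequence.)
    return all(p[-1](x) == y for x, y in enumerate(g_prefix))

def learns_on_prefix(h, g, n, beta=G, delta=ex_like):
    # A learning criterion pairs an operator beta with a predicate delta:
    # h succeeds on g iff delta(beta(h, g), g).
    return delta(beta(h, g, n), g[:n])

# Example: a learner that always conjectures the squaring sequence.
h = lambda seg: (lambda x: (x + 1) ** 2)
print(learns_on_prefix(h, [1, 4, 9, 16], 4))   # True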
Next we define a number of further sequence acceptance criteria which are of interest for
this paper.
▶ Definition 1. With Cons we denote the restriction of consistent learning [4, 6] (being correct on all known data); with Conf the restriction of conformal learning [17] (being correct or divergent on known data); with Conv we denote the restriction of conservative learning [2] (never abandoning a conjecture which is correct on all known data); with Mon we denote the restriction of monotone learning [12] (conjectures make all the outputs that previous conjectures made; monotonicity in the graphs); finally, with PMon we denote the restriction of pseudo-monotone learning [18] (conjectures make all the correct outputs that previous conjectures made). The following definitions formalize these restrictions.
  Conf = {(p, g) ∈ R² | ∀n ∀x < n : ϕ_{p(n)}(x)↓ → ϕ_{p(n)}(x) = g(x)};
  Cons = {(p, g) ∈ R² | ∀n ∀x < n : ϕ_{p(n)}(x) = g(x)};
  Conv = {(p, g) ∈ R² | ∀n : p(n) ≠ p(n + 1) → ∃x < n + 1 : ϕ_{p(n)}(x) ≠ g(x)};
  Mon  = {(p, g) ∈ R² | ∀i ≤ j ∀x : ϕ_{p(i)}(x)↓ → ϕ_{p(j)}(x) = ϕ_{p(i)}(x)};
  PMon = {(p, g) ∈ R² | ∀i ≤ j ∀x : ϕ_{p(i)}(x) = g(x) → ϕ_{p(j)}(x) = ϕ_{p(i)}(x)}.
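To illustrate how such restrictions constrain a conjecture sequence, the following sketch (Python) checks prefix versions of Cons and Conv, with conjectures given as explicit finite partial functions (dicts) so that everything is trivially decidable; in the paper, conjectures are program indices and the predicates range over infinite sequences. All names and the example data are illustrative.

def consistent(p, g):
    # Cons on a prefix: the n-th conjecture agrees with g on all x < n.
    return all(p[n].get(x) == g[x] for n in range(len(p)) for x in range(n))

def conservative(p, g):
    # Conv on a prefix: a mind change at n is only allowed if the n-th conjecture
    # was already wrong (or undefined) on some argument x < n + 1.
    for n in range(len(p) - 1):
        if p[n] != p[n + 1] and all(p[n].get(x) == g[x] for x in range(n + 1)):
            return False
    return True

g = [1, 4, 9, 16]                                                      # known prefix of the learnee
p = [{}, {0: 1}, {0: 1, 1: 4}, {x: (x + 1) ** 2 for x in range(4)}]    # conjecture sequence
print(consistent(p, g), conservative(p, g))                            # True True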
An example of a well-studied learning criterion is RGConsEx, requiring convergence of the learner to a correct conjecture, as well as consistent conjectures along the way.
Furthermore, we are interested in a number of restrictions which disallow certain kinds of returning to abandoned conjectures. We say that a learner exhibits a U-shape when it first outputs a correct conjecture, abandons this, and then returns to a correct conjecture. We distinguish between syntactic U-shapes (returning to the syntactically same conjecture), semantic U-shapes (returning to the semantically same conjecture, after semantically abandoning it; note that we drop the qualifier "semantic" in this case) and strong U-shapes (outputting a semantically same conjecture after syntactically abandoning it; this is called strong, because it leads to the stronger restriction). Forbidding these kinds of U-shapes leads

⁴ We call ψ a programming system iff, for all p, ψ_p is a computable function, and the function mapping any p and x to ψ_p(x) is also (partial) computable.
