# Learning Grammars and Automata with Queries

01 Jan 2016, pp. 47-71

TL;DR: By controlling better the information to which one has access, this setting provides a better understanding of the hardness of learning tasks, and allows us to solve practical learning situations, for which new algorithms are needed.

Abstract: When learning languages or grammars, an attractive alternative to using a large corpus is to learn by interacting with the environment. This can allow us to deal with situations where data is scarce or expensive, but testing or experimenting is possible. The situation, which arises in a number of fields, is formalised in a setting called active learning or query learning. By controlling better the information to which one has access, this setting provides us with a better understanding of the hardness of learning tasks. But the setting also allows us to solve practical learning situations, for which new algorithms are needed.

##### Citations


01 Jan 2007

TL;DR: An interactive learning algorithm for monadic queries defined by pruning NSTTs is implemented, which satisfies a new formal active learning model in the style of Angluin (1987), and is integrated into a visually interactive Web information extraction system by plugging it into the Mozilla Web browser.

Abstract: We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs, a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987). We have implemented our interactive learning algorithm and integrated it into a visually interactive Web information extraction system, called SQUIRREL, by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.

7 citations


TL;DR: In this paper, the authors propose Block based DFA Learning through Inverse Query (BDLIQ), which uses a block based delta inverse strategy built on the inverse queries that John Hopcroft introduced for state minimization of a DFA.

Abstract: A resurgent interest in grammatical inference, also known as automaton learning, has emerged in several areas of computer science such as machine learning, software engineering, robotics and the internet of things. An automaton learning algorithm commonly uses queries to learn the regular grammar of a Deterministic Finite Automaton (DFA). These queries are posed to a Minimally Adequate Teacher (MAT) by the learner (the learning algorithm). The MAT can typically answer the membership and equivalence queries that the learning algorithm poses. The three main categories of learning algorithms are incremental, sequential, and complete learning algorithms. In the presence of a MAT, the time complexity of existing DFA learning algorithms is polynomial. Therefore, in some applications these algorithms may fail to learn the system. In this study, we have reduced the time complexity of DFA learning from polynomial to logarithmic form. For this, we propose an efficient complete DFA learning algorithm, Block based DFA Learning through Inverse Query (BDLIQ), which uses a block based delta inverse strategy built on the inverse queries that John Hopcroft introduced for state minimization of a DFA. The BDLIQ algorithm has $O(\vert \Sigma \vert N \log N)$ complexity when a MAT is available. The MAT is also made capable of responding to inverse queries. We provide theoretical and empirical analysis of the proposed algorithm. Results show that the BDLIQ algorithm, our suggested approach for complete learning, is more efficient than the ID algorithm in terms of time complexity.


TL;DR: In this article, an efficient incremental learning algorithm for deterministic finite automata with the help of inverse queries (IQs) and membership queries (MQs) is presented, which is an extension of the Identification of Regular Languages (ID) algorithm from a complete to an incremental learning setup.

Abstract: We present an efficient incremental learning algorithm for a Deterministic Finite Automaton (DFA) with the help of inverse queries (IQs) and membership queries (MQs). This algorithm is an extension of the Identification of Regular Languages (ID) algorithm from a complete to an incremental learning setup. The learning algorithm learns by making use of a set of labeled examples and by posing queries to a knowledgeable teacher, which is equipped to answer IQs along with MQs and equivalence queries. Based on the examples (elements of the live complete set) and responses to IQs from the minimally adequate teacher (MAT), the learning algorithm constructs the hypothesis automaton, consistent with all observed examples. The Incremental DFA Learning algorithm through Inverse Queries (IDLIQ) has O(|Σ|N+|Pc||F|) time complexity in the presence of a MAT and ensures convergence to a minimal representation of the target DFA with a finite number of labeled examples. Existing incremental learning algorithms, namely Incremental ID and Incremental Distinguishing Strings, have polynomial (cubic) time complexity in the presence of a MAT. Therefore, these algorithms sometimes fail to learn large, complex software systems. In this research work, we have reduced the complexity of DFA learning in an incremental setup from cubic to quadratic. Finally, we prove the correctness and termination of the IDLIQ algorithm.

##### References


05 Nov 1984

TL;DR: This paper regards learning as the phenomenon of knowledge acquisition in the absence of explicit programming, and gives a precise methodology for studying this phenomenon from a computational viewpoint.

Abstract: Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this paper we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learnt using it in a reasonable (polynomial) number of steps. We find that inherent algorithmic complexity appears to set serious limits to the range of concepts that can be so learnt. The methodology and results suggest concrete principles for designing realistic learning systems.

5,311 citations


TL;DR: It was found that the class of context-sensitive languages is learnable from an informant, but that not even the class of regular languages is learnable from a text.

Abstract: Language learnability has been investigated. This refers to the following situation: A class of possible languages is specified, together with a method of presenting information to the learner about an unknown language, which is to be chosen from the class. The question is now asked, “Is the information sufficient to determine which of the possible languages is the unknown language?” Many definitions of learnability are possible, but only the following is considered here: Time is quantized and has a finite starting time. At each time the learner receives a unit of information and is to make a guess as to the identity of the unknown language on the basis of the information received so far. This process continues forever. The class of languages will be considered learnable with respect to the specified method of information presentation if there is an algorithm that the learner can use to make his guesses, the algorithm having the following property: Given any language of the class, there is some finite time after which the guesses will all be the same and they will be correct. In this preliminary investigation, a language is taken to be a set of strings on some finite alphabet. The alphabet is the same for all languages of the class. Several variations of each of the following two basic methods of information presentation are investigated: A text for a language generates the strings of the language in any order such that every string of the language occurs at least once. An informant for a language tells whether a string is in the language, and chooses the strings in some order such that every string occurs at least once. It was found that the class of context-sensitive languages is learnable from an informant, but that not even the class of regular languages is learnable from a text.

3,460 citations
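The learning model in the abstract above (one unit of information per time step, one guess per step, guesses that eventually stabilise) can be illustrated with identification by enumeration from an informant. The sketch below is a simplification under stated assumptions: the class is a small finite list of candidate languages given as Python predicates, and the names `learn_by_enumeration`, `candidates` and `informant` are hypothetical, chosen only for this example.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

# A language is modelled as a membership predicate on strings (an assumption
# made for illustration; the paper works with classes of formal languages).
Language = Callable[[str], bool]


def learn_by_enumeration(
    candidates: Sequence[Language],
    informant: Iterable[Tuple[str, bool]],
) -> List[Optional[int]]:
    """Return the learner's guess (an index into candidates) after each example.

    At every step the learner guesses the first candidate that agrees with all
    labelled strings seen so far, or None if no candidate is consistent.
    """
    seen: List[Tuple[str, bool]] = []
    guesses: List[Optional[int]] = []
    for string, label in informant:
        seen.append((string, label))
        guess = next(
            (
                index
                for index, language in enumerate(candidates)
                if all(language(s) == expected for s, expected in seen)
            ),
            None,
        )
        guesses.append(guess)
    return guesses


# Hypothetical class of three languages; the target is "even length".
candidates = [
    lambda w: True,               # every string
    lambda w: len(w) % 2 == 0,    # strings of even length
    lambda w: w.startswith("a"),  # strings starting with "a"
]
informant = [("", True), ("a", False), ("ab", True), ("b", False)]
guesses = learn_by_enumeration(candidates, informant)
```

In this run the guess changes once, when the string "a" is reported as a non-member, and then stays on the even-length language. This matches the requirement that after some finite time all guesses are the same and correct.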


TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.

Abstract: The string-to-string correction problem is to determine the distance between two strings as measured by the minimum cost sequence of “edit operations” needed to change the one string into the other. The edit operations investigated allow changing one symbol of a string into another single symbol, deleting one symbol from a string, or inserting a single symbol into a string. An algorithm is presented which solves this problem in time proportional to the product of the lengths of the two strings. Possible applications are to the problems of automatic spelling correction and determining the longest subsequence of characters common to two strings.

3,252 citations
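The dynamic-programming idea behind the string-to-string correction problem can be sketched as follows. This is a minimal illustration, assuming unit cost for every substitution, deletion and insertion (the abstract speaks of a minimum cost sequence without fixing the costs); the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-symbol edits turning source into target.

    Assumes unit costs. Runs in time proportional to
    len(source) * len(target), keeping only two rows of the table.
    """
    # previous[j] = distance between the empty prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, source_symbol in enumerate(source, start=1):
        # current[0] = cost of deleting the first i symbols of source
        current = [i]
        for j, target_symbol in enumerate(target, start=1):
            substitute = previous[j - 1] + (source_symbol != target_symbol)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3: two substitutions and one insertion.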


Yale University

TL;DR: In this article, the problem of identifying an unknown regular set from examples of its members and nonmembers is addressed, where the regular set is presented by a minimally adequate teacher, which can answer membership queries about the set and can also test a conjecture and indicate whether it is equal to the unknown set and provide a counterexample if not.

Abstract: The problem of identifying an unknown regular set from examples of its members and nonmembers is addressed. It is assumed that the regular set is presented by a minimally adequate Teacher, which can answer membership queries about the set and can also test a conjecture and indicate whether it is equal to the unknown set and provide a counterexample if not. (A counterexample is a string in the symmetric difference of the correct set and the conjectured set.) A learning algorithm L* is described that correctly learns any regular set from any minimally adequate Teacher in time polynomial in the number of states of the minimum dfa for the set and the maximum length of any counterexample provided by the Teacher. It is shown that in a stochastic setting the ability of the Teacher to test conjectures may be replaced by a random sampling oracle, EX( ). A polynomial-time learning algorithm is shown for a particular problem of context-free language identification.

2,157 citations
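The two query types answered by a minimally adequate Teacher can be sketched as a small interface. This shows only the Teacher side, not the L* learner itself, and it rests on assumptions made for illustration: the target is given as a complete DFA, the hypothesis uses the same alphabet, and the class and method names (`DFA`, `Teacher`, `membership`, `equivalence`) are hypothetical. The equivalence check explores the product of the two automata breadth-first, so any counterexample it returns is a shortest string in the symmetric difference.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass
class DFA:
    alphabet: Tuple[str, ...]
    start: int
    accepting: FrozenSet[int]
    delta: Dict[Tuple[int, str], int]  # total transition function

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


class Teacher:
    """Answers membership and equivalence queries about a fixed target DFA."""

    def __init__(self, target: DFA) -> None:
        self.target = target

    def membership(self, word: str) -> bool:
        return self.target.accepts(word)

    def equivalence(self, hypothesis: DFA) -> Optional[str]:
        """Return None if the languages are equal, else a counterexample."""
        start = (self.target.start, hypothesis.start)
        queue = deque([(start, "")])
        visited = {start}
        while queue:
            (p, q), word = queue.popleft()
            # The two automata disagree on word: it is a counterexample.
            if (p in self.target.accepting) != (q in hypothesis.accepting):
                return word
            for symbol in self.target.alphabet:
                successor = (
                    self.target.delta[(p, symbol)],
                    hypothesis.delta[(q, symbol)],
                )
                if successor not in visited:
                    visited.add(successor)
                    queue.append((successor, word + symbol))
        return None


# Target: strings over {a, b} with an even number of a's.
even_as = DFA(
    alphabet=("a", "b"),
    start=0,
    accepting=frozenset({0}),
    delta={(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
)
# Conjecture: accept every string.
accept_all = DFA(
    alphabet=("a", "b"),
    start=0,
    accepting=frozenset({0}),
    delta={(0, "a"): 0, (0, "b"): 0},
)
teacher = Teacher(even_as)
counterexample = teacher.equivalence(accept_all)
```

Here the conjecture "accept everything" is rejected with the counterexample "a", which the target does not contain. A learner in the style described in the abstract would use such counterexamples, together with membership queries, to refine its next conjecture.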