Author

Huaiyu Zhu

Other affiliations: Aston University, Santa Fe Institute
Bio: Huaiyu Zhu is an academic researcher from IBM. The author has contributed to research on topics including information extraction and Bayesian probability. The author has an h-index of 18 and has co-authored 56 publications receiving 1,242 citations. Previous affiliations of Huaiyu Zhu include Aston University and the Santa Fe Institute.


Papers
Journal ArticleDOI
20 Mar 2009
TL;DR: The extraction algebra is described, and the effectiveness of the optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks is demonstrated.
Abstract: As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE) -- the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information extraction tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders of magnitude reduction in the running time of complex extraction tasks.

160 citations
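
For readers unfamiliar with the algebraic view of extraction described above, the toy sketch below shows the general flavor: extraction operators that produce spans over text, and a join operator that combines spans by proximity. The Span class, the operator names, and the example document are assumptions made purely for illustration; they are not SystemT's actual algebra or AQL syntax.

```python
import re
from dataclasses import dataclass

# Minimal span data model: a piece of matched text plus its character offsets.
@dataclass(frozen=True)
class Span:
    begin: int
    end: int
    text: str

def regex_extract(doc: str, pattern: str) -> list[Span]:
    """Extraction operator: produce one span per regex match."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]

def follows_join(left: list[Span], right: list[Span], max_gap: int) -> list[tuple[Span, Span]]:
    """Join operator: pair spans where a right span starts within max_gap chars after a left span ends."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]

doc = "Please reach Jane Doe at jane.doe@example.com for details."
names = regex_extract(doc, r"[A-Z][a-z]+ [A-Z][a-z]+")
emails = regex_extract(doc, r"\b[\w.]+@[\w.]+\b")
print(follows_join(names, emails, max_gap=5))
```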

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This work proposes an algebraic approach to rule-based IE that addresses scalability issues through query optimization, presents the operators of this algebra, and proposes several optimization strategies motivated by the text-specific characteristics of those operators.
Abstract: Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach by extensive experiments over real-world blog data.

129 citations
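
The abstract above mentions text-specific optimization strategies without detailing them. The sketch below illustrates one plausible strategy of that general kind: restricting an expensive regular expression to small windows around hits of a cheap dictionary operator rather than scanning the whole document. The patterns, trigger words, window size, and cost intuition are illustrative assumptions, not the paper's actual optimizations.

```python
import re

# Expensive operator: a phone-number regex. Cheap operator: a dictionary of trigger words.
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")
TRIGGERS = ("call", "phone", "tel")

def extract_phones_naive(doc: str):
    """Scan the entire document with the expensive regex."""
    return [m.group() for m in PHONE.finditer(doc)]

def extract_phones_restricted(doc: str, window: int = 40):
    """Run the regex only in a small window after each cheap dictionary hit."""
    hits = []
    lowered = doc.lower()
    for trigger in TRIGGERS:
        start = 0
        while (i := lowered.find(trigger, start)) != -1:
            region = doc[i:i + window]
            hits.extend(m.group() for m in PHONE.finditer(region))
            start = i + 1
    return sorted(set(hits))

doc = "For support, call (408) 555-0143 or email support@example.com."
assert extract_phones_naive(doc) == extract_phones_restricted(doc)
```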

03 Jul 1997
TL;DR: The problem of regression under Gaussian assumptions is treated in this paper, where the relationship between Bayesian prediction, regularization and smoothing is elucidated; the ideal regression is the posterior mean, whose computation scales as O(n^3).
Abstract: The problem of regression under Gaussian assumptions is treated generally. The relationship between Bayesian prediction, regularization and smoothing is elucidated. The ideal regression is the posterior mean and its computation scales as O(n^3), where n is the sample size. We show that the optimal m-dimensional linear model under a given prior is spanned by the first m eigenfunctions of a covariance operator, which is a trace-class operator. This is an infinite-dimensional analogue of principal component analysis. The importance of Hilbert space methods to practical statistics is also discussed.

114 citations
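
The posterior-mean computation mentioned in the abstract can be written down directly. The sketch below is a minimal Gaussian-process regression example, assuming a squared-exponential kernel and a fixed noise variance chosen only for illustration; the n x n linear solve is the O(n^3) step referred to above.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean of GP regression; the n x n solve below is the O(n^3) step."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    return K_star @ np.linalg.solve(K, y_train)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)
x_new = np.linspace(-3, 3, 5)
print(gp_posterior_mean(x, y, x_new))
```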

Proceedings ArticleDOI
27 Jun 2006
TL;DR: In this demonstration, the overall architecture of the Avatar Semantic Search engine is described, and the superiority of the AVATAR approach over traditional keyword search engines is demonstrated using the Enron email data set and a blog corpus.
Abstract: We present Avatar Semantic Search, a prototype search engine that exploits annotations in the context of classical keyword search. The annotation process is performed offline, using high-precision information extraction techniques to extract facts, concepts, and relationships from text. These facts and concepts are represented and indexed in a structured data store. At runtime, keyword queries are interpreted in the context of these extracted facts and converted into one or more precise queries over the structured store. In this demonstration we describe the overall architecture of the Avatar Semantic Search engine. We also demonstrate the superiority of the AVATAR approach over traditional keyword search engines using the Enron email data set and a blog corpus.

90 citations
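
As a rough illustration of the offline-annotation / online-interpretation pipeline described above, the sketch below extracts simple email facts into a structured store and rewrites a keyword query into a precise lookup over that store. The fact schema, the "email" rewrite rule, and the two-document corpus are invented for illustration; they are not Avatar's actual design.

```python
import re

EMAIL = re.compile(r"\b[\w.]+@[\w.]+\b")

corpus = {
    1: "Reach Alice at alice@example.org.",
    2: "Bob's address is bob@example.org.",
}

# Offline step: run extraction over each document and index the facts in a
# structured store (a plain list of dicts standing in for a database table).
store = [{"doc": doc_id, "type": "email", "value": m.group()}
         for doc_id, text in corpus.items()
         for m in EMAIL.finditer(text)]

def interpret(query: str):
    """Runtime step: rewrite a keyword query into a precise lookup over the store.
    A query mentioning 'email' becomes a selection on email facts, filtered by the
    remaining keywords against the source documents."""
    terms = [t for t in query.lower().split() if t != "email"]
    return [fact for fact in store
            if fact["type"] == "email"
            and all(t in corpus[fact["doc"]].lower() for t in terms)]

print(interpret("alice email"))  # -> [{'doc': 1, 'type': 'email', 'value': 'alice@example.org'}]
```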

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper presents a two-stage method that enables the construction of SRL models for resource-poor languages by exploiting monolingual SRL and multilingual parallel data, and shows that this method outperforms existing methods.
Abstract: Semantic role labeling (SRL) is crucial to natural language understanding as it identifies the predicate-argument structure in text with semantic labels. Unfortunately, resources required to construct SRL models are expensive to obtain and simply do not exist for most languages. In this paper, we present a two-stage method to enable the construction of SRL models for resource-poor languages by exploiting monolingual SRL and multilingual parallel data. Experimental results show that our method outperforms existing methods. We use our method to generate Proposition Banks of high to reasonable quality for seven languages in three language families, and we release these resources to the research community.

88 citations
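
The method above builds on transferring semantic-role labels through multilingual parallel data. The toy sketch below shows only the basic projection step via word alignments, with made-up sentences, labels, and alignment pairs; the paper's two-stage method includes filtering and completion steps that are not reproduced here.

```python
# Toy annotation projection: copy semantic-role labels from a labeled (source)
# sentence to an unlabeled (target) sentence through word alignments.
src_tokens = ["Mary", "sold", "the", "house"]
src_labels = ["A0", "V", "O", "A1"]           # predicate and its arguments
tgt_tokens = ["Maria", "verkaufte", "das", "Haus"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]  # (source index, target index)

def project(src_labels, tgt_len, alignment):
    """Assign each aligned target token the label of its source counterpart."""
    tgt_labels = ["O"] * tgt_len
    for s, t in alignment:
        tgt_labels[t] = src_labels[s]
    return tgt_labels

print(list(zip(tgt_tokens, project(src_labels, len(tgt_tokens), alignment))))
```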


Cited by
Book
23 Nov 2005
TL;DR: The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics, and deals with the supervised learning problem for both regression and classification.
Abstract: Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine-learning community over the past decade, and this book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning. The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics. The book deals with the supervised-learning problem for both regression and classification, and includes detailed algorithms. A wide variety of covariance (kernel) functions are presented and their properties discussed. Model selection is discussed from both a Bayesian and a classical perspective. Many connections to other well-known techniques from machine learning and statistics are discussed, including support-vector machines, neural networks, splines, regularization networks, relevance vector machines and others. Theoretical issues including learning curves and the PAC-Bayesian framework are treated, and several approximation methods for learning with large datasets are discussed. The book contains illustrative examples and exercises, and code and datasets are available on the Web. Appendixes provide mathematical background and a discussion of Gaussian Markov processes.

11,357 citations
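
Among the topics listed above, model selection via the marginal likelihood lends itself to a compact illustration. The sketch below scores a few length scales of a squared-exponential kernel by the standard zero-mean GP log marginal likelihood; the data, kernel choice, and noise level are assumptions made only for illustration.

```python
import numpy as np

def rbf(a, b, ell):
    """Squared-exponential kernel with length scale ell."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def log_marginal_likelihood(x, y, ell, noise_var=0.1):
    """log p(y | X) for a zero-mean GP: -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi)."""
    K = rbf(x, x, ell) + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet - 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)
for ell in (0.1, 1.0, 10.0):
    print(ell, log_marginal_likelihood(x, y, ell))   # pick the length scale with the highest score
```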

Journal Article
TL;DR: In this paper, the authors describe the EM algorithm for finding the parameters of a mixture of Gaussian densities and a hidden Markov model (HMM) for both discrete and Gaussian mixture observation models.
Abstract: We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used for its solution. We first describe the abstract form of the EM algorithm as it is often given in the literature. We then develop the EM parameter estimation procedure for two applications: 1) finding the parameters of a mixture of Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM) (i.e., the Baum-Welch algorithm) for both discrete and Gaussian mixture observation models. We derive the update equations in fairly explicit detail, but we do not prove any convergence properties. We try to emphasize intuition rather than mathematical rigor.

2,455 citations
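
The Gaussian-mixture case described above can be illustrated compactly. The sketch below runs EM for a one-dimensional two-component mixture, alternating responsibility computation (E-step) and weighted parameter updates (M-step); the initialization, iteration count, and synthetic data are illustrative choices, not the tutorial's own code.

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # initialize means at random data points
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, and mixing weights from soft assignments.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
print(em_gmm_1d(data))
```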

Journal ArticleDOI
TL;DR: It is shown that one cannot say: if the empirical misclassification rate is low, the Vapnik-Chervonenkis dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small.
Abstract: This is the first of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. This first paper discusses the senses in which there are no a priori distinctions between learning algorithms. (The second paper discusses the senses in which there are such distinctions.) In this first paper it is shown, loosely speaking, that for any two algorithms A and B, there are “as many” targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is “anti-cross-validation” (choose the learning algorithm with largest cross-validation error). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low, the Vapnik-Chervonenkis dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. Other implications for “membership queries” algorithms and “punting” algorithms are also discussed.

1,371 citations
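
The averaging argument behind the result above can be illustrated with a tiny enumeration: averaged uniformly over all boolean targets on a handful of off-training-set points, any two fixed predictors incur the same expected zero-one OTS error. The setup below is a deliberately simplified illustration, not the paper's formal framework.

```python
from itertools import product

# Two fixed predictors and three off-training-set points.
ots_points = 3
always_zero = lambda _: 0
always_one = lambda _: 1

def avg_ots_error(predictor):
    """Average zero-one OTS error over every possible boolean target on the OTS points."""
    errs = []
    for target in product([0, 1], repeat=ots_points):   # all 2^3 targets
        errs.append(sum(predictor(i) != target[i] for i in range(ots_points)) / ots_points)
    return sum(errs) / len(errs)

print(avg_ots_error(always_zero), avg_ots_error(always_one))  # both 0.5
```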

Journal ArticleDOI
29 Nov 1999
TL;DR: This work performs a theoretical investigation of the variance of a variant of the cross-validation estimator of the generalization error that takes into account the variability due to the randomness of the training set as well as the test examples, and proposes new estimators of this variance.
Abstract: In order to compare learning algorithms, experimental results reported in the machine learning literature often use statistical tests of significance to support the claim that a new learning algorithm generalizes better. Such tests should take into account the variability due to the choice of training set and not only that due to the test examples, as is often done. Ignoring the training-set variability could lead to gross underestimation of the variance of the cross-validation estimator, and to the wrong conclusion that the new algorithm is significantly better when it is not. We perform a theoretical investigation of the variance of a variant of the cross-validation estimator of the generalization error that takes into account the variability due to the randomness of the training set as well as the test examples. Our analysis shows that all variance estimators based only on the results of the cross-validation experiment must be biased. This analysis allows us to propose new estimators of this variance. We show, via simulations, that hypothesis tests about the generalization error using those new variance estimators have better properties than tests involving variance estimators currently in use and listed in Dietterich (1998). In particular, the new tests have correct size and good power. That is, the new tests do not reject the null hypothesis too often when the hypothesis is true, but they tend to frequently reject the null hypothesis when it is false.

925 citations
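
One widely used correction of the kind discussed above is the variance inflation from Nadeau and Bengio's corrected resampled t-test; whether it matches the estimators proposed in this particular paper is an assumption here. The sketch below contrasts the naive variance of the mean of resampled test errors (which treats the resamples as independent) with that corrected form, which accounts for overlapping training sets.

```python
import numpy as np

def naive_and_corrected_variance(errors, n_train, n_test):
    """Naive variance of the mean of J resampled errors vs. the
    corrected form (1/J + n_test/n_train) * s^2."""
    errors = np.asarray(errors, dtype=float)
    J = len(errors)
    s2 = errors.var(ddof=1)            # sample variance of the J error estimates
    naive = s2 / J                     # assumes the J estimates are independent
    corrected = (1.0 / J + n_test / n_train) * s2
    return naive, corrected

fold_errors = [0.12, 0.15, 0.11, 0.18, 0.14]   # hypothetical errors from 5 resamples
print(naive_and_corrected_variance(fold_errors, n_train=800, n_test=200))
```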