Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

Machine learning

We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!.

When is nearest neighbor meaningful

Formally we have arrived at the middle of the book. So you may need a pause for recovering, a pause which we want to fill up by some fixed point theorems supplementing those which you already met or which you will meet in later chapters. The first group of results centres around Banach’s fixed point theorem. The latter is certainly a nice result since it contains only one simple condition on the map F, since it is so easy to prove and since it nevertheless allows a variety of applications. Therefore it is not astonishing that many mathematicians have been attracted by the question to which extent the conditions on F and the space Ω can be changed so that one still gets the existence of a unique or of at least one fixed point. The number of results produced this way is still finite, but of a statistical magnitude, suggesting at a first glance that only a random sample can be covered by a chapter or even a book of the present size. Fortunately (or unfortunately?) most of the modifications have not found applications up to now, so that there is no reason to write a cookery book about conditions but to write at least a short outline of some ideas indicating that this field can be as interesting as other chapters. A systematic account of more recent ideas and examples in fixed point theory should however be written by one of the true experts. Strange as it is, such a book does not seem to exist though so many people are puzzling out so many results.

Fixed Point Theory

The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

/pdf/entity-linking-with-a-knowledge-base-issues-techniques-and-2ds9mn4o8g.pdf

Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions

Introduction. 1. Mathematical Preliminaries. Submodular Systems and Base Polyhedra. 2. From Matroids to Submodular Systems. Matroids. Polymatroids. Submodular Systems. 3. Submodular Systems and Base Polyhedra. Fundamental Operations on Submodular Systems. Greedy Algorithm. Structures of Base Polyhedra. Intersecting- and Crossing-Submodular Functions. Related Polyhedra. Submodular Systems of Network Type. Neoflows. 4. The Intersection Problem. The Intersection Theorem. The Discrete Separation Theorem. The Common Base Problem. 5. Neoflows. The Equivalence of the Neoflow Problems. Feasibility for Submodular Flows. Optimality for Submodular Flows. Algorithms for Neoflows. Matroid Optimization. Submodular Analysis. 6. Submodular Functions and Convexity. Conjugate Functions and a Fenchel-Type Min-Max Theorem for Submodular and Supermodular Functions. Subgradients of Submodular Functions. The Lovasz Extensions of Submodular Functions. 7. Submodular Programs. Submodular Programs - Unconstrained Optimization. Submodular Programs - Constrained Optimization. Nonlinear Optimization with Submodular Constraints. 8. Separable Convex Optimization. Optimality Conditions. A Decomposition Algorithm. Discrete Optimization. 9. The Lexicographically Optimal Base Problem. Nonlinear Weight Functions. Linear Weight Functions. 10. The Weighted Max-Min and Min-Max Problems. Continuous Variables. Discrete Variables. 11. The Fair Resource Allocation Problem. Continuous Variables. Discrete Variables. 12. The Neoflow Problem with a Separable Convex Cost Function. References. Index.

Submodular functions and optimization

We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the ``exclusive'' tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.

/pdf/concept-expansion-using-web-tables-2eug7p2wbj.pdf

Concept Expansion Using Web Tables

Keyword search over entity databases (e.g., product, movie databases) is an important problem. Current techniques for keyword search on databases may often return incomplete and imprecise results. On the one hand, they either require that relevant entities contain all (or most) of the query keywords, or that relevant entities and the query keywords occur together in several documents from a known collection. Neither of these requirements may be satisfied for a number of user queries. Hence results for such queries are likely to be incomplete in that highly relevant entities may not be returned. On the other hand, although some returned entities contain all (or most) of the query keywords, the intention of the keywords in the query could be different from that in the entities. Therefore, the results could also be imprecise.To remedy this problem, in this paper, we propose a general framework that can improve an existing search interface by translating a keyword query to a structured query. Specifically, we leverage the keyword to attribute value associations discovered in the results returned by the original search interface. We show empirically that the translated structured queries alleviate the above problems.

/pdf/keyword-a-framework-to-improve-keyword-search-over-entity-1ls19aqquy.pdf

Keyword++: a framework to improve keyword search over entity databases

Traditional equi-join relies solely on string equality comparisons to perform joins However, in scenarios such as ad-hoc data analysis in spreadsheets, users increasingly need to join tables whose join-columns are from the same semantic domain but use different textual representations, for which transformations are needed before equi-join can be performed We developed Auto-Join, a system that can automatically search over a rich space of operators to compose a transformation program, whose execution makes input tables equi-join-able We developed an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently, while ensuring joins succeed with high probability Our evaluation using real test cases collected from both public web tables and proprietary enterprise tables shows that the proposed system performs the desired transformation joins efficiently and with high quality

/pdf/auto-join-joining-tables-by-leveraging-transformations-1erv3tch19.pdf

Auto-join: joining tables by leveraging transformations

In comparison to the extensive body of existing work considering publish-once, static anonymization, dynamic anonymization is less well studied. Previous work, most notably m-invariance, has made considerable progress in devising a scheme that attempts to prevent individual records from being associated with too few sensitive values. We show, however, that in the presence of updates, even an m-invariant table can be exploited by a new type of attack we call the “equivalence-attack.” To deal with the equivalence attack, we propose a graph-based anonymization algorithm that leverages solutions to the classic “min-cut/max-flow” problem, and demonstrate with experiments that our algorithm is efficient and effective in preventing equivalence attacks.

/pdf/preventing-equivalence-attacks-in-updated-anonymized-data-4htswrryk9.pdf

Preventing equivalence attacks in updated, anonymized data

Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.)holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists would manipulate data, which are documented by data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially "logging" how data scientists interact with diverse tables, and using the resulting logs as a proxy of "ground truth", we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the "collective wisdom" of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy.

Yeye He

Papers

Concept Expansion Using Web Tables

Keyword++: a framework to improve keyword search over entity databases

Auto-join: joining tables by leveraging transformations

Preventing equivalence attacks in updated, anonymized data

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks