Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

Pattern Recognition and Machine Learning

Examples are discussed to show the differences among discriminant analysis, logistic regression, and multiple regression. Chapter 6, “Multivariate Analysis of Variance,” presents advantages of multivariate analysis of variance (MANOVA) over univariate analysis of variance (ANOVA), discusses assumptions of MANOVA, and assesses validations of MANOVA assumptions and model estimation. The authors also discuss post hoc tests of MANOVA and multivariate analysis of covariance. Chapter 7, “Conjoint Analysis,” explains what conjoint analysis does and how it is different from other multivariate techniques. Guidelines of selecting attributes, models, and methods of data collection are presented. Chapter 8, “Cluster Analysis,” studies objectives, roles, and limitations of cluster analysis. Two basic concepts: similarity and distance are discussed. The authors also discuss details of five most popular hierarchical algorithms (singlelinkage, complete-linkage, average-linkage, centroid method, Ward’s method) and three nonhierarchical algorithms (the sequential threshold method, the parallel threshold method, and the optimizing procedure). Profiles of clusters and guidelines for cluster validation are studied as well. Chapter 9, “Multidimensional Scaling and Correspondence Analysis,” introduces two interdependence techniques to display the relationships in the data. The book describes clearly and intuitively the differences between the two techniques and how these two techniques are performed. Chapters 10–12 cover topics in SEM. Chapter 10, “Structural Equation Modeling: An Introduction,” introduces SEM and related concepts such as exogenous, endogenous constructs, and so on, points out the differences between SEM and other multivariate techniques, overviews the decision process of SEM. Chapter 11, “Confirmatory Factor Analysis,” explains the differences between exploratory and confirmatory factor analysis, discusses how to construct, validate, and assess the goodness of fit of a measurement model in SEM by confirmatory factor analysis. Chapter 12, “Testing a Structural Model,” presents some methods of SEM in examining the relationships between latent constructs. The book is an excellent book for people in management and marketing. For the Technometrics audience, this book does not have much flavor of physical, chemical, and engineering sciences. For example, partial least squares, a very popular method in Chemometrics, is discussed but not as detailed as other techniques in the book. Furthermore, due to the amount of materials covered in the book, it might be inappropriate for someone who is new to multivariate analysis.

The Grammar of Graphics

End-User Development: An Emerging Paradigm- Psychological Issues in End-User Programming- More Natural Programming Languages and Environments- What Makes End-User Development Tick? 13 Design Guidelines- An Integrated Software Engineering Approach for End-User Programmers- Component-Based Approaches to Tailorable Systems- Natural Development of Nomadic Interfaces Based on Conceptual Descriptions- End User Development of Web Applications- End-User Development: The Software Shaping Workshop Approach- Participatory Programming: Developing Programmable Bioinformatics Tools for End-Users- Challenges for End-User Development in Intelligent Environments- Fuzzy Rewriting- Breaking It Up: An Industrial Case Study of Component-Based Tailorable Software Design- End-User Development as Adaptive Maintenance- Supporting Collaborative Tailoring- EUD as Integration of Components Off-The-Shelf: The Role of Software Professionals Knowledge Artifacts- Organizational View of End-User Development- A Semiotic Framing for End-User Development- Meta-design: A Framework for the Future of End-User Development- Feasibility Studies for Programming in Natural Language- Future Perspectives in End-User Development

End user development

Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.

Data Cleaning: Overview and Emerging Challenges

We investigate how to automatically recover visual encodings from a chart image, primarily using inferred text elements. We contribute an end-to-end pipeline which takes a bitmap image as input and...

Reverse‐Engineering Visualizations: Recovering Visual Encodings from Chart Images

Spreadsheets contain a huge amount of high-value data but do not observe a standard data model and thus are difficult to integrate. A large number of data integration tools exist, but they generally can only work on relational data. Existing systems for extracting relational data from spreadsheets are too labor intensive to support ad-hoc integration tasks, in which the correct extraction target is only learned during the course of user interaction.This paper introduces a system that automatically extracts relational data from spreadsheets, thereby enabling relational spreadsheet integration. The resulting integrated relational data can be queried directly or can be translated into RDF triples. When compared to standard techniques for spreadsheet data extraction on a set of 100 random Web spreadsheets, the system reduces the amount of human labor by 72% to 92%. In addition to the system design, we present the results of a general survey of more than 400,000 spreadsheets we downloaded from the Web, giving a novel view of how users organize their data in spreadsheets.

/pdf/automatic-web-spreadsheet-data-extraction-2fpodtpg8k.pdf

Automatic web spreadsheet data extraction

Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain the accurate extraction with just 31% of the manual operations required by a standard classification based technique on two real-world datasets.

/pdf/integrating-spreadsheet-data-via-accurate-and-low-effort-117hf89rmu.pdf

Integrating spreadsheet data via accurate and low-effort extraction

A key challenge of entity set expansion is that multifaceted input seeds can lead to significant incoherence in the result set. In this paper, we present a novel solution to handling multifaceted seeds by combining existing user-generated ontologies with a novel word-similarity metric based on skip-grams. By blending the two resources we are able to produce sparse word ego-networks that are centered on the seed terms and are able to capture semantic equivalence among words. We demonstrate that the resulting networks possess internally-coherent clusters, which can be exploited to provide non-overlapping expansions, in order to reflect different semantic classes of the seeds. Empirical evaluation against state-of-the-art baselines shows that our solution, EgoSet, is able to not only capture multiple facets in the input query, but also generate expansions for each facet with higher precision.

/pdf/egoset-exploiting-word-ego-networks-and-user-generated-4528l5szuw.pdf

EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion

A dictionary --- a set of instances belonging to the same conceptual class --- is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail is able to output a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtained a 17.3% improvement on mean average precision for the dictionary generation process, and a 30.7% improvement on F1 for the page-specific extraction, when compared to previous state-of-the-art methods.

/pdf/long-tail-vocabulary-dictionary-extraction-from-the-web-53pkplt2fj.pdf

Long-tail Vocabulary Dictionary Extraction from the Web

Spreadsheets have become a critical data management tool, but they lack explicit relational metadata, making it difficult to join or integrate data across multiple spreadsheets. Because spreadsheet data are widely available on a huge range of topics, a tool that allows easy spreadsheet integration would be hugely beneficial for a variety of users.We demonstrate that Senbazuru, a prototype spreadsheet database management system (SSDBMS), is able to extract relational information from spreadsheets. By doing so, it opens up opportunities for integration among spreadsheets and with other relational sources. Senbazuru allows users to search for relevant spreadsheets in a large corpus, probabilistically constructs a relational version of the data, and offers several relational operations over the resulting extracted data (including joins to other spreadsheet data). Our demonstration is available on two clients: a JavaScript-rich Web site and a touch interface on the iPad. During the demo, Senbazuru will allow VLDB participants to search spreadsheets, extract relational data from them, and apply relational operators such as select and join.

/pdf/senbazuru-a-prototype-spreadsheet-database-management-system-4bh6p3qsce.pdf

Zhe Chen

Papers

Automatic web spreadsheet data extraction

Integrating spreadsheet data via accurate and low-effort extraction

EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion

Long-tail Vocabulary Dictionary Extraction from the Web

Senbazuru: a prototype spreadsheet database management system