Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
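The mail-filtering case described in this abstract can be illustrated with a minimal sketch. The abstract does not name a learning method; the naive-Bayes-style word counting below is an assumption chosen for brevity, and the class `MailFilter` and its methods are hypothetical names.

```python
import math
from collections import Counter


class MailFilter:
    """Toy per-user filter that learns from messages the user kept or rejected.

    Illustrative only: naive-Bayes-style scoring with add-one smoothing.
    """

    def __init__(self):
        self.counts = {"reject": Counter(), "keep": Counter()}
        self.totals = {"reject": 0, "keep": 0}

    def learn(self, text, label):
        # Record one labeled example: label is "reject" or "keep".
        self.counts[label].update(text.lower().split())
        self.totals[label] += 1

    def score(self, text, label):
        # Log prior plus the sum of smoothed per-word log-likelihoods.
        vocab = set(self.counts["reject"]) | set(self.counts["keep"])
        n_words = sum(self.counts[label].values())
        n_msgs = self.totals["reject"] + self.totals["keep"]
        logp = math.log((self.totals[label] + 1) / (n_msgs + 2))
        for word in text.lower().split():
            logp += math.log(
                (self.counts[label][word] + 1) / (n_words + len(vocab) + 1)
            )
        return logp

    def rejects(self, text):
        return self.score(text, "reject") > self.score(text, "keep")


mail_filter = MailFilter()
mail_filter.learn("win free money now", "reject")
mail_filter.learn("free prize claim now", "reject")
mail_filter.learn("meeting agenda for monday", "keep")
mail_filter.learn("lunch on monday", "keep")
print(mail_filter.rejects("free money prize"))
```

Because the filter only stores counts, each new message the user keeps or rejects updates the rules without any reprogramming, which is the point the abstract makes.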

Machine learning

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information is published on the Web for human and machine consumption. We describe some emerging applications from the DBpedia community and show how website authors can facilitate DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data.

/pdf/dbpedia-a-nucleus-for-a-web-of-open-data-64wla3p0zr.pdf

DBpedia: a nucleus for a web of open data

Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
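The idea of object versioning with application-assisted conflict resolution can be sketched as follows. The abstract does not spell out the mechanism; this sketch assumes version vectors (a dict from node name to counter), and the function names `compare` and `reconcile`, the node names, and the set-valued example data are all hypothetical.

```python
def compare(a, b):
    """Compare two version vectors given as dicts of node -> counter.

    Returns "equal", "before" (a is an ancestor of b), "after", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def reconcile(versions, merge):
    """Drop versions superseded by another; let the application merge the rest.

    versions: non-empty list of (clock, value) pairs.
    merge: application-supplied function taking a list of conflicting values.
    """
    survivors = []
    for clock, value in versions:
        if any(compare(clock, other) == "before" for other, _ in versions):
            continue  # an existing version already supersedes this one
        if any(compare(clock, kept) == "equal" for kept, _ in survivors):
            continue  # exact duplicate of a version we already kept
        survivors.append((clock, value))
    if len(survivors) == 1:
        return survivors[0]
    # Concurrent versions remain: take the element-wise maximum of the clocks
    # and hand the conflicting values to the application.
    merged_clock = {}
    for clock, _ in survivors:
        for node, counter in clock.items():
            merged_clock[node] = max(merged_clock.get(node, 0), counter)
    return merged_clock, merge([value for _, value in survivors])


v1 = ({"a": 1}, {"book"})
v2 = ({"a": 1, "b": 1}, {"book", "pen"})
v3 = ({"a": 1, "c": 1}, {"book", "ink"})
clock, value = reconcile([v1, v2, v3], lambda vals: set().union(*vals))
print(clock, sorted(value))
```

Here the storage layer only detects that `v2` and `v3` are concurrent; deciding that the right answer is a set union is left to the application, which is the trade-off the abstract describes.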

/pdf/dynamo-amazon-s-highly-available-key-value-store-2ozfk43ou4.pdf

Dynamo: amazon's highly available key-value store

In Distributed Algorithms, Nancy Lynch provides a blueprint for designing, implementing, and analyzing distributed algorithms. She directs her book at a wide audience, including students, programmers, system designers, and researchers.



Distributed Algorithms contains the most significant algorithms and impossibility results in the area, all in a simple automata-theoretic setting. The algorithms are proved correct, and their complexity is analyzed according to precisely defined complexity measures. The problems covered include resource allocation, communication, consensus among distributed processes, data consistency, deadlock detection, leader election, global snapshots, and many others.



The material is organized according to the system model: first by the timing model and then by the interprocess communication mechanism. The material on system models is isolated in separate chapters for easy reference.



The presentation is completely rigorous, yet is intuitive enough for immediate comprehension. This book familiarizes readers with important problems, algorithms, and impossibility results in the area: readers can then recognize the problems when they arise in practice, apply the algorithms to solve them, and use the impossibility results to determine whether problems are unsolvable. The book also provides readers with the basic mathematical tools for designing new algorithms and proving new impossibility results. In addition, it teaches readers how to reason carefully about distributed algorithms: to model them formally, devise precise specifications for their required behavior, prove their correctness, and evaluate their performance with realistic measures.


Table of Contents

1 Introduction 
2 Modelling I: Synchronous Network Model
3 Leader Election in a Synchronous Ring 
4 Algorithms in General Synchronous Networks 
5 Distributed Consensus with Link Failures 
6 Distributed Consensus with Process Failures 
7 More Consensus Problems 
8 Modelling II: Asynchronous System Model 
9 Modelling III: Asynchronous Shared Memory Model 
10 Mutual Exclusion 
11 Resource Allocation 
12 Consensus 
13 Atomic Objects 
14 Modelling IV: Asynchronous Network Model 
15 Basic Asynchronous Network Algorithms 
16 Synchronizers 
17 Shared Memory versus Networks 
18 Logical Time 
19 Global Snapshots and Stable Properties 
20 Network Resource Allocation 
21 Asynchronous Networks with Process Failures 
22 Data Link Protocols 
23 Partially Synchronous System Models 
24 Mutual Exclusion with Partial Synchrony 
25 Consensus with Partial Synchrony

Distributed algorithms

Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

/pdf/a-survey-of-approaches-to-automatic-schema-matching-4n2e5qtrpt.pdf

A survey of approaches to automatic schema matching

This book is an introduction to the design and implementation of concurrency control and recovery mechanisms for transaction management in centralized and distributed database systems. Concurrency control and recovery have become increasingly important as businesses rely more and more heavily on their on-line data processing activities. For high performance, the system must maximize concurrency by multiprogramming transactions. But this can lead to interference between queries and updates, which concurrency control mechanisms must avoid. In addition, a satisfactory recovery system is necessary to ensure that inevitable transaction and database system failures do not corrupt the database.

Concurrency Control and Recovery in Database Systems

Schema matching is a critical step in many applications, such as XML message mapping, data warehouse loading, and schema integration. In this paper, we investigate algorithms for generic schema matching, outside of any particular data model or application. We first present a taxonomy for past solutions, showing that a rich range of techniques is available. We then propose a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches. Some of our innovations are the integrated use of linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure where much of the schema content resides. After describing our algorithm, we present experimental results that compare Cupid to two other schema matching systems.
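To make the idea of matching on names and data types concrete, here is a toy element-level matcher. It is not the Cupid algorithm: it ignores constraints and schema structure, and the weighting, the threshold, the function names, and the example schemas are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    # Case-insensitive string similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def element_similarity(src, tgt, name_weight=0.7):
    # Weighted mix of name similarity and exact data-type agreement.
    name_score = name_similarity(src["name"], tgt["name"])
    type_score = 1.0 if src["type"] == tgt["type"] else 0.0
    return name_weight * name_score + (1 - name_weight) * type_score


def match(source, target, threshold=0.5):
    """Map each source element to its best-scoring target, if good enough."""
    mapping = {}
    for s in source:
        best = max(target, key=lambda t: element_similarity(s, t))
        if element_similarity(s, best) >= threshold:
            mapping[s["name"]] = best["name"]
    return mapping


source = [
    {"name": "CustName", "type": "string"},
    {"name": "OrderDate", "type": "date"},
]
target = [
    {"name": "CustomerName", "type": "string"},
    {"name": "DateOfOrder", "type": "date"},
    {"name": "Total", "type": "decimal"},
]
print(match(source, target))
```

A matcher like this looks at each element in isolation; the abstract's emphasis on structural matching and shared-type context is exactly what such a purely element-level approach lacks.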

/pdf/generic-schema-matching-with-cupid-1vttr9k8si.pdf

Generic Schema Matching with Cupid

In this paper we survey, consolidate, and present the state of the art in distributed database concurrency control. The heart of our analysis is a decomposition of the concurrency control problem into two major subproblems: read-write and write-write synchronization. We describe a series of synchronization techniques for solving each subproblem and show how to combine these techniques into algorithms for solving the entire concurrency control problem. Such algorithms are called "concurrency control methods." We describe 48 principal methods, including all practical algorithms that have appeared in the literature plus several new ones. We concentrate on the structure and correctness of concurrency control algorithms. Issues of performance are given only secondary treatment.
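The decomposition into read-write and write-write synchronization rests on distinguishing two kinds of conflicting operation pairs. The sketch below classifies a pair of operations accordingly; the tuple representation `(transaction, kind, item)` and the function name `classify_conflict` are assumptions for illustration, not notation taken from the paper.

```python
def classify_conflict(op1, op2):
    """Classify a pair of operations as "rw", "ww", or None (no conflict).

    Each operation is a tuple (transaction, kind, item) with kind "r" or "w".
    Two operations conflict only if they come from different transactions,
    touch the same data item, and at least one of them is a write.
    """
    txn1, kind1, item1 = op1
    txn2, kind2, item2 = op2
    if txn1 == txn2 or item1 != item2:
        return None
    if kind1 == "w" and kind2 == "w":
        return "ww"
    if "w" in (kind1, kind2):
        return "rw"
    return None  # two reads never conflict


print(classify_conflict(("T1", "r", "x"), ("T2", "w", "x")))
print(classify_conflict(("T1", "w", "x"), ("T2", "w", "x")))
```

A concurrency control method in the abstract's sense would pair one technique for ordering the "rw" conflicts with one for the "ww" conflicts.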

/pdf/concurrency-control-in-distributed-database-systems-21443t2pq0.pdf

Concurrency Control in Distributed Database Systems

The computing facilities of large-scale enterprises are evolving into a utility, much like power and telecommunications. In the vision of an information utility, each knowledge worker has a desktop appliance that connects to the utility. The desktop appliance is a computer or computer-like device, such as a terminal, personal computer, workstation, word processor, or stock trader’s station. The utility itself is an enterprise-wide network of information services, including applications and databases, on the local-area and wide-area networks. Servers on the local-area network (LAN) typically support files and file-based applications, such as electronic mail, bulletin boards, document preparation, and printing. Local-area servers also support a directory service, to help a desktop user find other users and find and connect to services of interest. Servers on the wide-area network (WAN) typically support access to databases, such as corporate directories and electronic libraries, or transaction processing applications, such as purchasing, billing, and inventory control. Some servers are gateways to services offered outside the enterprise, such as travel or information retrieval services, news feeds (e.g., weather, stock prices), and electronic document interchange with business partners. In response to such connectivity, some businesses are redefining their business processes to use the utility to bridge formerly isolated component activities. In the long term, the utility should provide the information that people need when, where, and how they need it. Today’s enterprise computing facilities are only an approximation of the vision of an information utility. Most organizations have a wide variety of heterogeneous hardware systems, including personal computers, workstations, minicomputers, and mainframes. These systems run different operating systems (OSs) and rely on different network architectures. As a result, integration is difficult and its achievement uneven.
For example, local-area servers are often isolated from the WAN. An appliance can access files and printers on its local server, but often not those on the servers of other LANs. Sometimes an application available on one local-area server is not available on other servers, because other departments use servers

Philip A. Bernstein

Papers

Concurrency Control and Recovery in Database Systems

A survey of approaches to automatic schema matching

Generic Schema Matching with Cupid

Concurrency Control in Distributed Database Systems

Middleware: a model for distributed system services