Matrix Factorization Techniques for Recommender Systems

Big Data computing and clouds

We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.

/pdf/large-scale-matrix-factorization-with-distributed-stochastic-zf7dr6dhxf.pdf

Large-scale matrix factorization with distributed stochastic gradient descent
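
To make the building block in the abstract above concrete, here is a minimal sketch of plain, sequential SGD for matrix factorization over the nonzero entries of a sparse matrix. It is not the stratified (SSGD) or distributed (DSGD) variant the paper develops; the function names, the squared-error loss, the learning rate, and the rank are all illustrative assumptions.

```python
import random

def squared_loss(entries, W, H):
    """Sum of squared errors over the observed (row, col, value) entries."""
    return sum((v - sum(w * h for w, h in zip(W[i], H[j]))) ** 2
               for i, j, v in entries)

def sgd_factorize(entries, n_rows, n_cols, rank=2, lr=0.02, epochs=500, seed=0):
    """Approximate V ~ W * H^T with plain SGD (hypothetical helper, not DSGD).

    W has one length-`rank` row per matrix row; H has one per matrix column.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(entries)
    for _ in range(epochs):
        rng.shuffle(entries)  # visit the nonzero entries in random order
        for i, j, v in entries:
            err = v - sum(w * h for w, h in zip(W[i], H[j]))
            for k in range(rank):
                w, h = W[i][k], H[j][k]  # use the old values for both updates
                W[i][k] += lr * err * h
                H[j][k] += lr * err * w
    return W, H

# Observed entries of a small rank-1 matrix, as (row, col, value) triples.
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0),
           (1, 1, 4.0), (2, 0, 3.0), (2, 1, 6.0)]
W, H = sgd_factorize(entries, n_rows=3, n_cols=2)
print(squared_loss(entries, W, H))
```

Each update touches only row i of W and row j of H, so entries that share neither a row nor a column can be processed independently. That independence is what makes a distributed version plausible.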

MapReduce, a prominent parallel data processing tool, is gaining significant momentum in both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.

https://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf

Parallel data processing with MapReduce: a survey

Many distributed storage systems achieve high data access throughput via partitioning and replication, each system with its own advantages and tradeoffs. In order to achieve high scalability, however, today's systems generally reduce transactional support, disallowing single transactions from spanning multiple partitions. Calvin is a practical transaction scheduling and data replication layer that uses a deterministic ordering guarantee to significantly reduce the normally prohibitive contention costs associated with distributed transactions. Unlike previous deterministic database system prototypes, Calvin supports disk-based storage, scales near-linearly on a cluster of commodity machines, and has no single point of failure. By replicating transaction inputs rather than effects, Calvin is also able to support multiple consistency levels (including Paxos-based strong consistency across geographically distant replicas) at no cost to transactional throughput.

/pdf/calvin-fast-distributed-transactions-for-partitioned-23kej1og2m.pdf

Calvin: fast distributed transactions for partitioned database systems
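
The idea of replicating transaction inputs rather than effects can be illustrated with a toy sketch: if every replica applies the same ordered log of deterministic transactions, every replica ends in the same state. This is not Calvin's scheduler or replication protocol; `apply_log` and `transfer` are hypothetical names introduced only for illustration.

```python
def transfer(src, dst, amount):
    """Build a deterministic transaction that moves `amount` if funds allow."""
    def txn(state):
        if state[src] >= amount:
            state[src] -= amount
            state[dst] += amount
    return txn

def apply_log(initial_state, ordered_inputs):
    """Apply transaction inputs in the agreed order and return the final state."""
    state = dict(initial_state)  # each replica works on its own copy
    for txn in ordered_inputs:
        txn(state)
    return state

start = {"a": 100, "b": 0}
log = [transfer("a", "b", 30), transfer("b", "a", 20)]

replica_1 = apply_log(start, log)
replica_2 = apply_log(start, log)
print(replica_1 == replica_2)                      # same order, same state
print(apply_log(start, list(reversed(log))))       # a different order can diverge
```

The second print shows why the order has to be agreed on up front: the same two transactions applied in reverse leave different balances.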

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.

Estimating the selectivity of tf-idf based cosine similarity predicates
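
For readers unfamiliar with the predicate being estimated, the sketch below computes tf-idf cosine similarity between short strings and then the exact selectivity of a threshold predicate by brute force. It does not implement the paper's estimator. The whitespace tokenization, the particular idf formula, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Map each string to a sparse {token: tf * idf} vector."""
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)            # document frequency
    idf = {tok: math.log(1 + n / df[tok]) for tok in df}    # one common idf variant
    return [{tok: tf * idf[tok] for tok, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def selectivity(strings, query_index, threshold):
    """Exact fraction of strings whose similarity to the query meets the threshold."""
    vecs = tfidf_vectors(strings)
    q = vecs[query_index]
    return sum(1 for v in vecs if cosine(q, v) >= threshold) / len(vecs)

names = ["john smith", "jon smith", "mary jones"]
print(selectivity(names, query_index=0, threshold=0.1))
```

The brute-force scan compares the query against every row, which is the cost an optimizer wants to avoid paying just to plan a query. That is the motivation for estimating the selectivity instead.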

Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads. This paper describes Spinnaker's Paxos-based replication protocol. The use of Paxos ensures that a data partition in Spinnaker will be available for reads and writes as long as a majority of its replicas are alive. Unlike traditional master-slave replication, this is true regardless of the failure sequence that occurs. We show that Paxos replication can be competitive with alternatives that provide weaker consistency guarantees. Compared to an eventually consistent datastore, we show that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes.

/pdf/using-paxos-to-build-a-scalable-consistent-and-highly-3vkvy55cie.pdf

Using Paxos to build a scalable, consistent, and highly available datastore
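
The availability condition in the abstract above comes down to majority arithmetic. The following sketch checks it for a given replica count; the function names are hypothetical and nothing here reflects Spinnaker's actual code.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a strict majority."""
    return n_replicas // 2 + 1

def is_available(alive, n_replicas):
    """A partition can serve reads and writes while a majority is alive."""
    return alive >= majority(n_replicas)

# With 3-way replication, one failure is tolerated and two are not.
print(is_available(alive=2, n_replicas=3))
print(is_available(alive=1, n_replicas=3))
```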

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.

/pdf/column-oriented-storage-techniques-for-mapreduce-3yjhe8idap.pdf

Column-oriented storage techniques for MapReduce
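
A small in-memory sketch may help show what column orientation and lazy record construction mean in general terms. It does not use Hadoop's APIs and does not reproduce the paper's skip list format; `to_columns` and `LazyRecord` are hypothetical names, and the sketch assumes every record has the same fields.

```python
def to_columns(records):
    """Pivot a list of row dicts into one list per field (column layout)."""
    columns = {}
    for rec in records:
        for name, value in rec.items():
            columns.setdefault(name, []).append(value)
    return columns

class LazyRecord:
    """View of one row that fetches a field only when it is asked for."""

    def __init__(self, columns, row):
        self._columns = columns
        self._row = row
        self.fields_read = 0  # counts how many fields were actually materialized

    def get(self, name):
        self.fields_read += 1
        return self._columns[name][self._row]

pages = [
    {"url": "a", "size": 10, "body": "long text 1"},
    {"url": "b", "size": 20, "body": "long text 2"},
]
columns = to_columns(pages)

# A job that needs only `size` scans one column and never touches `body`.
print(sum(columns["size"]))

rec = LazyRecord(columns, row=1)
print(rec.get("url"), rec.fields_read)
```

In a row layout, reading `size` would require parsing each full record, including the large `body` field. Keeping columns separate and building records lazily avoids that work for fields a job never reads.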

Today's enterprise databases are large and complex, often relating hundreds of entities. Enabling ordinary users to query such databases and derive value from them has been of great interest in database research. Today, keyword search over relational databases allows users to find pieces of information without having to write complicated SQL queries. However, in order to compute even simple aggregates, a user is required to write a SQL statement and can no longer use simple keywords. This not only requires the ordinary user to learn SQL, but also to learn the schema of the complex database in detail in order to correctly construct the required query. This greatly limits the options of the user who wishes to examine a database in more depth. As a solution to this problem, we propose a framework called SQAK (SQL Aggregates using Keywords) that enables users to pose aggregate queries using simple keywords with little or no knowledge of the schema. SQAK provides a novel and exciting way to trade off some of the expressive power of SQL in exchange for the ability to express a large class of aggregate queries using simple keywords. SQAK accomplishes this by taking advantage of the data in the database and the schema (tables, attributes, keys, and referential constraints). SQAK does not require any changes to the database engine and can be used with any existing database. We demonstrate using several experiments that SQAK is effective and can be an enormously powerful tool for ordinary users.

/pdf/sqak-doing-more-with-keywords-3vrn5uxbc2.pdf

Sandeep Tata

Papers

Estimating the selectivity of tf-idf based cosine similarity predicates

Using Paxos to build a scalable, consistent, and highly available datastore

Column-oriented storage techniques for MapReduce

SQAK: doing more with keywords