TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. Tensor-Flow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.

/pdf/tensorflow-a-system-for-large-scale-machine-learning-10wgdjhl6k.pdf

TensorFlow: a system for large-scale machine learning

Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

/pdf/anomaly-detection-a-survey-1x33g9m0a3.pdf

Anomaly detection: A survey

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.

TensorFlow: A system for large-scale machine learning

XML is quickly becoming the de facto standard for data exchange over the Internet. This is creating a new set of data management requirements involving XML, such as the need to store and query XML documents. Researchers have proposed using relational database systems to satisfy these requirements by devising ways to "shred" XML documents into relations, and translate XML queries into SQL queries over these relations. However, a key issue with such an approach, which has largely been ignored in the research literature, is how (and whether) the ordered XML data model can be efficiently supported by the unordered relational data model. This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system. This is accomplished by encoding order as a data value. We propose three order encoding methods that can be used to represent XML order in the relational data model, and also propose algorithms for translating ordered XPath expressions into SQL using these encoding methods. Finally, we report the results of an experimental study that investigates the performance of the proposed order encoding methods on a workload of ordered XML queries and updates.

http://db.ucsd.edu/static/cse232b-archive/papers/tatarinov02.pdf

Storing and querying ordered XML using a relational database system

In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as could computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine the several representative applications of big data, including enterprise management, Internet of Things, online social networks, medial applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area. This survey is concluded with a discussion of open problems and future directions.

Big Data: A Survey

System, method and computer program products for probing a hash table by receiving a compressed input key, computing a hash value for the compressed input key and probing one or more buckets in a hash table for a match. Each bucket includes multiple chunks. For a bucket in the hash table, chunks are searched in that bucket by comparing in parallel the hash value with multiple slots in each chunk, such that if a value in a chunk equals the hash value of the compressed input key, then a match is declared and a vector is returned with a significant bit of a matching slot in the bucket set to a value. If a value stored in a chunk corresponds to an empty slot, then a mismatch is declared, and the vector is returned as the result with the significant bit of a matching empty slot set to the value.

Systems, methods and computer program products for probing a hash table for improved latency and scalability in a processing system

Additional work completed after the publication of "XTABLES: Bridging relational technology and XML" (IBM Systems Journal41, No. 4, 2002) indicated that a number of figures containing XQUERY and XML commands in that paper require modifications or additions in order to be correct and complete. We present the modified figures and other queries, along with SQL commands that can be used to generate the sample data described in the original paper and those produced by the modified queries.

Technical note-- XTABLES: Bridging relational technology and XML

Systems, methods and articles of manufacture are disclosed for synchronizing a primary data system with an auxiliary data system that processes data for the primary data system. In one embodiment, how current the primary data system and the auxiliary data system are may be determined. Requests sent from the primary data system that were not processed by the auxiliary data system may be determined. The requests may be resent to the auxiliary data system for processing.

Synchronizing an auxiliary data system with a primary data system

This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop's MapReduce framework. Jaql is currently used in IBM's InfoSphere BigInsights [5] and Cognos Consumer Insight [9] products. Jaql's design features are: (1) a flexible data model, (2) reusability, (3) varying levels of abstraction, and (4) scalability. Jaql's data model is inspired by JSON and can be used to represent datasets that vary from flat, relational tables to collections of semistructured documents. A Jaql script can start without any schema and evolve over time from a partial to a rigid schema. Reusability is provided through the use of higher-order functions and by packaging related functions into modules. Most Jaql scripts work at a high level of abstraction for concise specification of logical operations (e.g., join), but Jaql's notion of physical transparency also provides a lower level of abstraction if necessary. This allows users to pin down the evaluation plan of a script for greater control or even add new operators. The Jaql compiler automatically rewrites Jaql scripts so they can run in parallel on Hadoop. In addition to describing Jaql's design, we present the results of scale-up experiments on Hadoop running Jaql scripts for intranet data analysis and log processing.

Jaql: A Scripting Language for Large Scale Semistructured Data Analysis

Maintaining strict static score order of inverted lists is a heuristic used by search engines to improve the quality of query results when the entire inverted lists cannot be processed. This heuristic, however, increases the cost of index generation and requires complex index build algorithms. In this paper, we study a new index organization based on static score bucketing. We show that this new technique significantly improves in index build performance while having minimal impact on the quality of search results.

/pdf/static-score-bucketing-in-inverted-indexes-4npewqv5k6.pdf

Eugene J. Shekita

Papers

Systems, methods and computer program products for probing a hash table for improved latency and scalability in a processing system

Technical note-- XTABLES: Bridging relational technology and XML

Synchronizing an auxiliary data system with a primary data system

Jaql: A Scripting Language for Large Scale Semistructured Data Analysis

Static score bucketing in inverted indexes