
Showing papers in "IEEE Transactions on Big Data in 2015"


Journal ArticleDOI
TL;DR: This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.
Abstract: What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, showing that Petuum allows ML programs to run in much less time and at considerably larger model sizes, even on modestly sized compute clusters.
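
The bounded-error synchronization the abstract mentions can be illustrated with a minimal sketch of stale synchronous coordination: workers may read slightly stale parameters as long as no worker runs more than a fixed number of iterations ahead of the slowest one. The class and method names below are illustrative assumptions, not Petuum's actual API.

```python
# Minimal sketch of bounded-staleness (stale synchronous) coordination, the kind of
# relaxed consistency the Petuum abstract describes. All names here are illustrative.
import numpy as np

class ParameterServer:
    def __init__(self, dim, staleness, n_workers):
        self.params = np.zeros(dim)        # shared model parameters
        self.staleness = staleness         # max clock gap tolerated between workers
        self.clocks = [0] * n_workers      # per-worker iteration counters

    def can_proceed(self, worker_id):
        # A worker may start its next iteration only if it is at most `staleness`
        # iterations ahead of the slowest worker (bounded-error consistency).
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness

    def push(self, worker_id, grad, lr=0.1):
        self.params -= lr * grad           # apply a (possibly stale) update
        self.clocks[worker_id] += 1        # advance this worker's clock

    def pull(self):
        return self.params.copy()          # workers read a possibly stale snapshot

ps = ParameterServer(dim=10, staleness=2, n_workers=4)
if ps.can_proceed(worker_id=0):
    grad = np.random.randn(10)             # stand-in for a gradient on local data
    ps.push(0, grad)
```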

395 citations


Journal ArticleDOI
Yu Zheng1
TL;DR: High-level principles of each category of methods are introduced, and examples in which these techniques are used to handle real big data problems are given, to help a wide range of communities find a solution for data fusion in big data projects.
Abstract: Traditional data mining usually deals with data from a single domain. In the big data era, we face a diversity of datasets from different sources in different domains. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. How to unlock the power of knowledge from multiple disparate (but potentially connected) datasets is paramount in big data research, essentially distinguishing big data from traditional data mining tasks. This calls for advanced techniques that can fuse knowledge from various datasets organically in a machine learning and data mining task. This paper summarizes the data fusion methodologies, classifying them into three categories: stage-based, feature level-based, and semantic meaning-based data fusion methods. The last category of data fusion methods is further divided into four groups: multi-view learning-based, similarity-based, probabilistic dependency-based, and transfer learning-based methods. These methods focus on knowledge fusion rather than schema mapping and data merging, significantly distinguishing cross-domain data fusion from the traditional data fusion studied in the database community. This paper not only introduces high-level principles of each category of methods, but also gives examples in which these techniques are used to handle real big data problems. In addition, this paper positions existing works in a framework, exploring the relationship and difference between different data fusion methods. This paper will help a wide range of communities find a solution for data fusion in big data projects.

356 citations


Journal ArticleDOI
TL;DR: In WeSed, a novel weakly weighted pairwise ranking loss is effectively utilized to handle weakly labeled images, while a triplet similarity loss is employed to harness unlabeled images, enabling a deep convolutional neural network to be trained with images from social networks.
Abstract: In this paper, we study leveraging both weakly labeled images and unlabeled images for multi-label image annotation. Motivated by the recent advances in deep learning, we propose an approach called weakly semi-supervised deep learning for multi-label image annotation (WeSed). In WeSed, a novel weakly weighted pairwise ranking loss is effectively utilized to handle weakly labeled images, while a triplet similarity loss is employed to harness unlabeled images. WeSed enables us to train a deep convolutional neural network (CNN) with images from social networks, where images are either only weakly labeled with several labels or unlabeled. We also design an efficient algorithm to sample high-quality image triplets from large image datasets to fine-tune the CNN. WeSed is evaluated on benchmark datasets for multi-label annotation. The experiments demonstrate the effectiveness of our proposed approach and show that leveraging the weakly labeled images and unlabeled images leads to significantly better performance.
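
A minimal NumPy sketch of the two loss ingredients named in the abstract: a weighted pairwise ranking loss over weakly labeled tags and a triplet similarity loss over unlabeled images. The specific weighting and margin choices are assumptions, not the exact WeSed formulation.

```python
import numpy as np

def weighted_pairwise_ranking_loss(scores, labels, weights):
    """Hinge-style ranking loss: tags marked positive should score higher than
    unmarked tags; `weights` down-weights unreliable weak labels. The exact
    weighting used in WeSed is not reproduced here."""
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    loss = 0.0
    for p in pos:
        margins = np.maximum(0.0, 1.0 - scores[p] + scores[neg])
        loss += weights[p] * margins.sum()
    return loss

def triplet_similarity_loss(anchor, positive, negative, margin=1.0):
    """Pull an unlabeled image's feature toward a visually similar image and push
    it away from a dissimilar one, as in standard triplet losses."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)
```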

107 citations


Journal ArticleDOI
TL;DR: This paper investigates how to establish the relationship between semantic concepts based on large-scale real-world click data from a commercial image search engine, which is a challenging topic because the click data suffers from noise such as typos.
Abstract: In this paper, we investigate how to establish the relationship between semantic concepts based on large-scale real-world click data from a commercial image search engine, which is a challenging topic because the click data suffers from noise such as typos and different queries referring to the same concept. We first define five specific relationships between concepts. We then extract concept relationship features in the textual and visual domains to train the concept relationship models. The relationship of each pair of concepts is thus classified into one of the five relationships. We study the efficacy of the conceptual relationships by applying them to augment imperfect image tags, i.e., to improve their representative power. We further employ a sophisticated hashing approach to transform the augmented image tags into binary codes, which are subsequently used for a content-based image retrieval task. Experimental results on the NUS-WIDE dataset demonstrate the superiority of our proposed approach as compared to state-of-the-art methods.
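
A minimal sketch of the relationship-classification step described above: each concept pair is represented by textual and visual features and classified into one of the five relationships. The feature names, placeholder relationship labels, and classifier choice are assumptions for illustration only.

```python
# Sketch of the relationship-classification step: each concept pair is described by
# textual and visual signals and classified into one of five relationships.
# Feature construction, relation names, and the classifier are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

RELATIONS = ["synonym", "parent-child", "sibling", "co-occurring", "irrelevant"]  # placeholders

def pair_features(textual_sim, visual_sim, click_overlap):
    # Combine per-pair signals (e.g., query-text similarity, visual similarity of
    # clicked images, overlap of clicking users) into one feature vector.
    return np.array([textual_sim, visual_sim, click_overlap])

# X: one feature vector per labeled concept pair; y: index into RELATIONS (toy data).
X_train = np.random.rand(200, 3)
y_train = np.random.randint(0, len(RELATIONS), size=200)

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(RELATIONS[clf.predict(pair_features(0.8, 0.6, 0.3).reshape(1, -1))[0]])
```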

100 citations


Journal ArticleDOI
TL;DR: This work proposes a novel unsupervised hashing approach, namely robust discrete hashing (RDSH), to facilitate large-scale semantic indexing of image data, and integrates a flexible $\ell _{2,p}$ loss with nonlinear kernel embedding to adapt to different noise levels.
Abstract: In the big data era, ever-increasing image data has posed significant challenges for modern image retrieval. It is of great importance to index images with semantic keywords efficiently and effectively, especially given the fast-evolving nature of the web. Learning-based hashing has shown its power in handling large-scale high-dimensional applications such as image retrieval. Existing solutions normally separate the process of learning binary codes and hash functions into two independent stages to bypass the challenge of the discrete constraints on binary codes. In this work, we propose a novel unsupervised hashing approach, namely robust discrete hashing (RDSH), to facilitate large-scale semantic indexing of image data. Specifically, RDSH simultaneously learns discrete binary codes as well as robust hash functions within a unified model. In order to suppress the influence of unreliable binary codes and learn robust hash functions, we also integrate a flexible $\ell _{2,p}$ loss with nonlinear kernel embedding to adapt to different noise levels. Finally, we devise an alternating algorithm to efficiently optimize the RDSH model. Given a test image, we first conduct $r$ -nearest-neighbor search based on the Hamming distance of binary codes, and then propagate the semantic keywords of the neighbors to the test image. Extensive experiments on various real-world image datasets show its superiority over state-of-the-art methods in large-scale semantic indexing.
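
The retrieval step described at the end of the abstract (r-nearest-neighbor search by Hamming distance, followed by keyword propagation) can be sketched directly; learning the RDSH codes themselves is not reproduced here.

```python
import numpy as np
from collections import Counter

def hamming_search_and_propagate(query_code, db_codes, db_keywords, r=5, top_k=3):
    """Given binary codes (0/1 arrays), find the r nearest database items by Hamming
    distance and propagate their most frequent keywords to the query image. This
    mirrors only the retrieval step described in the abstract."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)  # Hamming distances
    neighbors = np.argsort(dists)[:r]                         # r nearest codes
    votes = Counter(kw for i in neighbors for kw in db_keywords[i])
    return [kw for kw, _ in votes.most_common(top_k)]
```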

85 citations


Journal ArticleDOI
TL;DR: This paper proposes a practical method, Shadow Coding, to preserve privacy in data transmission and ensure recovery in data collection, achieving privacy-preserving computation in a data-recoverable, efficient, and scalable way.
Abstract: Data collection is required to be safe and efficient, considering both data privacy and system performance. In this paper, we study a new problem: distributed data sharing with privacy-preserving requirements. Given a data demander requesting data from multiple distributed data providers, the objective is to enable the data demander to access the distributed data without knowing the private data of any individual provider. The problem is challenged by two questions: how to transmit the data safely and accurately, and how to handle data streams efficiently. As the first study, we propose a practical method, Shadow Coding, to preserve privacy in data transmission and ensure recovery in data collection, which achieves privacy-preserving computation in a data-recoverable, efficient, and scalable way. We also provide practical techniques to make Shadow Coding efficient and safe in data streams. An extensive experimental study on a large-scale real-life dataset offers insight into the performance of our scheme. The proposed scheme is also implemented as a pilot system in a city to collect distributed mobile phone data.
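
The abstract does not spell out the coding scheme itself, so the sketch below uses a standard additive-masking (secret-sharing style) aggregation as an illustrative stand-in for privacy-preserving, recoverable data collection; it is not Shadow Coding.

```python
# Illustrative stand-in only: the abstract does not detail Shadow Coding, so this
# sketch shows additive masking, in which the demander recovers the sum of the
# providers' values without ever seeing any individual provider's value.
import random

def split_into_shares(value, n_shares, modulus=2**31):
    shares = [random.randrange(modulus) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % modulus)   # shares sum to value mod modulus
    return shares

def aggregate(all_shares, modulus=2**31):
    # The demander only ever sees sums of shares, never an individual value.
    return sum(sum(s) for s in zip(*all_shares)) % modulus

providers = [42, 17, 99]                               # private values at 3 providers
shared = [split_into_shares(v, 3) for v in providers]  # each provider splits its value
print(aggregate(shared))                               # recovers 42 + 17 + 99 = 158
```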

34 citations


Journal ArticleDOI
TL;DR: This work proposes efficient methods for processing RDF using dynamic data re-partitioning to enable rapid analysis of large datasets and proposes methods to replace some secondary indexes with distributed filters, so as to decrease memory consumption.
Abstract: Distributed RDF data management systems become increasingly important with the growth of the Semantic Web. Nevertheless, current methods meet performance bottlenecks in either data loading or querying when processing large amounts of data. In this work, we propose efficient methods for processing RDF using dynamic data re-partitioning to enable rapid analysis of large datasets. Our approach adopts a two-tier index architecture on each computation node: (1) a lightweight primary index, to keep loading times low, and (2) a series of dynamic, multi-level secondary indexes, calculated as a by-product of query execution, to decrease or remove inter-machine data movement for subsequent queries that contain the same graph patterns. In addition, we propose methods to replace some secondary indexes with distributed filters, so as to decrease memory consumption. Experimental results on a commodity cluster with 16 nodes show that the method presents good scale-out characteristics and can indeed vastly improve loading speeds while remaining competitive in terms of query performance. Specifically, our approach can load a dataset of 1.1 billion triples at a rate of 2.48 million triples per second and provide competitive performance to RDF-3X and 4store for expensive queries.
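
The "secondary index as a by-product of query execution" idea can be sketched as a cache keyed by graph pattern: the first query containing a pattern pays the evaluation cost, and later queries reuse the materialized result. The class and method names are assumptions, not the paper's implementation.

```python
# Minimal sketch: a lightweight primary index keeps loading cheap, while secondary
# indexes are materialized lazily the first time a graph pattern is evaluated and
# reused by later queries. Names below are illustrative only.
class NodeIndex:
    def __init__(self, triples):
        # Lightweight primary index: subject -> list of (predicate, object) pairs.
        self.primary = {}
        for s, p, o in triples:
            self.primary.setdefault(s, []).append((p, o))
        self.secondary = {}                   # graph pattern -> cached result bindings

    def match(self, pattern):
        if pattern in self.secondary:         # reuse an index built by an earlier query
            return self.secondary[pattern]
        p_wanted, o_wanted = pattern
        result = [s for s, pos in self.primary.items()
                  if any(p == p_wanted and (o_wanted is None or o == o_wanted)
                         for p, o in pos)]
        self.secondary[pattern] = result      # materialize the secondary index
        return result

idx = NodeIndex([("alice", "knows", "bob"), ("bob", "knows", "carol")])
print(idx.match(("knows", None)))             # first call evaluates and caches
print(idx.match(("knows", None)))             # second call hits the cache
```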

20 citations


Journal ArticleDOI
TL;DR: A text representation framework is presented by harnessing the power of semantic knowledge bases, i.e., Wikipedia and WordNet, to organize the large number of messages into clusters with meaningful cluster labels, thus providing an overview of the content to fulfill users' information needs.
Abstract: The explosive popularity of microblogging services produces a large volume of microblogging messages. It is difficult for a user to quickly gauge his/her followees’ opinions when the user interface is overwhelmed by a large number of messages. Useful information is buried in disorganized, incomplete, and unstructured text messages. We propose to organize the large number of messages into clusters with meaningful cluster labels, thus providing an overview of the content to fulfill users’ information needs. Clustering and labeling of microblogging messages are challenging because the messages are much shorter than conventional text documents. They usually cannot provide sufficient term co-occurrence information for capturing their semantic associations. As a result, traditional text representation models tend to yield unsatisfactory performance. In this paper, we present a text representation framework that harnesses the power of semantic knowledge bases, i.e., Wikipedia and WordNet. The originally uncorrelated texts are connected with the semantic representation, which enhances the performance of short text clustering and labeling. The experimental results on Twitter and Facebook datasets demonstrate the superior performance of our framework in handling noisy and short microblogging messages.
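
A minimal sketch of the semantic-enrichment idea for short texts: expand each message with WordNet synonyms so that related messages share vocabulary, then cluster as usual. Only the WordNet half of the framework is shown (Wikipedia-based linking is omitted), and the expansion strategy is an assumption.

```python
# Sketch: expand short messages with WordNet synonyms so semantically related
# messages share terms, then cluster with TF-IDF + k-means. Illustrative only.
from nltk.corpus import wordnet as wn          # requires: nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def expand(message, max_syns=3):
    extra = []
    for token in message.lower().split():
        for syn in wn.synsets(token)[:1]:               # take the top synset only
            extra.extend(syn.lemma_names()[:max_syns])  # add its lemmas as features
    return message + " " + " ".join(extra)

messages = ["apple releases new phone", "fresh fruit prices rise", "new smartphone launch"]
tfidf = TfidfVectorizer().fit_transform([expand(m) for m in messages])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(tfidf)
print(labels)
```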

13 citations


Journal ArticleDOI
TL;DR: The first issue of the IEEE Transactions on Big Data, with Qiang Yang as founding editor-in-chief, is published; the journal will provide cross-disciplinary innovative research ideas and application results for big data, including novel theory, algorithms, and applications.
Abstract: It is with great excitement that we are publishing the first issue of the IEEE Transactions on Big Data. Big data has been an important topic for the IEEE for many years and big data is already transforming our world. Now is the time to publish a high-quality, peer-reviewed, multi-disciplinary journal that will document research that will help us navigate the opportunities and potential pitfalls, both ethical and technical, of big data. As described in our charter, IEEE TBDATA “will provide cross disciplinary innovative research ideas and applications results for big data including novel theory, algorithms and applications. Research areas for big data include, but are not restricted to, big data analytics, big data visualization, big data curation and management, big data semantics, big data infrastructure, big data standards, big data performance analyses, intelligence from big data, scientific discovery from big data, security, privacy, and legal issues specific to big data.” IEEE TBDATA is fortunate to have both financial and technical sponsorship from eight IEEE societies and one council (IEEE Computer Society, IEEE Communications Society, IEEE Computational Intelligence Society, IEEE Sensors Council, IEEE Consumer Electronics Society, IEEE Signal Processing Society, IEEE Systems, Man, and Cybernetics Society, IEEE Systems Council, IEEE Vehicular Technology Society, IEEE Control Systems Society, IEEE Power and Energy Society, and IEEE Biometrics Society), with the IEEE Computer Society serving as the administrative partner. Each sponsor nominates an associate editor, and the collection of sponsors behind IEEE TBDATA illustrates its importance to the IEEE community and provides both technical depth and breadth to our editorial board, which ensures the highest quality and relevance of our content. I am especially pleased to introduce Qiang Yang as our founding editor-in-chief. Professor Yang’s biography speaks for itself, but I would like to express the gratitude of the Steering Committee to Qiang for his service. We are very fortunate to have such a uniquely qualified founding EIC. Qiang’s qualifications in research and management and editorial experience made him a unanimous selection as EIC, and he has been a pleasure to work with. I would like to encourage our readership to follow the highest quality research in big data by subscribing to IEEE TBDATA and to contribute articles in response to both general and special issue calls for papers. Information on subscribing and submitting can be found at our website: http://www.computer.org/web/tbd.

9 citations


Journal ArticleDOI
TL;DR: A preference learning model is proposed to quantitatively study and formulate the best image search result list identification problem, and a set of valuable preference-learning-related features is proposed by exploring the visual characteristics of returned images.
Abstract: Image retrieval plays an increasingly important role in our daily lives. There are many factors which affect the quality of image search results, including the chosen search algorithms, ranking functions, and indexing features. Applying different settings for these factors generates search result lists with varying levels of quality. However, no setting can always perform optimally for all queries. Therefore, given a set of search result lists generated by different settings, it is crucial to automatically determine which result list is the best in order to present it to users. This paper aims to solve this problem and makes four main innovations. First, a preference learning model is proposed to quantitatively study and formulate the best image search result list identification problem. Second, a set of valuable preference-learning-related features is proposed by exploring the visual characteristics of returned images. Third, a query-dependent preference learning model is further designed for building a more precise and query-specific model. Fourth, the proposed approach has been tested on a variety of applications including reranking ability assessment, optimal search engine selection, and synonymous query suggestion. Extensive experimental results on three image search datasets demonstrate the effectiveness and promising potential of the proposed method.
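
A minimal sketch of the standard pairwise reduction behind preference learning over result lists: represent each candidate list by a feature vector, train a classifier on feature differences labeled by which list is preferred, and rank lists by the learned scoring direction. The features and data are synthetic and illustrative, not the paper's feature set.

```python
# Pairwise preference learning sketch over candidate result lists (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
list_features = rng.random((100, 5))          # feature vectors of candidate result lists
quality = list_features @ np.array([0.5, -0.2, 0.9, 0.1, 0.3])  # hidden "true" quality

# Build preference pairs: label 1 if list i is better than list j, else 0.
pairs, labels = [], []
for _ in range(500):
    i, j = rng.integers(0, 100, size=2)
    pairs.append(list_features[i] - list_features[j])
    labels.append(int(quality[i] > quality[j]))

model = LogisticRegression().fit(np.array(pairs), labels)
best = np.argmax(list_features @ model.coef_.ravel())  # pick the best-scoring list
print(best)
```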

8 citations


Journal ArticleDOI
TL;DR: Experimental results show that LS-AMS can greatly improve query performance without increasing the update cost and improves self-adaptability in dynamic environments.
Abstract: Indexing microblogs for realtime search is challenging because new microblogs are created at tremendous speed and user query requests keep constantly changing. To guarantee that users obtain complete query results, microblogging sites maintain huge indices, which leads to index fragmentation or extra merging overhead during realtime search. This paper proposes an efficient Log-Structured index structure with an Adaptive Merging Strategy (LS-AMS) for realtime search on microblogs. The LS-AMS structure consists of an inverted index buffer and a sequence of dynamically adjustable index packages with exponentially increasing sizes. These index packages manage their inverted indices using an adaptive merging strategy, which can reduce the merging overhead to improve query performance and can adjust the index structure based on environmental factors, such as the arrival rate of query requests and new microblogs. Experimental results show that LS-AMS can greatly improve query performance without increasing the update cost and improves self-adaptability in dynamic environments.
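
A minimal sketch of the log-structured idea behind LS-AMS: new postings accumulate in an in-memory buffer, flushed buffers become segments, and segments of similar size are merged so that segment sizes grow roughly exponentially. The adaptive, load-aware part of the merging strategy is not modeled; all names are illustrative.

```python
# Log-structured inverted index sketch with LSM-style merging (illustrative only).
from collections import defaultdict

class LogStructuredIndex:
    def __init__(self, buffer_limit=4):
        self.buffer = defaultdict(list)        # term -> postings, in memory
        self.buffer_limit = buffer_limit
        self.segments = []                     # list of (size, postings-dict), on "disk"

    def add(self, doc_id, terms):
        for t in terms:
            self.buffer[t].append(doc_id)
        if sum(len(p) for p in self.buffer.values()) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        self.segments.append((sum(len(p) for p in self.buffer.values()), dict(self.buffer)))
        self.buffer = defaultdict(list)
        # Merge segments of similar size so sizes grow roughly exponentially.
        self.segments.sort(key=lambda s: s[0])
        while len(self.segments) > 1 and self.segments[0][0] * 2 > self.segments[1][0]:
            (n1, a), (n2, b) = self.segments.pop(0), self.segments.pop(0)
            for t, p in a.items():
                b.setdefault(t, []).extend(p)
            self.segments.append((n1 + n2, b))
            self.segments.sort(key=lambda s: s[0])

    def query(self, term):
        hits = list(self.buffer.get(term, []))
        for _, seg in self.segments:
            hits.extend(seg.get(term, []))
        return hits

idx = LogStructuredIndex()
for d, text in enumerate(["big data index", "realtime search", "index merge", "search log"]):
    idx.add(d, text.split())
print(idx.query("index"))                      # doc ids from buffer and merged segments
```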

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed CCH algorithm to learn discrete binary hash codes outperforms state-of-the-art hashing methods in both image retrieval and classification tasks, especially with short binary codes.
Abstract: Learning-based hashing techniques have attracted broad research interest in the Big Media research area. They aim to learn compact binary codes which can preserve semantic similarity in the Hamming embedding. However, the discrete constraints imposed on binary codes typically make hashing optimizations very challenging. In this paper, we present a code consistent hashing (CCH) algorithm to learn discrete binary hash codes. To form a simple yet efficient hashing objective function, we introduce a new code consistency constraint to leverage discriminative information and propose to utilize the Hadamard code, which favors an information-theoretic criterion, as the class prototype. By keeping the discrete constraint and introducing an orthogonal constraint, our objective function can be minimized efficiently. Experimental results on three benchmark datasets demonstrate that the proposed CCH outperforms state-of-the-art hashing methods in both image retrieval and classification tasks, especially with short binary codes.
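
The "Hadamard code as class prototype" idea can be sketched directly: rows of a Hadamard matrix are mutually orthogonal, so as binary codes they are maximally and uniformly separated in Hamming distance, which makes them natural target codes for different classes. How CCH fits hash functions to these targets is not reproduced here.

```python
# Sketch: use rows of a Hadamard matrix as per-class target codes; any two distinct
# rows differ in exactly half of the bit positions. Only the prototype construction
# is shown, not the CCH optimization itself.
import numpy as np
from scipy.linalg import hadamard

def class_prototypes(n_classes, code_length):
    H = hadamard(code_length)                 # code_length must be a power of 2
    codes = (H + 1) // 2                      # map {-1, +1} entries to {0, 1} bits
    return codes[1:n_classes + 1]             # skip the all-ones row, one code per class

protos = class_prototypes(n_classes=10, code_length=32)
# Any two prototypes differ in exactly code_length / 2 = 16 bit positions.
dists = [np.count_nonzero(protos[i] != protos[j]) for i in range(10) for j in range(i + 1, 10)]
assert set(dists) == {16}
```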

Journal ArticleDOI
TL;DR: The inaugural issue of the IEEE Transactions on Big Data (IEEE TBDATA) is presented, which will serve as a forum for the Big Data community to exchange ideas and report its successes, and will cover the following broad areas.
Abstract: It is my great pleasure to present this inaugural issue of the IEEE Transactions on Big Data (IEEE TBDATA). Big Data is a new field that encompasses multiple disciplines and impacts a wide range of sectors of our society. Its rapid rise in recent years can be attributed to several technological advances. The increasing availability of sensors has made data generation and collection easier and cheaper. Advances in telecommunications technologies and services have facilitated the massive exchange of data among client devices, data centers, and clouds. The fast reduction in data storage and processing costs has given rise to rapid growth in available computational power. As a result, novel applications are widely found that span diverse fields as never before. Big Data can be characterized by its extraordinary characteristics along several dimensions. The first of the dimensions is the size of data. Data sets grow in size partly because they are being gathered by cheaper and easier-to-operate information-sensing mobile devices. This is referred to as Volume by industry leaders [1]. Other dimensions are equally important, including Big Data’s Variety (the data types are many and heterogeneous), Velocity (the data is generated and processed at high speed to meet demands), and Veracity (the quality of the data being captured can vary greatly). These complexities pose a major challenge as well as a new opportunity for today’s information technology communities. The term Big Data goes well beyond the data itself; it is also often used to refer to a new methodology for approaching our problems and solutions. As pointed out in [2], our scientific advances fall into different stages, or paradigms, as the human race moves forward. The first paradigm is known as the empirical stage, which happened when scientific discovery was mainly driven by recording empirical observations through tools such as telescopes. The second stage was when theories were introduced to summarize the observations and make predictions. Scientists such as Newton used mathematics and physical laws to build models to explain the empirical observations. The third paradigm came as a result of the arrival of digital computers, when large-scale simulations were used to mimic the dynamics of nature. With the arrival of Big Data, we are at the beginning of the fourth paradigm of scientific discovery, when knowledge discovery is done through hypothesis testing driven by the availability of massive digital data. In this fourth-paradigm way of scientific thinking, data becomes a first-class citizen, giving birth to the particular practice of knowledge discovery known as Data Science. Thus, Big Data is situated at the crossroads of many disciplines, and this new IEEE Transactions on Big Data aspires to lead this technological revolution to the next level. The journal will serve as a forum for the Big Data community to exchange ideas and report its successes. In particular, the journal will cover the following broad areas:

Journal ArticleDOI
TL;DR: The articles in this special section aim at presenting the latest developments, trends, and solutions of Big Data analytics on the web.
Abstract: The articles in this special section aim at presenting the latest developments, trends, and solutions of Big Data analytics on the web.