Journal•ISSN: 2332-7790

IEEE Transactions on Big Data

IEEE Computer Society

About: IEEE Transactions on Big Data is an academic journal published by the IEEE Computer Society. It publishes mainly in the areas of computer science and big data. Its ISSN is 2332-7790. Over its lifetime, the journal has published 525 papers, which have received 13488 citations. The journal is also known as Big data and Transactions on big data.

Topics: Computer science, Big data, Cloud computing, Cluster analysis, Artificial intelligence

Papers

Journal Article

Billion-Scale Similarity Search with GPUs

Jeff Johnson¹, Matthijs Douze¹, Hervé Jégou¹•Institutions (1)

Facebook¹

01 Jul 2021-IEEE Transactions on Big Data

TL;DR: This paper proposes a novel design for $k$-selection and applies it in different similarity search scenarios, optimizing brute-force, approximate, and compressed-domain search based on product quantization.

Abstract: Similarity search finds application in database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data parallel tasks such as distance computation, prior approaches in this domain are bottlenecked by algorithms that expose less parallelism, such as $k$-min selection, or make poor use of the memory hierarchy. We propose a novel design for $k$-selection. We apply it in different similarity search scenarios, by optimizing brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation operates at up to 55 percent of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5× faster than prior GPU state of the art. It enables the construction of a high accuracy $k$-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.

1,050 citations
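
The abstract above separates brute-force search into a distance computation step and a $k$-selection step. As a rough illustration of that split only, here is a minimal CPU sketch in plain Python. It is not the paper's GPU design: the function name `k_nearest`, the toy data, and the use of `heapq.nsmallest` for selection are all assumptions made for this example.

```python
import heapq
import math


def k_nearest(query, vectors, k):
    """Return the k (distance, index) pairs closest to query, nearest first.

    Hypothetical helper for illustration; not part of any published library.
    """
    # Distance step: one Euclidean distance per database vector.
    distances = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    # k-selection step: keep the k smallest without fully sorting the list.
    return heapq.nsmallest(k, distances)


# Toy 2-D database, invented for this example.
database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
result = k_nearest((0.0, 0.0), database, k=2)
print(result)
```

The selection step only has to track k candidates at a time, which is why it can be treated separately from the distance computation.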

Journal Article

Network Representation Learning: A Survey

Daokun Zhang¹, Jie Yin², Xingquan Zhu³, Chengqi Zhang¹•Institutions (3)

University of Technology, Sydney¹, University of Sydney², Florida Atlantic University³

01 Mar 2020-IEEE Transactions on Big Data

TL;DR: Network representation learning is a new learning paradigm that embeds network vertices into a low-dimensional vector space while preserving network topology structure, vertex content, and other side information.

Abstract: With the widespread use of information technologies, information networks are becoming increasingly popular to capture complex relationships across various disciplines, such as social networks, citation networks, telecommunication networks, and biological networks. Analyzing these networks sheds light on different aspects of social life such as the structure of societies, information diffusion, and communication patterns. In reality, however, the large scale of information networks often makes network analytic tasks computationally expensive or intractable. Network representation learning has been recently proposed as a new learning paradigm to embed network vertices into a low-dimensional vector space, by preserving network topology structure, vertex content, and other side information. This facilitates the original network to be easily handled in the new vector space for further analysis. In this survey, we perform a comprehensive review of the current literature on network representation learning in the data mining and machine learning field. We propose new taxonomies to categorize and summarize the state-of-the-art network representation learning techniques according to the underlying learning mechanisms, the network information intended to preserve, as well as the algorithmic designs and methodologies. We summarize evaluation protocols used for validating network representation learning including published benchmark datasets, evaluation methods, and open source algorithms. We also perform empirical studies to compare the performance of representative algorithms on common datasets, and analyze their computational complexity. Finally, we suggest promising research directions to facilitate future study.

494 citations

Journal Article

Petuum: A New Platform for Distributed Machine Learning on Big Data

Eric P. Xing¹, Qirong Ho², Wei Dai¹, Jin-Kyu Kim¹, Jinliang Wei¹, Seunghak Lee¹, Xun Zheng¹, Pengtao Xie¹, Abhimanu Kumar¹, Yaoliang Yu¹•Institutions (2)

Carnegie Mellon University¹, Agency for Science, Technology and Research²

01 Jun 2015-IEEE Transactions on Big Data

TL;DR: This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.

Abstract: What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, showing that Petuum allows ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.

395 citations
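
The abstract above mentions bounded-error network synchronization without giving the mechanism. One common way to bound such error is a staleness limit: a worker may run ahead of the slowest worker by at most a fixed number of iterations. The sketch below illustrates that idea only. The function `may_proceed`, its parameters, and the clock values are hypothetical and are not taken from the abstract or from Petuum's code.

```python
def may_proceed(worker_clock, all_clocks, staleness):
    """Return True if the worker is at most `staleness` iterations
    ahead of the slowest worker (hypothetical staleness rule)."""
    return worker_clock - min(all_clocks) <= staleness


# Iteration counters of three workers, invented for this example.
clocks = [3, 5, 4]

# The fastest worker is 2 iterations ahead of the slowest, so it must wait.
fast_can_run = may_proceed(clocks[1], clocks, staleness=1)
# The slowest worker is never blocked by this rule.
slow_can_run = may_proceed(clocks[0], clocks, staleness=1)
print(fast_can_run, slow_can_run)
```

With `staleness=0` the rule reduces to fully synchronous execution, and larger values trade parameter freshness for less waiting.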

Journal Article

Methodologies for Cross-Domain Data Fusion: An Overview

Yu Zheng¹•Institutions (1)

Microsoft¹

01 Mar 2015-IEEE Transactions on Big Data

TL;DR: High-level principles of each category of methods are introduced, and examples in which these techniques are used to handle real big data problems are given, to help a wide range of communities find a solution for data fusion in big data projects.

Abstract: Traditional data mining usually deals with data from a single domain. In the big data era, we face a diversity of datasets from different sources in different domains. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. How to unlock the power of knowledge from multiple disparate (but potentially connected) datasets is paramount in big data research, essentially distinguishing big data from traditional data mining tasks. This calls for advanced techniques that can fuse knowledge from various datasets organically in a machine learning and data mining task. This paper summarizes the data fusion methodologies, classifying them into three categories: stage-based, feature level-based, and semantic meaning-based data fusion methods. The last category of data fusion methods is further divided into four groups: multi-view learning-based, similarity-based, probabilistic dependency-based, and transfer learning-based methods. These methods focus on knowledge fusion rather than schema mapping and data merging, significantly distinguishing between cross-domain data fusion and traditional data fusion studied in the database community. This paper does not only introduce high-level principles of each category of methods, but also give examples in which these techniques are used to handle real big data problems. In addition, this paper positions existing works in a framework, exploring the relationship and difference between different data fusion methods. This paper will help a wide range of communities find a solution for data fusion in big data projects.

356 citations

Journal Article

Activity-Based Human Mobility Patterns Inferred from Mobile Phone Data: A Case Study of Singapore

Shan Jiang¹, Joseph Ferreira¹, Marta C. González¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jun 2017-IEEE Transactions on Big Data

TL;DR: This research provides an innovative data mining framework that synthesizes state-of-the-art techniques for extracting mobility patterns from raw mobile phone CDR data, and designs a pipeline that translates massive, passively collected mobile phone records into meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes.

Abstract: In this study, with Singapore as an example, we demonstrate how we can use mobile phone call detail record (CDR) data, which contains millions of anonymous users, to extract individual mobility networks comparable to the activity-based approach. Such an approach is widely used in the transportation planning practice to develop urban micro simulations of individual daily activities and travel; yet it depends highly on detailed travel survey data to capture individual activity-based behavior. We provide an innovative data mining framework that synthesizes the state-of-the-art techniques in extracting mobility patterns from raw mobile phone CDR data, and design a pipeline that can translate the massive and passive mobile phone records to meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes. With growing ubiquitous mobile sensing, and shrinking labor and fiscal resources in the public sector globally, the method presented in this research can be used as a low-cost alternative for transportation and planning agencies to understand the human activity patterns in cities, and provide targeted plans for future sustainable development.

351 citations

Performance

Metrics

Papers: 707

Citations: 14,011

No. of papers from the Journal in previous years
Year	Papers
2023	173
2022	133
2021	81
2020	107
2019	70
2018	52