Book Chapter

Mining of Massive Datasets: Data Mining

About: The article was published on 2011-01-01. It has received 138 citations to date.
Citations
Journal Article
Junfei Qiu, Qihui Wu, Guoru Ding, Yuhua Xu, Shuo Feng
TL;DR: A literature survey of the latest advances in research on machine learning for big data processing finds some promising learning methods in recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning.
Abstract: There is no doubt that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges. In this paper, we present a literature survey of the latest advances in research on machine learning for big data processing. First, we review the machine learning techniques and highlight some promising learning methods in recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning. Next, we focus on the analysis and discussions about the challenges and possible solutions of machine learning for big data. Following that, we investigate the close connections of machine learning with signal processing techniques for big data processing. Finally, we outline several open issues and research trends.

636 citations


Cites background from "Mining of Massive Datasets: Data Mi..."

  • ...In [11, 12], the authors proposed various big data processing models and algorithms from the data mining perspective....

    [...]

Proceedings Article
07 Dec 2015
TL;DR: This paper proposes and develops the Deep Feature Synthesis algorithm for automatically generating features for relational datasets, and implements a generalizable machine learning pipeline and tunes it using a novel Gaussian Copula process based approach.
Abstract: In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor's score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission's score.

297 citations

Journal Article
TL;DR: Three classification models are used for text classification using Waikato Environment for Knowledge Analysis (WEKA) and the results show that Naïve Bayesian outperformed Decision Tree and KNN in terms of accuracy, precision, recall and F-measure.
Abstract: Sentiment mining is a field of text mining to determine the attitude of people about a particular product, topic, politician in newsgroup posts, review sites, comments on Facebook posts, Twitter, etc. There are many issues involved in opinion mining. One important issue is that opinions could be in different languages (English, Urdu, Arabic, etc.). To tackle each language according to its orientation is a challenging task. Most of the research work in sentiment mining has been done in English language. Currently, limited research is being carried out on sentiment classification of other languages like Arabic, Italian, Urdu and Hindi. In this paper, three classification models are used for text classification using Waikato Environment for Knowledge Analysis (WEKA). Opinions written in Roman-Urdu and English are extracted from a blog. These extracted opinions are documented in text files to prepare a training dataset containing 150 positive and 150 negative opinions, as labeled examples. Testing data set is supplied to three different models and the results in each case are analyzed. The results show that Naïve Bayesian outperformed Decision Tree and KNN in terms of accuracy, precision, recall and F-measure.

118 citations


Cites methods from "Mining of Massive Datasets: Data Mi..."

  • ...It is used to assign weights to the terms which have relative importance in a corpus (Rajaraman and Ullman, 2011)....

    [...]

Journal Article
TL;DR: This paper aims to provide a comprehensive review of Big Data literature of the last 4 years, to identify the main challenges, areas of application, tools and emergent trends of Big Data.
Abstract: Big Data has become a very popular term. It refers to the enormous amount of structured, semi-structured and unstructured data that are exponentially generated by high-performance applications in many domains: biochemistry, genetics, molecular biology, physics, astronomy, business, to mention a few. Since the literature of Big Data has increased significantly in recent years, it becomes necessary to develop an overview of the state-of-the-art in Big Data. This paper aims to provide a comprehensive review of Big Data literature of the last 4 years, to identify the main challenges, areas of application, tools and emergent trends of Big Data. To meet this objective, we have analyzed and classified 457 papers concerning Big Data. This review gives relevant information to practitioners and researchers about the main trends in research and application of Big Data in different technical domains, as well as a reference overview of Big Data tools.

117 citations

Proceedings Article
26 Apr 2014
TL;DR: This paper introduces the notion of concept evolution, the changing nature of a person's underlying concept which can result in inconsistent labels and thus be detrimental to machine learning, and introduces two structured labeling solutions.
Abstract: Labeling data is a seemingly simple task required for training many machine learning systems, but is actually fraught with problems. This paper introduces the notion of concept evolution, the changing nature of a person's underlying concept (the abstract notion of the target class a person is labeling for, e.g., spam email, travel related web pages) which can result in inconsistent labels and thus be detrimental to machine learning. We introduce two structured labeling solutions, a novel technique we propose for helping people define and refine their concept in a consistent manner as they label. Through a series of five experiments, including a controlled lab study, we illustrate the impact and dynamics of concept evolution in practice and show that structured labeling helps people label more consistently in the presence of concept evolution than traditional labeling.

112 citations