ORE Open Research Exeter
TITLE
Distributed feature selection for efficient economic big data analysis
AUTHORS
Zhao, L; Chen, Z; Hu, Y; et al.
JOURNAL
IEEE Transactions on Big Data
DEPOSITED IN ORE
14 February 2017
This version available at
http://hdl.handle.net/10871/25841
COPYRIGHT AND REUSE
Open Research Exeter makes this work available in accordance with publisher policies.
A NOTE ON VERSIONS
The version presented here may differ from the published version. If citing, you are advised to consult the published version for pagination, volume/issue and date of
publication

JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014
Distributed Feature Selection for Efficient
Economic Big Data Analysis
Liang Zhao, Zhikui Chen, Senior Member, IEEE, Yueming Hu, Geyong Min, Senior Member, IEEE,
and Zhaohua Jiang
Abstract—With the rapidly increasing popularity of economic activities, a large amount of economic data is being collected. Although such data offers great opportunities for economic analysis, its low quality, high dimensionality, and huge volume pose great challenges to the efficient analysis of economic big data. Existing methods have primarily analyzed economic data from the perspective of econometrics, which involves a limited number of indicators and demands the prior knowledge of economists. When embracing large varieties of economic factors, these methods tend to yield unsatisfactory performance. To address these challenges, this paper presents a new framework for the efficient analysis of high-dimensional economic big data based on innovative distributed feature selection. Specifically, the framework combines economic feature selection and econometric model construction to reveal the hidden patterns of economic development. Its functionality rests on three pillars: (i) novel data pre-processing techniques to prepare high-quality economic data, (ii) an innovative distributed feature identification solution to locate important and representative economic indicators in multidimensional data sets, and (iii) new econometric models to capture the hidden patterns of economic development. Experimental results on economic data collected in Dalian, China, demonstrate that the proposed framework and methods deliver superior performance in analyzing enormous economic data.
Index Terms—feature selection, big data, subtractive clustering, collaborative theory, economy, urbanization
1 INTRODUCTION
Big data, a term often defined around four V's (Volume, Velocity, Variety, and Veracity), has attracted much interest for solving social and economic problems, with the anticipation of more efficient organizations and decision-making [1]. For example, the World Economic Forum claimed in 2012 that big data had significant potential and would provide new opportunities for international development [2]. The White House also published a white paper in May 2014, stating that big data offered a marvelous opportunity for the economy, people's health and education, national security, and energy efficiency of the United States [3]. However, merely having massive data is inadequate, because our interest is focused on the valuable information buried in the mass, which is usually characterized by 'Value' rather than the four V's [37]. Therefore, to support social and economic development, the key is to capture the valuable information, meanings, and insights hidden in big data.
Liang Zhao is with the School of Software Technology, Dalian University of Technology, Dalian 116600, China. E-mail: matthew1988zhao@mail.dlut.edu.cn.
Zhikui Chen is with the School of Software Technology, Dalian University of Technology, and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian 116600, China. E-mail: zkchen@dlut.edu.cn.
Yueming Hu is with the College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642, China. E-mail: ymhu163@163.com.
Geyong Min is with the College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, U.K. E-mail: g.min@exeter.ac.uk.
Zhaohua Jiang is with the School of Public Administration and Law, Dalian University of Technology, Dalian 116024, China. E-mail: jiang zhaohua@163.com.
Manuscript received April 19, 2005; revised September 17, 2014.

With the increasing popularity of economic activities,
a large number of factors and records are involved in economic development. At present, the volume of data held by many financial institutions in China exceeds 100 TB. Meanwhile, a bank produces an average of about 820 GB of data for every 1 million dollars in revenue. In addition, electronic commerce and other economic activities constantly produce enormous amounts of data for economic analysis. For example, during Alibaba's Double Eleven shopping festival in 2014, a total of 240 million Internet users visited Taobao, making the trading volume peak at 2.85 million in one minute. The total turnover reached 57.1 billion yuan, resulting in 278.5 million package deliveries. While all of these provide sufficient information for economic analysis, the issues of dimension and volume overload pose great challenges: (1) the collected huge-volume data usually contains incomplete, incorrect, and nonstandard items, which are difficult to process; (2) the high dimensionality of economic indicators makes manual factor selection for economic model construction impossible; (3) statistical analysis software (e.g., Statistical Product and Service Solutions, SPSS) often generates runtime errors when dealing with high-dimensional, huge-volume economic data. Hence, it is necessary to provide an efficient way to extract the useful features contained in the massive data. The extracted features can then be used to identify valuable information through economic model analysis. Such a valuable-information extraction process calls for novel economic big data analysis frameworks and advanced mining techniques.
Unfortunately, there are few intelligent schemes that can be used to gain actionable knowledge and valuable insights from the large amount of economic data. For economic development, most of the existing methods are

involved with econometric analysis [4-6], including the basic element method, the cost saving method, the elements and internal associations method, and the retarded economy method. They exploit econometric models, such as the cointegration model [7], regression model [8], semi-parametric model [9], hypothesis model [10], and hybrid models, to quantitatively analyze the relations between response indicators and economic development, so that the effects of these indicators on economic development can be obtained. However, most existing methods identify the response factors related to economic development based on past experience and directly embed them into a production function to build correlations with economic growth, overlooking the indirect effects caused by other related factors. Besides, the existing methods rely too much on the knowledge of economists and embrace limited indicators and records for analysis, without fully considering the intrinsic characteristics of high-dimensional economic data. Therefore, they cannot effectively reveal the impacts of response indicators on economic development.
To address these challenges, we explore the hidden relations between the economy and its response indicators from a new angle, extracting meaningful knowledge from economic big data to derive the right insights and conclusions. Our approach is an innovative distributed feature selection framework that integrates advanced feature selection techniques with econometric methods. First, in order to reduce noise and improve data quality, we propose usability preprocessing, relative annual price computation, growth rate computation, and normalization techniques to clean and transform the collected economic big data. Then, to distill the features related to economic development from high-dimensional economic data, distributed feature selection methods are proposed to quickly partition the given economic indicators by importance. After that, the relations between response indicators and economic growth are established by conducting correlative and collaborative analysis. Our main contributions are summarized as follows:
• We present a novel framework combining distributed feature selection methods and econometric models for efficient economic analysis, which can reveal valuable insights from low-quality, high-dimensional, huge-volume economic big data.
• We develop a subtractive clustering based feature selection algorithm and an attribute coordination based clustering algorithm to select and identify the important features of the data both horizontally and vertically. We also extend these two methods to a distributed platform for economic big data analysis.
• We conduct correlative and collaborative analysis simultaneously to explore the direct and indirect relations between the economy and its response indicators based on the identified economic features.
• We evaluate the proposed framework and algorithms on economic development data from Dalian, a fast-developing city in China, covering the past 30 years. Extensive experiments and analysis demonstrate that the designed framework and algorithms can distill the hidden patterns of economic development efficiently, and the achieved results accord with the actual development situation of Dalian city.
The rest of this paper is organized as follows. Section 2 reviews related work on feature selection and econometric analysis methods. Section 3 formulates the problem to be addressed and introduces our proposed framework for economic big data analysis. The subtractive clustering based feature selection method and the attribute coordination based clustering method, as well as their parallel versions, are described in Section 4. Section 5 presents the process of constructing economic models and demonstrates the efficiency of the proposed methods through a case study. Section 6 concludes the paper and outlines future work.
2 RELATED WORK
This section reviews related works on feature selection and
econometric methods.
2.1 The feature selection methods
Feature selection aims to process multidimensional data by
detecting the relevant features and discarding the irrelevant
ones. Effective feature selection can lead to reduction of
measurement costs yet generate a better understanding of
the original domain [11, 12, 30, 31, 33]. With respect to different selection strategies, feature selection algorithms can be categorized into four groups, namely the filter, wrapper, embedded, and hybrid methods.
The filter methods perform feature selection independently of any classifier, evaluating the relevance of a feature by studying the characteristics of the training data using certain statistical criteria. The correlation-based feature
selection [13], consistency-based filter [14], information gain
[15], relief [16], fisher score [17], and minimum redundancy
maximum relevance [18] are the most representative filter
techniques.
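As a minimal single-machine illustration of the filter idea (not one of the cited algorithms), features might be ranked by a simple statistical criterion such as absolute Pearson correlation with a target variable; all names and data below are illustrative:

```python
import numpy as np

def correlation_filter(X, y, k):
    """Toy filter-style selector: score each feature by the absolute
    Pearson correlation with the target and keep the top k. No
    classifier is involved, which is what makes it a filter method."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[::-1][:k]

# Toy data: features 0 and 2 track the target, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     rng.normal(size=200),
                     -y + 0.1 * rng.normal(size=200)])
selected = correlation_filter(X, y, k=2)
print(sorted(selected.tolist()))
```

Because scoring touches each feature once, filters of this kind scale linearly in the number of features, which is the source of the low computational complexity noted below.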
The wrapper methods integrate a classifier, such as SVM
[21], KNN [25], and LDA [12], to select a set of features
that have the most discriminative power. Representative
wrapper feature selection methods include: wrapperC4.5 [19], wrapperSVM, FSSEM [20], and ℓ1-SVM [21]. Other
examples of the wrapper method could be any combination
of a preferred search strategy and given classifiers.
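A toy sketch of the wrapper idea, assuming greedy forward search wrapped around a 1-nearest-neighbour classifier scored by leave-one-out accuracy (the data and parameter choices are invented for illustration, not taken from the cited methods):

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point may not vote for itself
    return np.mean(y[d.argmin(axis=1)] == y)

def forward_selection(X, y, k):
    """Greedy wrapper: grow the feature set one feature at a time,
    keeping whichever addition gives the best classifier accuracy."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < k:
        best = max(remaining,
                   key=lambda f: loo_1nn_accuracy(X[:, chosen + [f]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: only features 0 and 1 carry class information.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.1 * rng.normal(size=100),
                     y + 0.1 * rng.normal(size=100),
                     rng.normal(size=100),
                     rng.normal(size=100)])
picked = forward_selection(X, y, k=2)
print(sorted(picked))
```

Each candidate feature triggers a full classifier evaluation, which illustrates why wrappers are accurate but expensive, as summarized below.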
The embedded methods perform feature selection in the
process of training and achieve model fitting to a given
learning mechanism simultaneously. For example, SVM-
RFE [22] trains an SVM classifier on the current features of the given data set and iteratively removes the least important features indicated by the SVM to achieve feature selection.
Other embedded methods include FS-P [23], BlogReg and
SBMLR [24].
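The elimination loop behind SVM-RFE can be sketched as follows; for brevity this example substitutes a least-squares linear model for the SVM, so it illustrates the recursive-elimination mechanism rather than reproducing [22] exactly:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination in the spirit of SVM-RFE:
    repeatedly fit a linear model and drop the feature whose weight
    has the smallest magnitude. A least-squares fit stands in for
    the SVM here to keep the sketch dependency-free."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        active.pop(int(np.argmin(np.abs(w))))   # remove least important feature
    return active

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# The response depends strongly on features 1 and 3 only.
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)
kept = rfe(X, y, n_keep=2)
print(kept)
```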
In summary, the filter methods, being independent of any classifier, have lower computational complexity than wrapper methods while retaining favorable generalization ability. The wrapper methods are superior to filters in terms of classification accuracy, but they take more time due to their expensive computation. The embedded methods, with lower computational cost than wrappers, are also integrated with classifiers, which leads to a risk of over-fitting.

Due to the shortcomings of each method, hybrid methods [26, 27, 29] have been proposed to bridge the gaps between them. However, the existing feature selection methods cannot be directly adapted to economic analysis. Since they analyze data through its inherent knowledge characteristics, they cannot identify the feature cointegration and intrinsic associations between economic indicators. Besides, the low-quality and huge-volume characteristics of economic big data present great challenges when existing feature selection methods are directly applied to inductive analysis.
2.2 The econometric methods
Econometric analysis, based on economic theory and data,
uses mathematical and statistical methods to study the
quantitative relations and rules of the economy [4, 5]. The existing econometric studies on economic development and its response factors address the following aspects:
First, basic elements are applied to describe the mech-
anism of economic growth. The economic growth can be
promoted by increasing consumption and investment, as
well as affecting related decisive factors. When approaching
economic analysis, the contributing factors are selected to
identify the relations between them and economic develop-
ment. Second, from the perspective of cost saving, urban-
ization can bring more workers into the city, which reduces economic costs and boosts facility sharing to cut down transaction costs. Meanwhile, through agglomeration and diffusion effects, economic growth can be accelerated. Third, elements and internal associations are involved to
comprehensively explain the correlations between economy
and its decisive factors. For example, Brant integrates two
aggregate production function models, one with urbaniza-
tion as a shift factor and the other that combines energy
consumption and physical capital, to estimate the internal
relevance among urbanization, energy consumption, and
economic growth [6]. In addition, some researchers propose retarded economy theory to examine the restraining factors for economic development.
Moreover, a large body of quantitative studies concentrates on this topic [6-10], including cointegration analysis, regression analysis, semi-parametric methods, hypothesis methods, and hybrid methods. Sajal et al. apply a threshold cointegration method to examine the cointegrating relationship between energy consumption, urbanization, and economic activity in India [7]. In [8], the authors use
a regression model, that allows the relationship between
finance and economic growth to be piecewise linear, based
on the concept of threshold effects to reveal the effects of
finance on economic growth. Using data on developing economies, the semi-parametric method can estimate the potentially nonlinear effects of inflation on economic
growth [9]. Moreover, in [10], the hypothesis is established
that variation in migratory distance has a long-lasting effect
on genetic diversity and the pattern of economic devel-
opment. Based on this, the effects of genetic diversity on
economic development can be obtained by applying regression analysis.
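The threshold-effects regression described for [8] can be illustrated with a hedged sketch: grid-search a candidate threshold, fit separate least-squares lines on each regime, and keep the threshold with the lowest total squared error (the data and threshold grid below are invented for illustration):

```python
import numpy as np

def fit_threshold_regression(x, y, candidates):
    """Piecewise-linear (threshold) regression: for each candidate
    threshold, fit separate least-squares lines on the two regimes
    and keep the threshold with the smallest total squared error."""
    best = (np.inf, None)
    for tau in candidates:
        lo, hi = x <= tau, x > tau
        if lo.sum() < 2 or hi.sum() < 2:
            continue                      # skip degenerate regimes
        sse = 0.0
        for mask in (lo, hi):
            A = np.column_stack([x[mask], np.ones(mask.sum())])
            coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
            sse += float(np.sum((A @ coef - y[mask]) ** 2))
        if sse < best[0]:
            best = (sse, tau)
    return best[1]

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
# Slope 2 below the true threshold 5, slope 0.5 above it.
y = np.where(x <= 5, 2 * x, 10 + 0.5 * (x - 5)) + 0.1 * rng.normal(size=400)
tau_hat = fit_threshold_regression(x, y, candidates=np.linspace(1, 9, 81))
print(round(tau_hat, 1))
```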
Although all the methods mentioned above can shed
light on the patterns of economic development, they rely
too much on the past experience and the knowledge of
economists. Besides, they involve limited indicators and
records for analysis, which will yield unsatisfactory results
when approaching high-dimensional economic data.
3 DISTRIBUTED ECONOMIC BIG DATA ANALYSIS
In this section, we define the problem statement of economic big data analysis, and then present a framework based on distributed feature selection.
3.1 Problem statement
The increasing number of economy-related activities provides a wide range of indicators and records for economic analysis. Facing such a large amount of data, how to detect useful information in it has drawn extensive attention in academia and industry. Traditional econometric methods cannot embrace high-dimensional data since they only involve limited
economic factors for model construction based on past ex-
periences. For example, some economists analyze economic
development from the perspective of industrial structure.
They select three indicators, namely the added value of
primary industry, secondary industry and tertiary industry,
to establish the production function for predicting GDP
growth. Obviously, the obtained result is not persuasive
because many other indicators also have impact on the econ-
omy. Besides, the existing statistical analysis software (e.g.
SPSS) would generate runtime errors when dealing with
the high-dimensionality and huge-volume economic data.
While some methods are able to process the massive data,
their computation costs are expensive [26-28]. Therefore, we
aim to provide an efficient way to bridge the gap between
data analysis methods and economic big data in the real world. Specifically, our approach consists of two major tasks.
Task 1: Feature Selection. Let A = {a_1, a_2, ..., a_m} be a corpus of m economic indicators. Among these m indicators, there are m' features more relevant to economic development than the others, and they can be grouped into k clusters according to their internal relevances. We aim to select the m' features and partition them into k groups {c_1, c_2, ..., c_k} with the representative features as centroids.
Task 2: Econometric Model Construction. For each cluster c_i, we aim to conduct correlative analysis between the representative feature and the other related ones to generate a relational model. By combining all the models through collaborative analysis, we can establish the economic prediction model.
Economic big data analysis is important and challenging in many ways. In the next subsection, we present a novel framework combining distributed feature selection and econometric analysis to achieve the task of predicting economic development.
3.2 The framework of economic big data analysis
Our proposed framework consists of three phases: (1) Economic Data Preprocess, (2) Economic Feature Selection, and (3) Economic Model Construction, as shown in Fig. 1. Specifically, to speed up the process of data analysis, the Economic Data Preprocess and Economic Feature Selection phases are deployed on a distributed platform [36].

[Fig. 1. The proposed framework for economic big data analysis. It includes three components: (1) Economic Data Preprocess; (2) Economic Feature Selection; and (3) Economic Model Construction.]
Economic Data Preprocess. The raw data contains the most important information. However, it is difficult to mine useful information from the mass because it is mixed with incomplete, incorrect, and nonstandard items. Thus, methods that can improve data quality should be developed for economic big data analysis. We propose to exploit noise elimination [28] and missing value imputation [32] to enhance data usability. Due to inflation or deflation, the currency prices corresponding to economic indicators in different years cannot be compared directly. In this paper, we project the economic data onto the same domain as the 2012 baseline data using the corresponding price indexes, so that data from different years can be processed fairly. As a rule of thumb, the growth rates of economic indicators reflect economic development better than their raw forms. Hence, we compute the relative growth rate of each numerical indicator from one year to the previous year. Moreover, to avoid the influence of absolute values on the analytic results, min-max normalization is applied to all numerical attributes to unify their values into the same metric space.
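The growth-rate and normalization steps can be sketched as follows; the input series is hypothetical, and the noise elimination and price deflation steps are omitted for brevity:

```python
import numpy as np

def preprocess(values):
    """Two of the transformations described above, applied to one
    numerical indicator measured over consecutive years:
    (1) year-over-year relative growth rate,
    (2) min-max normalization of the growth rates into [0, 1]."""
    v = np.asarray(values, dtype=float)
    growth = (v[1:] - v[:-1]) / v[:-1]          # relative growth rate per year
    lo, hi = growth.min(), growth.max()
    return (growth - lo) / (hi - lo)            # min-max normalization

# Hypothetical GDP-like series, already deflated to a common base year.
normalized = preprocess([100.0, 110.0, 121.0, 127.05, 139.755])
print(normalized.round(3))
```

The series grows 10% in every year except the third (5%), so after min-max normalization the third growth rate maps to 0 and the others to 1.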
Economic Feature Selection. The preprocessed data obtained from the first phase is still unsuitable for econometric analysis due to its high dimensionality. Therefore, it is essential to select the representative economic indicators, together with their related important ones, for econometric model construction. To tackle this problem, we propose a two-stage distributed subtractive clustering based feature selection method. First, the important attributes that are more relevant to economic development are selected by horizontal distributed subtractive clustering. Second, by applying the improved attribute coordination based distributed subtractive clustering to the selected attributes vertically, we obtain the representative attributes.
Economic Model Construction. With the combination of the selected indicators, we can construct the economic prediction models. However, a weakness of most traditional econometric methods for constructing models is that they take no account of the indirect relations between response indicators and economic factors. For example, many existing methods combine the representative factors with urbanization to establish relational models between urbanization and economic development [6, 7], thereby ignoring the indirect effects of urbanization on the important factors related to the representative ones. Hence, in this work we integrate correlative and collaborative analysis to construct novel economic models.
In sum, our proposed framework outperforms the existing econometric methods for economic big data analysis. Economic big data usually has the characteristics of low quality, high dimensionality, and huge volume, which pose great challenges to existing econometric methods. To tackle these problems, we propose a three-layer model that embraces all related data for efficient economic analysis. First, the low-quality, huge-volume economic data is cleaned to improve its usability and transformed to be consistent with economic rules. After that, the attributes that can represent the high-dimensional, huge-volume economic data are selected by the distributed feature selection method, which fully considers the relationships among attributes and reduces the influence of past experience on indicator selection for economic analysis. Finally, correlative and collaborative analysis are applied to distill the direct and indirect correlations among the selected indicators and thus construct the distinctive economic models.
4 A DISTRIBUTED FEATURE SELECTION MODEL
This paper aims to reduce the potentially huge set of candidate attributes produced by the preprocessing layer to a small set of attributes that are diverse and yet similar to the attributes in the original data set. However, there is no universal method for all problem settings, so we design a novel, systematic attribute selection approach for economic analysis. The objectives of such an approach are twofold: (i) parallel subtractive clustering is generalized to select important attributes, and (ii) attribute coordination based parallel clustering is designed to identify representative ones. Thus, we can make full use of the representative factors and their related important factors to mine the direct and indirect effects on economic development.
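A minimal, single-machine sketch of the subtractive clustering underlying objective (i) is given below; the radii ra and rb are the conventional parameters of Chiu-style subtractive clustering, and the values used here are assumptions rather than the paper's settings:

```python
import numpy as np

def subtractive_clustering(X, n_centers, ra=1.0, rb=1.5):
    """Chiu-style subtractive clustering: every point gets a density
    score from its neighbours; the densest point becomes a cluster
    center, and the densities of points close to that center are then
    reduced so that they are not chosen as later centers."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    density = np.exp(-d2 / (ra / 2) ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        c = int(np.argmax(density))
        centers.append(c)
        # subtract density in proportion to closeness to the new center
        density = density - density[c] * np.exp(-d2[:, c] / (rb / 2) ** 2)
    return centers

# Two well-separated toy blobs: one center should come from each.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)),
               rng.normal(5, 0.1, size=(30, 2))])
centers = subtractive_clustering(X, n_centers=2)
print(sorted(i // 30 for i in centers))
```

This sketch is centralized; the paper applies the procedure horizontally over records and vertically over attributes on a distributed platform.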
4.1 Important attribute selection
For economic analysis, some records may be related to
other records and some indicators can be represented by the
combination of other indicators. Therefore, by approaching

References (selected)
Data Mining: Practical Machine Learning Tools and Techniques (book).
Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.
Correlation-based Feature Selection for Machine Learning (Mark Hall).
Frequently Asked Questions (14)
Q1. What are the main methods used for economic analysis?

They exploit econometric models, such as cointegration model [7], regression model [8], semi-parametric model [9], hypothesis model [10] and hybrid model, to quantitatively analyze the relations between response indicators and economic development. 

In the future work, the authors plan to establish a platform of algorithm library based on the proposed framework.

After the reduction, the data point with the highest remaining density is selected as the second cluster center, and the density of each data point is further reduced according to its distance to the second cluster center.

The importance of the a-th attribute for selecting the k-th representative economic record can be defined as

    I(a)_k = Σ_{i=1}^{n} I(i, a)_k.    (7)

The attributes with higher ranking values contain more cluster information than others; namely, they have powerful impacts on the analysis of typical economic phenomena.

Moreover, in order to avoid the points near the first cluster center being selected as the centers of other clusters, an amount of density, determined by each point's distance from the first cluster center, is subtracted from that point.