ORE Open Research Exeter
TITLE
Distributed feature selection for efficient economic big data analysis
AUTHORS
Zhao, L; Chen, Z; Hu, Y; et al.
JOURNAL
IEEE Transactions on Big Data
DEPOSITED IN ORE
14 February 2017
This version available at
http://hdl.handle.net/10871/25841
COPYRIGHT AND REUSE
Open Research Exeter makes this work available in accordance with publisher policies.
A NOTE ON VERSIONS
The version presented here may differ from the published version. If citing, you are advised to consult the published version for pagination, volume/issue and date of
publication

JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014
Distributed Feature Selection for Efficient
Economic Big Data Analysis
Liang Zhao, Zhikui Chen, Senior Member, IEEE, Yueming Hu, Geyong Min, Senior Member, IEEE,
and Zhaohua Jiang
Abstract—With the rapidly increasing popularity of economic activities, a large amount of economic data is being collected. Although such data offers great opportunities for economic analysis, its low quality, high dimensionality, and huge volume pose great challenges to the efficient analysis of economic big data. Existing methods have primarily analyzed economic data from the perspective of econometrics, which involves a limited number of indicators and demands the prior knowledge of economists. When embracing large varieties of economic factors, these methods tend to yield unsatisfactory performance. To address these challenges, this paper presents a new framework for the efficient analysis of high-dimensional economic big data based on innovative distributed feature selection. Specifically, the framework combines economic feature selection and econometric model construction to reveal the hidden patterns of economic development. Its functionality rests on three pillars: (i) novel data pre-processing techniques to prepare high-quality economic data, (ii) an innovative distributed feature identification solution to locate important and representative economic indicators in multidimensional data sets, and (iii) new econometric models to capture the hidden patterns of economic development. Experimental results on economic data collected in Dalian, China, demonstrate that the proposed framework and methods deliver superior performance in analyzing enormous economic data.
Index Terms—feature selection, big data, subtractive clustering, collaborative theory, economy, urbanization
1 INTRODUCTION
Big data, a term often defined around four V's (Volume, Velocity, Variety, and Veracity), has attracted much interest for solving social and economic problems, with the anticipation of more efficient organizations and decision-making [1]. For example, the World Economic Forum claimed in 2012 that big data had significant potential and would provide new opportunities for international development [2]. The White House also published a white paper in May 2014, stating that big data offered a marvelous opportunity for the economy, people's health and education, national security, and energy efficiency of the United States [3]. However, merely having massive data is inadequate, because our interest is focused on the valuable information buried in the mass, which is usually characterized by 'Value' rather than the four V's [37]. Therefore, to support social and economic development, the key is to capture the valuable information, meanings, and insights hidden in big data.
Liang Zhao is with the School of Software Technology, Dalian University of Technology, Dalian 116600, China. E-mail: matthew1988zhao@mail.dlut.edu.cn.
Zhikui Chen is with the School of Software Technology, Dalian University of Technology, and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian 116600, China. E-mail: zkchen@dlut.edu.cn.
Yueming Hu is with the College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642, China. E-mail: ymhu163@163.com.
Geyong Min is with the College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, U.K. E-mail: g.min@exeter.ac.uk.
Zhaohua Jiang is with the School of Public Administration and Law, Dalian University of Technology, Dalian 116024, China. E-mail: jiang zhaohua@163.com.
Manuscript received April 19, 2005; revised September 17, 2014.

With the increasing popularity of economic activities,
a large number of factors and records are involved in economic development. At present, the volume of data held by many financial institutions in China exceeds 100 TB. Meanwhile, a bank produces an average of about 820 GB of data for every 1 million dollars in revenue. In addition, electronic commerce and other economic activities constantly produce enormous amounts of data for economic analysis. For example, during Alibaba's Double Eleven shopping festival in 2014, a total of 240 million Internet users visited Taobao, making the trading volume peak at 2.85 million in one minute. The total turnover reached 57.1 billion yuan, resulting in 278.5 million package deliveries. While all of these provide sufficient information for economic analysis, the issues of dimension and volume overload pose great challenges: (1) the collected huge-volume data usually contains incomplete, incorrect, and nonstandard items, which are difficult to process; (2) the high dimensionality of economic indicators makes manual factor selection for economic model construction impossible; (3) statistical analysis software (e.g., Statistical Product and Service Solutions, SPSS) often generates runtime errors when dealing with high-dimensional, huge-volume economic data. Hence, it is necessary to provide an efficient way to extract the useful features contained in the massive data. The extracted features can then be used to identify valuable information through economic model analysis. Such a valuable-information extraction process calls for novel economic big data analysis frameworks and advanced mining techniques.
Unfortunately, there are few intelligent schemes that can be used to gain actionable knowledge and valuable insights from the large amount of economic data. For economic development, most of the existing methods are

involved with econometric analysis [4-6], including the basic element method, the cost saving method, the elements and internal associations method, and the retarded economy method. They exploit econometric models, such as the cointegration model [7], regression model [8], semi-parametric model [9], hypothesis model [10], and hybrid models, to quantitatively analyze the relations between response indicators and economic development, so that the effects of these indicators on economic development can be obtained. However, most existing methods identify the response factors related to economic development based on past experience and directly embed them into a production function to build correlations with economic growth, overlooking the indirect effects caused by other related factors. Besides, the existing methods rely too much on the knowledge of economists and embrace limited indicators and records for analysis, without fully considering the intrinsic characteristics of high-dimensional economic data. Therefore, they cannot effectively reveal the impacts of response indicators on economic development.
To address these challenges, we explore the hidden relations between the economy and its response indicators from a new angle, extracting meaningful knowledge from economic big data to derive the right insights and conclusions. Our approach is an innovative distributed feature selection framework that integrates advanced feature selection techniques with econometric methods. First, in order to reduce noise and improve data quality, we propose usability preprocessing, relative annual price computation, growth rate computation, and normalization techniques to clean and transform the collected economic big data. Then, to distill the features related to economic development from high-dimensional economic data, distributed feature selection methods are proposed to quickly partition the given economic indicators by importance. After that, the relations between response indicators and economic growth are established by conducting correlative and collaborative analysis. Our main contributions are summarized as follows:
• We present a novel framework combining distributed feature selection methods and econometric models for efficient economic analysis, which can reveal valuable insights from low-quality, high-dimensional, huge-volume economic big data.
• We develop a subtractive clustering based feature selection algorithm and an attribute coordination based clustering algorithm to select and identify the important features of the data both horizontally and vertically. We also extend these two methods to a distributed platform for economic big data analysis.
• We conduct correlative and collaborative analysis simultaneously to explore the direct and indirect relations between the economy and its response indicators based on the identified economic features.
• We evaluate the proposed framework and algorithms on economic development data from Dalian, a fast-developing city in China, covering the past 30 years. Extensive experiments and analysis demonstrate that the designed framework and algorithms can distill the hidden patterns of economic development efficiently, and the achieved results accord with the actual development situation of Dalian city.
The rest of this paper is organized as follows. Section 2 reviews related work on feature selection and econometric analysis methods. Section 3 formulates the problem to be addressed and introduces our proposed framework for economic big data analysis. The subtractive clustering based feature selection method and the attribute coordination based clustering method, as well as their parallel versions, are described in Section 4. Section 5 presents the process of constructing economic models and demonstrates the efficiency of the proposed methods through a case study. Section 6 concludes the paper and outlines future work.
2 RELATED WORK
This section reviews related works on feature selection and
econometric methods.
2.1 The feature selection methods
Feature selection aims to process multidimensional data by
detecting the relevant features and discarding the irrelevant
ones. Effective feature selection can lead to reduction of
measurement costs yet generate a better understanding of
the original domain [11, 12, 30, 31, 33]. With respect to different selection strategies, feature selection algorithms can be categorized into four groups, namely the filter, wrapper, embedded, and hybrid methods.
The filter methods perform feature selection independently of any classifier, evaluating the relevance of a feature by studying the characteristics of the training data using certain statistical criteria. The correlation-based feature
selection [13], consistency-based filter [14], information gain
[15], relief [16], fisher score [17], and minimum redundancy
maximum relevance [18] are the most representative filter
techniques.
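As a minimal single-machine illustration of the filter idea (not one of the cited algorithms), features might be ranked by a simple statistical criterion such as absolute Pearson correlation with a target variable; all names and data below are illustrative:

```python
import numpy as np

def correlation_filter(X, y, k):
    """Toy filter-style selector: score each feature by the absolute
    Pearson correlation with the target and keep the top k. No
    classifier is involved, which is what makes it a filter method."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[::-1][:k]

# Toy data: features 0 and 2 track the target, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     rng.normal(size=200),
                     -y + 0.1 * rng.normal(size=200)])
selected = correlation_filter(X, y, k=2)
print(sorted(selected.tolist()))
```

Because scoring touches each feature once, filters of this kind scale linearly in the number of features, which is the source of the low computational complexity noted below.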
The wrapper methods integrate a classifier, such as SVM
[21], KNN [25], and LDA [12], to select a set of features
that have the most discriminative power. Representative
wrapper feature selection methods include: wrapperC4.5 [19], wrapperSVM, FSSEM [20], and ℓ1-SVM [21]. Other
examples of the wrapper method could be any combination
of a preferred search strategy and given classifiers.
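A toy sketch of the wrapper idea, assuming greedy forward search wrapped around a 1-nearest-neighbour classifier scored by leave-one-out accuracy (the data and parameter choices are invented for illustration, not taken from the cited methods):

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point may not vote for itself
    return np.mean(y[d.argmin(axis=1)] == y)

def forward_selection(X, y, k):
    """Greedy wrapper: grow the feature set one feature at a time,
    keeping whichever addition gives the best classifier accuracy."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < k:
        best = max(remaining,
                   key=lambda f: loo_1nn_accuracy(X[:, chosen + [f]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: only features 0 and 1 carry class information.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.1 * rng.normal(size=100),
                     y + 0.1 * rng.normal(size=100),
                     rng.normal(size=100),
                     rng.normal(size=100)])
picked = forward_selection(X, y, k=2)
print(sorted(picked))
```

Each candidate feature triggers a full classifier evaluation, which illustrates why wrappers are accurate but expensive, as summarized below.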
The embedded methods perform feature selection in the
process of training and achieve model fitting to a given
learning mechanism simultaneously. For example, SVM-
RFE [22] trains an SVM classifier on the current features of the given data set and iteratively removes the least important features indicated by the SVM to achieve feature selection.
Other embedded methods include FS-P [23], BlogReg and
SBMLR [24].
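The elimination loop behind SVM-RFE can be sketched as follows; for brevity this example substitutes a least-squares linear model for the SVM, so it illustrates the recursive-elimination mechanism rather than reproducing [22] exactly:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination in the spirit of SVM-RFE:
    repeatedly fit a linear model and drop the feature whose weight
    has the smallest magnitude. A least-squares fit stands in for
    the SVM here to keep the sketch dependency-free."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        active.pop(int(np.argmin(np.abs(w))))   # remove least important feature
    return active

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# The response depends strongly on features 1 and 3 only.
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)
kept = rfe(X, y, n_keep=2)
print(kept)
```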
In summary, the filter methods, being independent of any classifier, have lower computational complexity than wrapper methods while retaining favorable generalization ability. The wrapper methods are superior to filters in terms of classification accuracy, but they take more time due to their expensive computation. The embedded methods, with lower computational cost than wrappers, are also integrated with classifiers, which leads to a risk of over-fitting.

Due to the shortcomings of each method, hybrid methods [26, 27, 29] have been proposed to bridge the gaps between them. However, the existing feature selection methods cannot be directly adapted to economic analysis. Since they analyze data through its inherent knowledge characteristics, they cannot identify the feature cointegration and intrinsic associations between economic indicators. Besides, the low-quality and huge-volume characteristics of economic big data present great challenges when existing feature selection methods are directly applied to inductive analysis.
2.2 The econometric methods
Econometric analysis, based on economic theory and data,
uses mathematical and statistical methods to study the
quantitative relations and rules of the economy [4, 5]. The existing econometric studies on economic development and its response factors address the following aspects:
First, basic elements are applied to describe the mech-
anism of economic growth. The economic growth can be
promoted by increasing consumption and investment, as
well as affecting related decisive factors. When approaching
economic analysis, the contributing factors are selected to
identify the relations between them and economic develop-
ment. Second, from the perspective of cost saving, urban-
ization can bring more workers into the city, which reduces economic costs and boosts facility sharing to cut down transaction costs. Meanwhile, through agglomeration and diffusion effects, economic growth can be accelerated. Third, elements and internal associations are involved to
comprehensively explain the correlations between economy
and its decisive factors. For example, Brant integrates two
aggregate production function models, one with urbaniza-
tion as a shift factor and the other that combines energy
consumption and physical capital, to estimate the internal
relevance among urbanization, energy consumption, and
economic growth [6]. In addition, some researchers propose retarded economy theory to examine the restraining factors for economic development.
Moreover, a large body of quantitative studies concentrates on this topic [6-10], including cointegration analysis, regression analysis, semi-parametric methods, hypothesis methods, and hybrid methods. Sajal et al. apply a threshold cointegration method to examine the cointegrating relationship between energy consumption, urbanization, and economic activity in India [7]. In [8], the authors use
a regression model, that allows the relationship between
finance and economic growth to be piecewise linear, based
on the concept of threshold effects to reveal the effects of
finance on economic growth. Using data on developing economies, the semi-parametric method can estimate the potentially nonlinear effects of inflation on economic
growth [9]. Moreover, in [10], the hypothesis is established
that variation in migratory distance has a long-lasting effect
on genetic diversity and the pattern of economic devel-
opment. Based on this, the effects of genetic diversity on
economic development can be obtained by applying regression analysis.
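The threshold-effects regression described for [8] can be illustrated with a hedged sketch: grid-search a candidate threshold, fit separate least-squares lines on each regime, and keep the threshold with the lowest total squared error (the data and threshold grid below are invented for illustration):

```python
import numpy as np

def fit_threshold_regression(x, y, candidates):
    """Piecewise-linear (threshold) regression: for each candidate
    threshold, fit separate least-squares lines on the two regimes
    and keep the threshold with the smallest total squared error."""
    best = (np.inf, None)
    for tau in candidates:
        lo, hi = x <= tau, x > tau
        if lo.sum() < 2 or hi.sum() < 2:
            continue                      # skip degenerate regimes
        sse = 0.0
        for mask in (lo, hi):
            A = np.column_stack([x[mask], np.ones(mask.sum())])
            coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
            sse += float(np.sum((A @ coef - y[mask]) ** 2))
        if sse < best[0]:
            best = (sse, tau)
    return best[1]

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
# Slope 2 below the true threshold 5, slope 0.5 above it.
y = np.where(x <= 5, 2 * x, 10 + 0.5 * (x - 5)) + 0.1 * rng.normal(size=400)
tau_hat = fit_threshold_regression(x, y, candidates=np.linspace(1, 9, 81))
print(round(tau_hat, 1))
```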
Although all the methods mentioned above can shed
light on the patterns of economic development, they rely
too much on the past experience and the knowledge of
economists. Besides, they involve limited indicators and
records for analysis, which will yield unsatisfactory results
when approaching high-dimensional economic data.
3 DISTRIBUTED ECONOMIC BIG DATA ANALYSIS
In this section, we define the problem statement of economic big data analysis, and then present a framework based on distributed feature selection.
3.1 Problem statement
The increasing number of economy-related activities provides a wide range of indicators and records for economic analysis. Facing such a large amount of data, how to detect useful information in it has drawn extensive attention in academia and industry. Traditional econometric methods cannot embrace high-dimensional data since they only involve limited
economic factors for model construction based on past ex-
periences. For example, some economists analyze economic
development from the perspective of industrial structure.
They select three indicators, namely the added value of
primary industry, secondary industry and tertiary industry,
to establish the production function for predicting GDP
growth. Obviously, the obtained result is not persuasive
because many other indicators also have impact on the econ-
omy. Besides, the existing statistical analysis software (e.g.
SPSS) would generate runtime errors when dealing with
the high-dimensionality and huge-volume economic data.
While some methods are able to process the massive data,
their computation costs are expensive [26-28]. Therefore, we
aim to provide an efficient way to bridge the gap between
data analysis methods and economic big data in the real world. Specifically, our approach consists of two major tasks.
Task 1: Feature Selection. Let A = {a_1, a_2, ..., a_m} be a corpus of m economic indicators. Among these m indicators, there are m' features more relevant to economic development than the others, and they can be grouped into k clusters according to their internal relevances. We aim to select the m' features and partition them into k groups {c_1, c_2, ..., c_k} with the representative features as centroids.
Task 2: Econometric Model Construction. For each cluster c_i, we aim to conduct correlative analysis between the representative feature and the other related ones to generate a relational model. By combining all the models through collaborative analysis, we can establish the economic prediction model.
Economic big data analysis is important and challenging in many ways. In the next subsection, we present a novel framework combining distributed feature selection and econometric analysis to achieve the task of predicting economic development.
3.2 The framework of economic big data analysis
Our proposed framework consists of three phases: (1) Economic Data Preprocess, (2) Economic Feature Selection, and (3) Economic Model Construction, as shown in Fig. 1. Specifically, to speed up the process of data analysis, the Economic Data Preprocess and Economic Feature Selection phases are deployed on a distributed platform [36].

[Fig. 1. The proposed framework for economic big data analysis. It includes three components: (1) Economic Data Preprocess; (2) Economic Feature Selection; and (3) Economic Model Construction.]
Economic Data Preprocess. The raw data contains the most important information. However, it is difficult to mine useful information from the mass because it is mixed with incomplete, incorrect, and nonstandard items. Thus, methods that can improve data quality should be developed for economic big data analysis. We propose to exploit noise elimination [28] and missing value imputation [32] to enhance data usability. Due to inflation or deflation, the currency prices corresponding to economic indicators in different years cannot be compared directly. In this paper, we project the economic data onto the same domain as the 2012 baseline data using the corresponding price indexes, so that data from different years can be processed fairly. As a rule of thumb, the growth rates of economic indicators reflect economic development better than their raw forms. Hence, we compute the relative growth rate of each numerical indicator from one year to the previous year. Moreover, to avoid the influence of absolute values on the analytic results, min-max normalization is applied to all numerical attributes to unify their values into the same metric space.
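The growth-rate and normalization steps can be sketched as follows; the input series is hypothetical, and the noise elimination and price deflation steps are omitted for brevity:

```python
import numpy as np

def preprocess(values):
    """Two of the transformations described above, applied to one
    numerical indicator measured over consecutive years:
    (1) year-over-year relative growth rate,
    (2) min-max normalization of the growth rates into [0, 1]."""
    v = np.asarray(values, dtype=float)
    growth = (v[1:] - v[:-1]) / v[:-1]          # relative growth rate per year
    lo, hi = growth.min(), growth.max()
    return (growth - lo) / (hi - lo)            # min-max normalization

# Hypothetical GDP-like series, already deflated to a common base year.
normalized = preprocess([100.0, 110.0, 121.0, 127.05, 139.755])
print(normalized.round(3))
```

The series grows 10% in every year except the third (5%), so after min-max normalization the third growth rate maps to 0 and the others to 1.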
Economic Feature Selection. The preprocessed data obtained from the first phase is still unsuitable for econometric analysis due to its high dimensionality. Therefore, it is essential to select the representative economic indicators, together with their related important ones, for econometric model construction. To tackle this problem, we propose a two-stage distributed subtractive clustering based feature selection method. First, the important attributes that are more relevant to economic development are selected by horizontal distributed subtractive clustering. Second, by applying the improved attribute coordination based distributed subtractive clustering to the selected attributes vertically, we obtain the representative attributes.
Economic Model Construction. With the combination of the selected indicators, we can construct the economic prediction models. However, a weakness of most traditional econometric methods for constructing models is that they take no account of the indirect relations between response indicators and economic factors. For example, many existing methods combine the representative factors with urbanization to establish relational models between urbanization and economic development [6, 7], thereby ignoring the indirect effects of urbanization on the important factors related to the representative ones. Hence, in this work we integrate correlative and collaborative analysis to construct novel economic models.
In sum, our proposed framework outperforms the existing econometric methods for economic big data analysis. Economic big data usually has the characteristics of low quality, high dimensionality, and huge volume, which pose great challenges to existing econometric methods. To tackle these problems, we propose a three-layer model that embraces all related data for efficient economic analysis. First, the low-quality, huge-volume economic data is cleaned to improve its usability and transformed to be consistent with economic rules. After that, the attributes that can represent the high-dimensional, huge-volume economic data are selected by the distributed feature selection method, which fully considers the relationships among attributes and reduces the influence of past experience on indicator selection for economic analysis. Finally, correlative and collaborative analysis are applied to distill the direct and indirect correlations among the selected indicators and thus construct the distinctive economic models.
4 A DISTRIBUTED FEATURE SELECTION MODEL
This paper aims to reduce the potentially huge set of candidate attributes produced by the preprocessing layer to a small set of attributes that are diverse and yet similar to the attributes in the original data set. However, there is no universal method for all problem settings, so we design a novel, systematic attribute selection approach for economic analysis. The objectives of such an approach are twofold: (i) parallel subtractive clustering is generalized to select important attributes, and (ii) attribute coordination based parallel clustering is designed to identify representative ones. Thus, we can make full use of the representative factors and their related important factors to mine the direct and indirect effects on economic development.
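A minimal, single-machine sketch of the subtractive clustering underlying objective (i) is given below; the radii ra and rb are the conventional parameters of Chiu-style subtractive clustering, and the values used here are assumptions rather than the paper's settings:

```python
import numpy as np

def subtractive_clustering(X, n_centers, ra=1.0, rb=1.5):
    """Chiu-style subtractive clustering: every point gets a density
    score from its neighbours; the densest point becomes a cluster
    center, and the densities of points close to that center are then
    reduced so that they are not chosen as later centers."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    density = np.exp(-d2 / (ra / 2) ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        c = int(np.argmax(density))
        centers.append(c)
        # subtract density in proportion to closeness to the new center
        density = density - density[c] * np.exp(-d2[:, c] / (rb / 2) ** 2)
    return centers

# Two well-separated toy blobs: one center should come from each.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)),
               rng.normal(5, 0.1, size=(30, 2))])
centers = subtractive_clustering(X, n_centers=2)
print(sorted(i // 30 for i in centers))
```

This sketch is centralized; the paper applies the procedure horizontally over records and vertically over attributes on a distributed platform.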
4.1 Important attribute selection
For economic analysis, some records may be related to
other records and some indicators can be represented by the
combination of other indicators. Therefore, by approaching

References (selected)
Data Mining: Practical Machine Learning Tools and Techniques (book).
Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.
Correlation-based Feature Selection for Machine Learning (Mark Hall).
Frequently Asked Questions (14)
Q1. What are the main methods used for economic analysis?

They exploit econometric models, such as cointegration model [7], regression model [8], semi-parametric model [9], hypothesis model [10] and hybrid model, to quantitatively analyze the relations between response indicators and economic development. 

In the future work, the authors plan to establish a platform of algorithm library based on the proposed framework.

After the reduction, the data point with the highest remaining density is selected as the second cluster center, and the density of each data point is further reduced according to its distance to the second cluster center.

The importance of the a-th attribute for selecting the k-th representative economic record can be defined as

    I(a)_k = Σ_{i=1}^{n} I(i, a)_k.    (7)

The attributes with higher ranking values contain more cluster information than others; namely, they have powerful impacts on the analysis of typical economic phenomena.

Moreover, in order to avoid the points near the first cluster center being selected as the centers of other clusters, an amount of density, determined by each point's distance from the first cluster center, is subtracted from that point.