Showing papers in "Annals of Data Science in 2015"

PDF

Open Access

Journal Article•DOI•

A Comprehensive Survey of Clustering Algorithms

[...]

Dongkuan Xu¹, Yingjie Tian¹•Institutions (1)

12 Aug 2015-Annals of Data Science

TL;DR: This review paper begins at the definition of clustering, takes the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, into consideration, and analyzes the clustered algorithms from two perspectives, the traditional ones and the modern ones.

...read moreread less

Abstract: Data analysis is used as a common method in modern science research, which is across communication science, computer science and biology science. Clustering, as the basic composition of data analysis, plays a significant role. On one hand, many tools for cluster analysis have been created, along with the information increase and subject intersection. On the other hand, each clustering algorithm has its own strengths and weaknesses, due to the complexity of information. In this review paper, we begin at the definition of clustering, take the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, into consideration, and analyze the clustering algorithms from two perspectives, the traditional ones and the modern ones. All the discussed clustering algorithms will be compared in detail and comprehensively shown in Appendix Table 22.

...read moreread less

1,234 citations

Journal Article•DOI•

Forecasting with Big Data: A Review

[...]

Hossein Hassani¹, Emmanuel Sirimal Silva¹•Institutions (1)

Bournemouth University¹

10 Apr 2015-Annals of Data Science

TL;DR: The review finds that at present, the fields of Economics, Energy and Population Dynamics have been the major exploiters of Big Data forecasting whilst Factor models, Bayesian models and Neural Networks are the most common tools adopted for forecasting with Big Data.

...read moreread less

Abstract: Big Data is a revolutionary phenomenon which is one of the most frequently discussed topics in the modern age, and is expected to remain so in the foreseeable future. In this paper we present a comprehensive review on the use of Big Data for forecasting by identifying and reviewing the problems, potential, challenges and most importantly the related applications. Skills, hardware and software, algorithm architecture, statistical significance, the signal to noise ratio and the nature of Big Data itself are identified as the major challenges which are hindering the process of obtaining meaningful forecasts from Big Data. The review finds that at present, the fields of Economics, Energy and Population Dynamics have been the major exploiters of Big Data forecasting whilst Factor models, Bayesian models and Neural Networks are the most common tools adopted for forecasting with Big Data.

...read moreread less

111 citations

Journal Article•DOI•

Novel Approach for Network Traffic Pattern Analysis using Clustering-based Collective Anomaly Detection

[...]

Mohiuddin Ahmed¹, Abdun Naser Mahmood¹•Institutions (1)

University of New South Wales¹

14 May 2015-Annals of Data Science

TL;DR: This paper proposes a framework for collective anomaly detection using a partitional clustering technique to detect anomalies based on an empirical analysis of an attack’s characteristics and validates its approach by comparing its results with those from existing techniques using benchmark datasets.

...read moreread less

Abstract: There is increasing interest in the data mining and network management communities in improving existing techniques for the prompt analysis of underlying traffic patterns. Anomaly detection is one such technique for detecting abnormalities in many different domains, such as computer network intrusion, gene expression analysis, financial fraud detection and many more. Clustering is a useful unsupervised method for both identifying underlying patterns in data and anomaly detection. However, existing clustering-based techniques have high false alarm rates and consider only individual data instances for anomaly detection. Interestingly, there are traffic flows which seem legitimate but are targeted at disrupting a normal computing environment, such as the Denial of Service (DoS) attack. The presence of such anomalous data instances explains the poor performances of existing clustering-based anomaly detection techniques. In this paper, we formulate the problem of detecting DoS attacks as a collective anomaly which is a pattern in the data when a group of similar data instances behave anomalously with respect to the entire dataset. We propose a framework for collective anomaly detection using a partitional clustering technique to detect anomalies based on an empirical analysis of an attack’s characteristics. We validate our approach by comparing its results with those from existing techniques using benchmark datasets.

...read moreread less

53 citations

Journal Article•DOI•

Exploring Big Data Analysis: Fundamental Scientific Problems

[...]

Zongben Xu¹, Yong Shi²•Institutions (2)

Xi'an Jiaotong University¹, Chinese Academy of Sciences²

01 Dec 2015-Annals of Data Science

TL;DR: The paper outlines six open research problems on Big Data, and reports some advances on current Big Data research, particularly in high-dimensional data and non-structured data processing.

...read moreread less

Abstract: Although Big Data has been one of most popular topics since last several years, how to effectively conduct Big Data analysis is a big challenge for every field. This paper tries to address some fundamental scientific problems in Big Data analysis, such as opportunities, challenges, and difficulties encountered in the analysis. The challenges rise from multiple domains that include how Management Science influences data acquisition and data management, Information Science for data access and processing, Mathematics and Statistics for data understanding and Engineering for data applications. The paper outlines six open research problems on Big Data. It also reports some advances on current Big Data research, particularly in high-dimensional data and non-structured data processing. Finally, remarks on how to develop a Big Data algorithm are provided.

...read moreread less

35 citations

Journal Article•DOI•

Preface: Some Advanced Techniques in Data Science

[...]

Yong Shi, Yingjie Tian

01 Mar 2015-Annals of Data Science

22 citations

Journal Article•DOI•

Estimation of Stress-Strength Reliability for the Generalized Pareto Distribution Based on Progressively Censored Samples

[...]

Sadegh Rezaei¹, R. Alizadeh Noughabi¹, Saralees Nadarajah²•Institutions (2)

Amirkabir University of Technology¹, University of Manchester²

03 Apr 2015-Annals of Data Science

TL;DR: In this paper, the estimation of stress-strength reliability parameter, R = P\left(Y < X \right) \), based on progressively type II censored samples when stress, strength are two independent generalized Pareto random variables.

...read moreread less

Abstract: This paper deals with the estimation of stress-strength reliability parameter, $R = P\left( Y < X \right) $, based on progressively type II censored samples when stress, strength are two independent generalized Pareto random variables. The maximum likelihood estimators, their asymptotic distributions, asymptotic confidence intervals, bootstrap based confidence intervals and Bayes estimators are derived for $R$. Using Monte Carlo simulations, the MSE, Bayes risk estimators, credible sets and coverage probabilities are computed and compared.

...read moreread less

21 citations

Journal Article•DOI•

The Information Content of OVX for Crude Oil Returns Analysis and Risk Measurement: Evidence from the Kalman Filter Model

[...]

Yanhui Chen¹, Kaijian He², Kaijian He³, Lean Yu²•Institutions (3)

Shanghai Maritime University¹, Beijing University of Chemical Technology², City University of Hong Kong³

17 Dec 2015-Annals of Data Science

TL;DR: In this paper, the authors analyzed the dynamic relationship between OVX changes and future crude oil price returns with time-varying coefficients, modeled using the Kalman filter, in the regression models.

...read moreread less

Abstract: Crude oil volatility index (OVX) is a new index published by Chicago Board Option Exchange since 2007. In recent years it emerged as an important alternative measure to track and analyze the volatility of future oil prices. In this paper we firstly model and analyze the dynamic relationship between OVX changes and future crude oil price returns with time-varying coefficients, modeled using the Kalman filter, in the regression models. Empirical results show a weak negative relationship between OVX changes and future crude oil price returns movement, and extremely high/low levels of OVX cannot predict future positive/negative returns well. Secondly, this paper explores whether OVX can predict future realized volatility of crude oil price returns . The empirical findings suggest that OVX serves as an unbiased but not an efficient estimate of the future realized volatility and it includes information of the future realized volatility. Finally the incorporation of information of OVX in measuring market risk is analyzed. The empirical result indicates that Kalman filter based model provides the improved performance than the linear regression model in terms of forecasting accuracy for realized volatility prediction and the reliability for VaR estimate.

...read moreread less

16 citations

Journal Article•DOI•

Individual Differences in the Order/Chaos Balance of the Brain Self-Organization

[...]

Hernán Díaz¹, Fernando Maureira², Elías Cohen, Felisa M. Córdova¹, Fredi Edgardo Palominos¹, Jaime Otárola¹, Lucio Cañete¹ - Show less +3 more•Institutions (2)

University of Santiago, Chile¹, Universidad SEK²

10 Dec 2015-Annals of Data Science

TL;DR: Findings include consistencies in the way intra- and inter-individual differences express themselves through the EEG time series data analysis, and some degree of specificity and specialization in the frontal, temporal and occipital locations as well as brain interhemispheric cross-talk interaction modulating the chaos/no-chaos balance in the brain.

...read moreread less

Abstract: We used fractal geometry and fractal dimension introductory argumentation as a framework to start understanding dynamical and complex biological systems to then introduce Hurst exponent estimation of chaos/no-chaos balance trend to explore the phenomenology and the information content of EEG data through time. We searched for measure proxy dynamical variables as potential biomarkers and/or endo-phenotypes that help us to figure out the multidimensionality and different time-scale of simultaneous and crossed functional phenomena that manifests in the brain during executing any challenging task. We found consistencies in the way intra- and inter-individual differences express themselves through the EEG time series data analysis, and some degree of specificity and specialization in the frontal, temporal and occipital locations as well as brain interhemispheric cross-talk interaction modulating the chaos/no-chaos balance in the brain, during a projective process of imaging a dancing choreography. We recorded the brain activity of $$\text {N}=9$$ professional dancers while executing the instruction of to imagine (by mean of a typical projective visualization) a future dancing performance as part of the requirement for to approve a specialization modern dance course and workshop (Kosmos In Movement, 2015).

...read moreread less

13 citations

Journal Article•DOI•

Feature Selection for Multi-Class Imbalanced Data Sets Based on Genetic Algorithm

[...]

Li-min Du¹, Li-min Du², Yang Xu¹, Hua Zhu¹•Institutions (2)

Southwest Jiaotong University¹, Henan University²

21 Dec 2015-Annals of Data Science

TL;DR: This method improves the fitness function through using the evaluation criterion EG-mean instead of the global classification accuracy in order to choose the features which are favorable to recognize the minor classes for multi-class imbalanced data sets.

...read moreread less

Abstract: This paper presents an improved genetic algorithm based feature selection method for multi-class imbalanced data. This method improves the fitness function through using the evaluation criterion EG-mean instead of the global classification accuracy in order to choose the features which are favorable to recognize the minor classes. The method is evaluated using several benchmark data sets, and the experimental results show that, compared with the traditional feature selection method based on genetic algorithm, the proposed method has certain advantages in the size of feature subsets and improves the precision of the minor classes for multi-class imbalanced data sets.

...read moreread less

13 citations

Journal Article•DOI•

On the Estimation for the Weibull Distribution

[...]

M. Alizadeh, Sadegh Rezaei¹, S. F. Bagheri²•Institutions (2)

Amirkabir University of Technology¹, Islamic Azad University²

31 Oct 2015-Annals of Data Science

TL;DR: In this paper, the authors consider the estimation of the pdf and the CDF of the Weibull distribution using the following estimators: uniformly minimum variance unbiased, maximum likelihood (ML), percentile, least squares and weight least squares.

...read moreread less

Abstract: Here, we consider estimation of the pdf and the CDF of the Weibull distribution. The following estimators are considered: uniformly minimum variance unbiased, maximum likelihood (ML), percentile, least squares and weight least squares. Analytical expressions are derived for the bias and the mean squared error. Simulation studies and real data applications show that the ML estimator performs better than others.

...read moreread less

12 citations

Journal Article•DOI•

Bayesian Nonparametric Approaches to Abnormality Detection in Video Surveillance

[...]

Vu Nguyen¹, Dinh Phung¹, Duc-Son Pham², Svetha Venkatesh¹•Institutions (2)

Deakin University¹, Curtin University²

07 Apr 2015-Annals of Data Science

TL;DR: This paper revisits the abnormality detection problem through the lens of Bayesian nonparametric (BNP) and develops a novel usage of BNP methods for this problem, employing the Infinite Hidden Markov Model and Bayesian Nonparametric Factor Analysis for stream data segmentation and pattern discovery.

...read moreread less

Abstract: In data science, anomaly detection is the process of identifying the items, events or observations which do not conform to expected patterns in a dataset. As widely acknowledged in the computer vision community and security management, discovering suspicious events is the key issue for abnormal detection in video surveillance. The important steps in identifying such events include stream data segmentation and hidden patterns discovery. However, the crucial challenge in stream data segmentation and hidden patterns discovery are the number of coherent segments in surveillance stream and the number of traffic patterns are unknown and hard to specify. Therefore, in this paper we revisit the abnormality detection problem through the lens of Bayesian nonparametric (BNP) and develop a novel usage of BNP methods for this problem. In particular, we employ the Infinite Hidden Markov Model and Bayesian Nonparametric Factor Analysis for stream data segmentation and pattern discovery. In addition, we introduce an interactive system allowing users to inspect and browse suspicious events.

...read moreread less

Journal Article•DOI•

Testing Exponentiality Based on the Likelihood Ratio and Power Comparison

[...]

Hadi Alizadeh Noughabi¹•Institutions (1)

University of Birjand¹

05 Sep 2015-Annals of Data Science

TL;DR: In this article, some powerful tests for exponentiality based on the likelihood ratio are proposed and the critical points of the test statistics are obtained by Monte Carlo simulations and the power values of the proposed tests are computed against a wide variety of alternative hypotheses and then these values are compared with the power value of the recent published exponentiality tests.

...read moreread less

Abstract: The exponential distribution is one of the fundamental lifetime models and is widely used for describing a failure mechanism of a system. Different applications of this distribution in survival analysis and reliability theory can be found in statistical literature. In this article, some powerful tests for exponentiality based on the likelihood ratio are proposed. The critical points of the test statistics are obtained by Monte Carlo simulations. The power values of the proposed tests are computed against a wide variety of alternative hypotheses and then these values are compared with the power values of the recent published exponentiality tests. It is shown that these tests have a reasonable power for various kinds of departures from exponentiality. For illustrative purpose, real examples are finally presented.

...read moreread less

Journal Article•DOI•

A Neighborhood-based Matrix Factorization Technique for Recommendation

[...]

Meng-jiao Guo¹, Jin-guang Sun¹, Xiang-fu Meng¹•Institutions (1)

Liaoning Technical University¹

18 Dec 2015-Annals of Data Science

TL;DR: This paper incorporates the coupling relationship analysis to capture the under-discovered relationships between items and proposes a neighborhood-based matrix factorization model, which considers both the explicit and implicit correlations between items, to suggest the more reasonable items to user.

...read moreread less

Abstract: The data sparsity and prediction quality are recognized as the key challenges in the existing recommender Systems. Most of the existing recommender systems depend on collaborating flitering (CF) method which mainly leverages the user-item rating matrix representing the relationship between users and items. However, the CF-based method sometimes fails to provide accurate information for predicting recommendations as there is an assumption that the relationship between attributes of items is independent and identically distributed. In real applications, there are often several kinds of coupling relationships or connections existed among users or items. In this paper, we incorporate the coupling relationship analysis to capture the under-discovered relationships between items and aim to make the ratings more reasonable. Next, we propose a neighborhood-based matrix factorization model, which considers both the explicit and implicit correlations between items, to suggest the more reasonable items to user. The experimental evaluations demonstrate that the proposed algorithms outperform the state-of-the-art algorithms in the warm- and cold-start settings.

...read moreread less

Journal Article•DOI•

Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources

[...]

Ekaterina Chernyak¹, Boris Mirkin¹•Institutions (1)

National Research University – Higher School of Economics¹

02 Apr 2015-Annals of Data Science

TL;DR: The main tool in this is a naturally defined string-to-text relevance score, based on annotated suffix trees, used at several tasks: cleaning the Wikipedia tree or page set of noise; allocating Wikipedia categories to taxonomy topics; deciding whether an allocated category should be included as a child to the taxonomy topic, etc.

...read moreread less

Abstract: A step-by-step approach to taxonomy construction is presented. On the first step, the upper layer frame of taxonomy is built manually according to educational materials. On the next steps, the frame is refined at a chosen topic using the Wikipedia category tree and articles, both cleaned of noise. Our main tool in this is a naturally defined string-to-text relevance score, based on annotated suffix trees. The relevance scoring is used at several tasks: (1) cleaning the Wikipedia tree or page set of noise; (2) allocating Wikipedia categories to taxonomy topics; (3) deciding whether an allocated category should be included as a child to the taxonomy topic, etc. The resulting fragment of taxonomy consists of three parts: the manually set upper layer topic, the adopted part of the Wikipedia category tree and Wikipedia articles as leaves. Every leaf is assigned a set of so-called descriptors; these are phrases explaining aspects of the leaf topic. The method is illustrated by its application to two domains in the area of Mathematics: (a) “Probability theory and mathematical statistics”, (b) “Numerical mathematics” (both in Russian).

...read moreread less

Journal Article•DOI•

An Efficient Variable Selection Method for Predictive Discriminant Analysis

[...]

A. Iduseri¹, J. E. Osemwenkhae¹•Institutions (1)

University of Benin¹

19 Dec 2015-Annals of Data Science

TL;DR: This paper proposes an efficient variable selection method for obtaining a subset of predictors that will be superior to all other subsets from the same historical sample, with a significantly less computational expense.

...read moreread less

Abstract: Seeking a subset of relevant predictor variables for use in predictive model construction in order to simplify the model, obtain shorter training time, as well as enhance generalization by reducing overfitting is a common preprocessing step prior to training a predictive model. In predictive discriminant analysis, the use of classic variable selection methods as a preprocessing step, may lead to “good” overall correct classification within the confusion matrix. However, in most cases, the obtained best subset of predictor variables are not superior (both in terms of the number and combination of the predictor variables, as well as the hit rate obtained when used as training sample) to all other subsets from the same historical sample. Hence the obtained maximum hit rate of the obtained predictive discriminant function is often not optimal even for the training sample that gave birth to it. This paper proposes an efficient variable selection method for obtaining a subset of predictors that will be superior to all other subsets from the same historical sample. In application to real life datasets, the obtained predictive function using our proposed method achieved an actual hit rate that was essentially equal to that of the all-possible-subset method, with a significantly less computational expense.

...read moreread less

Journal Article•DOI•

Trinomial Response Modeling in One Logit Regression

[...]

Stan Lipovetsky¹•Institutions (1)

GfK¹

30 Jul 2015-Annals of Data Science

TL;DR: In this article, a triple ordinal outcome model in one binary logistic regression is proposed, where the binary dependent variable is kept equal one for both positive-negative outcomes and equals zero for neutral outcomes, respectively.

...read moreread less

Abstract: Constructing a triple ordinal outcome model in one binary logistic regression is proposed. Various applied problems can be formulated with response variable of three ordinal categorical levels of negative–neutral–positive kind. Such a response is commonly considered in two ordinal logistic models or in two multinomial shares of three possible outcomes. The current work shows that the problem can be presented in a much simpler and convenient one binomial logistic regression model, so in one probability scale. This approach is based on a special data transformation used in the Best–Worst scaling or MaxDiff modeling, when the positive-neutral data subset is stacked with the negative–neutral subset and in the latter one the predictors’ signs are changed to opposite. The binary dependent variable is kept equal one for both positive–negative outcomes and equals zero for neutral outcomes, respectively. In the obtained one logit regression, the positive category predictions are closer to 1, negative—closer to 0, and neutral are in the middle of its continuous 0–1 scale. Theoretical features and practical application of the model are discussed.

...read moreread less

Journal Article•DOI•

Informational Energy and Its Application in Testing Normality

[...]

Hadi Alizadeh Noughabi¹, Majid Chahkandi¹•Institutions (1)

University of Birjand¹

14 Dec 2015-Annals of Data Science

TL;DR: In this article, a test of fit for normality based on the estimated Informational Energy and using m-step spacings is proposed, and critical values and power values of the test against various alternatives are calculated.

...read moreread less

Abstract: In this article, we propose a test of fit for normality based on the estimated Informational Energy and using m-step spacings. Consistency of the test statistic is established. Critical values and power values of the test against various alternatives are calculated. Finally, the power values of the proposed test are compared with the power values of some prominent normality tests.

...read moreread less

Journal Article•DOI•

Oriental Thinking and Fuzzy Logic, Celebration of the 50th Anniversary of Fuzzy Sets

[...]

Peizhuang Wang

01 Sep 2015-Annals of Data Science

TL;DR: There were five main topics in the conference: Fuzzy information processing and engineering; Internet and big data applications; Factor space and factorial neural networks; Information granulation and granular computing; Extenics and innovation methods.

...read moreread less

Abstract: In August 17–20, 2015, during the spring up of big data tide, we have held an international conference on Oriental Thinking and Fuzzy Logic at Dalian, China to celebrate the 50th anniversary of Fuzzy Sets. The honorary chair for this conference was, of course, the founder of fuzzy sets theory, Prof. L. A. Zadeh. All colleagues at the conference expressed their deep respects to him. Every Chinese scholar who has been met him remembers by his modesty and enthusiasm. To guide the information revolution, he has constructed a great bridge between qualitative and quantitative. There were five main topics in the conference: Fuzzy information processing and engineering; Internet and big data applications; Factor space and factorial neural networks; Information granulation and granular computing; Extenics and innovation methods. Here, Extenics is a new field of the disciplinary, which was initiated by Prof. Wen Cai to achieve innovation facing a problem seems impossible to be possible. The theory of factor space was initiated by me with the oriental thinking. There were 15 plenary talks in the conference as follows:Wen Cai, Fuzzy logic and Extenics; Yixiang Chen, Inter-definability and application of fuzzy logic operators; I. Dzitac, Fuzzy logic and artificial intelligence; Jiali Feng, Theory of meta-synthetic wisdom based on fusion of qualitative, quantitative and imagery operations; Jiafa Gu, System science and Chinese medicine; He Ouyang, A mathematical foundation for factor spaces; Qing He, Uncertainty Learning; Congfu Huang, An approach checking whether an intelligent internet can improve intelligence; Deyi Li, Cognitive physics; Zengliang Liu, Factorial neural networks; W. Pedrycz, New frontiers of computing

...read moreread less

Journal Article•DOI•

Segmentation of Chinese Urban Real Estate Market: A Demand-Supply Distribution Perspective

[...]

Jichang Dong¹, Xiuting Li¹, Wencong Li¹, Zhi Dong¹•Institutions (1)

Chinese Academy of Sciences¹

17 Dec 2015-Annals of Data Science

TL;DR: Wang et al. as discussed by the authors proposed a new perspective on the analysis of the regional features of real estate market and explored a more reliable segmentation method based on the optimization of supply-demand resource distribution.

...read moreread less

Abstract: This study proposed a new perspective on the analysis of the regional features of real estate market and explored a more reliable segmentation method for Chinese urban real estate market based on the optimization of supply-demand resource distribution. A two-stage clustering procedure is proposed based on supply and demand elements and market performance respectively. And six clustering algorithms were used to divide 283 Chinese cities at the prefecture level or above into three clusters and 13 sub-clusters, which are identified as key regulatory region, stable development region and region that needs policy support. Differentiated regulatory policy suggestions are accordingly provided for each cluster.

...read moreread less

Journal Article•DOI•

The Research on the Application of Qualitative Mapping in MapReduce

[...]

Jiali Feng¹, Guanglin Xu², Xiaolin Xu³•Institutions (3)

Shanghai Maritime University¹, Shanghai Lixin University of Commerce², Shanghai Second Polytechnic University³

18 Dec 2015-Annals of Data Science

TL;DR: The solution procedure of qualitative mapping can be a new way of transforming data for MapReduce, and two examples are given to illustrate how to use qualitative mapping model to transforming semi-structured or unstructured data.

...read moreread less

Abstract: MapReduce is a mathematical tool handling the large-scale data sets through paralleling and distributive calculation. Currently the operations of MapReduce mainly include sorting, grouping and joining, etc. This paper undertakes a research on qualitative mapping and MapReduce, and finds that the solution procedure of qualitative mapping can be a new way of transforming data for MapReduce. Two examples are given to illustrate how to use qualitative mapping model to transforming semi-structured or unstructured data.

...read moreread less

Journal Article•DOI•

Entropy Estimation Using Numerical Methods

[...]

Hadi Alizadeh Noughabi¹•Institutions (1)

University of Birjand¹

16 Oct 2015-Annals of Data Science

TL;DR: In this article, the authors used some numerical methods to estimate the entropy of a continuous random variable and then some estimators are introduced, and the results indicate that the proposed estimators have smaller mean squared error than other estimators.

...read moreread less

Abstract: Direct integration of the Riemann–Stieltjes integral has been used to computing convolution integrals. This approach has been established to be simple and accurate with good convergence property. In this paper, we used some numerical methods to estimation of entropy of a continuous random variable and then some estimators are introduced. Bounds on the error terms are derived for some direct Riemann–Stieltjes integration methods. Consistency of estimators is proved and by simulation, the proposed estimators are compared with some prominent estimators, namely Correa (Commun Stat Theory Methods 24:2439–2449, 1995), Ebrahimi et al. (Stat Probab Lett 20:225–234, 1994), van Es (Scand J Stat 19:61–72, 1992) and Vasicek (J R Stat Soc B 38:54–59, 1976). The results indicate that the proposed estimators have smaller mean squared error than other estimators.

...read moreread less

Journal Article•DOI•

Mining Fuzzy Association Rules in the Framework of AFS Theory

[...]

Bo Wang¹, Xiao-dong Liu¹, Li-dong Wang¹•Institutions (1)

Dalian Maritime University¹

24 Dec 2015-Annals of Data Science

TL;DR: A new fuzzy association rule mining algorithm in the framework of AFS (Axiomatic Fuzzy Sets) theory, which has the advantage of its simplicity in implementation and mathematical beauty in fuzzy theory, and can be directly applied to extract fuzzy association rules in real data systems.

...read moreread less

Abstract: In this paper, firstly we study the representations and fuzzy logic operations for the fuzzy concepts in real data systems. Secondly, we propose a new fuzzy association rule mining algorithm in the framework of AFS (Axiomatic Fuzzy Sets) theory. Compared with the current algorithms, the advantage of proposed algorithm has two advantages. One is that the membership functions of the fuzzy sets representing the extracted rules and the fuzzy logic operations applied to extract fuzzy rules are determined by the distribution of the data, instead of the fuzzy sets defined by some special functions, t-norm, t-conorm, negation operator, implication operator and fuzzy similarity relation given in advance. The extracted fuzzy rules are interpretable and similar to human intuition. Another is that its simplicity in implementation and mathematical beauty in fuzzy theory, and can be directly applied to extract fuzzy association rules in real data systems. Finally, a well-known example Iris dataset is used to illustrate the effectiveness of the new algorithm based on the proposed degrees of implication. We obtained reclassification accuracy 98 %.

...read moreread less

Journal Article•DOI•

How to Measure Rhetorical Impact of Teaching and their Levels of Persuasion: A Neuro-rhetoric Approach

[...]

Lucio Cañete¹, Hernán Díaz¹, Felisa M. Córdova¹, Tania Soto¹, Eduardo A. Reinao¹, Fredi Edgardo Palominos¹ - Show less +2 more•Institutions (1)

University of Santiago, Chile¹

14 Dec 2015-Annals of Data Science

TL;DR: Additional and supplementary on-field results indicate that teachers who prefer affect Causalities, Purposes and Laws; are the most persuasive.

...read moreread less

Abstract: This paper explore the question about how persuasive is a person, a professor in our interest, depending on his/her rhetoric. Since persuasion is an act for amending the mind, a model to describe this intellectual entity in students consists of seven categories of elements in it: Quality, Quantity, Space, Time, Causality, Purpose and Law. According to the emphasis that the persuader places in each category of mental elements, he/she is classified in a rhetorical manner. The effect over the system of beliefs of a student caused by each rhetoric manner will be different depending on the way the new paradigm confront older or misunderstood previous classifications, organizations, and conceptualizations. Since the processing of a discourse occur mentally, not necessarily expressed by notorious disturbances in the behavior, we explore the way to measure the attentional impact of two lecturer: (i) a human being; and (ii) a computer machine, on the EEG activity of three listeners. We extracted the upper 25 % of EEG signal intensity for the whole frequency 0.1–64 Hz range, to proceed to analyze the graphical data extracted from the time/frequency/intensity EEG spectrograms. Additional and supplementary on-field results indicate that teachers who prefer affect Causalities, Purposes and Laws; are the most persuasive.

...read moreread less

Journal Article•DOI•

Transform Group of Monotonic Functions with the Same Monotonicity on [ $-$ 1, 1] and Operations of Fuzzy Numbers

[...]

Sicong Guo¹, Ying Zhao², Ying Zhao¹•Institutions (2)

Liaoning Technical University¹, Northeastern University²

18 Dec 2015-Annals of Data Science

TL;DR: This paper defines the transformation of monotonic bounded functions with same monotonicity on the symmetric interval [$$-$$-1, 1], and the four fundamental operations of fuzzy numbers based on the fuzzy structured element to start a new way for studying on the theory and application of fuzzy analysis.

...read moreread less

Abstract: Operations of fuzzy numbers are the main content of the fuzzy mathematical analysis. This paper defines the transformation of monotonic bounded functions with same monotonicity on the symmetric interval [$-$1, 1], and the four fundamental operations of fuzzy numbers based on the fuzzy structured element. It not only make operations of fuzzy numbers easier, but also start a new way for studying on the theory and application of fuzzy analysis.

...read moreread less

Journal Article•DOI•

Indebted Households Profiling: A Knowledge Discovery from Database Approach

[...]

Rodrigo Arnaldo Scarpel¹, Alexandros Ladas², Uwe Aickelin²•Institutions (2)

Instituto Tecnológico de Aeronáutica¹, University of Nottingham²

01 Apr 2015-Annals of Data Science

TL;DR: This work employed a knowledge discovery from database process to identify groups of indebted households and describe their profiles using a database collected by the Consumer Credit Counselling Service (CCCS) in the UK.

...read moreread less

Abstract: A major challenge in consumer credit risk portfolio management is to classify households according to their risk profile. In order to build such risk profiles it is necessary to employ an approach that analyses data systematically in order to detect important relationships, interactions, dependencies and associations amongst the available continuous and categorical variables altogether and accurately generate profiles of most interesting household segments according to their credit risk. The objective of this work is to employ a knowledge discovery from database process to identify groups of indebted households and describe their profiles using a database collected by the Consumer Credit Counselling Service (CCCS) in the UK. Employing a framework that allows the usage of both categorical and continuous data altogether to find hidden structures in unlabelled data it was established the ideal number of clusters and such clusters were described in order to identify the households who exhibit a high propensity of excessive debt levels.

...read moreread less

Journal Article•DOI•

A Fuzzy Trustworthiness System with Probability Presentation Based on Center-of-gravity Method

[...]

Yu-Bin Zhong¹, Zeng-liang Liu², Xuehai Yuan³•Institutions (3)

Guangzhou University¹, National Defence University, Pakistan², Dalian University of Technology³

01 Sep 2015-Annals of Data Science

TL;DR: The paper researches the fuzzy trustworthiness system and probability presentation theory based on bounded product implication and Larsen square implication and gives the sufficient conditions of universal approximations for those fuzzytrustworthiness systems.

...read moreread less

Abstract: Fuzzy methods are widely used in the study of trustworthiness. Based on this fact, the paper researches the fuzzy trustworthiness system and probability presentation theory based on bounded product implication and Larsen square implication. Firstly, we convert a group of single-input and single-output data into fuzzy inference rules and generate fuzzy relation by selecting the appropriate fuzzy implication operator, then calculate joint probability density function of two-dimensional random variables by using of this fuzzy relation. Two specific probability density functions can be obtained by selecting the fuzzy implication as bounded product implication or Larsen square implication. Secondly, we study the marginal distribution and numerical characteristics of these two kinds probability distributions and point out that two of these probability distributions have the same mathematical expectation and nearly the same variance and covariance. Finally, we study of the center-of-gravity fuzzy trustworthiness system based on these two probability distributions. We gave the sufficient conditions of universal approximations for those fuzzy trustworthiness systems.

...read moreread less

Journal Article•DOI•

Modular Real-Time Face Detection System

[...]

Kaiyu Wang¹, Zhiming Song², Menglin Sheng², Ping He², Zhenan Tang² - Show less +1 more•Institutions (2)

Dalian University of Technology¹, Harbin Institute of Technology²

23 Dec 2015-Annals of Data Science

TL;DR: In this article, the authors proposed an architecture of face detection based on FPGA and general purpose CPU, where the USB bus is used as the communication bridge between the two modules.

...read moreread less

Abstract: In this paper, a novel system architecture of face detection in possession of modular characteristic is proposed, and the corresponding face detection method is described, to match with the proposed architecture. First of all, the proposed architecture of face detection consists of two modules, namely, the coprocessor module of face detection based on FPGA and target system module, which hopes to implement finial face detection, based on general purpose CPU, and USB bus is used as the communication bridge between the two modules. Secondly, taking the characteristics of FPGA and general purpose CPU into consideration, face detection algorithm can be divided into two layers. The first layer of face detection algorithm based on skin color and eyes’ graylevel variation is implemented in the FPGA, and then the corresponding detection results and image are transmitted to the second module by USB bus so as to further detect face using the algorithm combining principle component analysis with support vector machine, which is referred to as the second layer of algorithm. Because the second layer of the algorithms are operations of float-point and loop, it implemented in the general purpose CPU. This architecture enables face detection to be implemented not only in high performance computing platform in possession of USB bus interface, but also in small terminal products and low-end embedded systems, where the performance of processor and the resource of hardware are limited. Actual testing results show that the proposed system architecture can implement real-time face detection for the images with 640 $$\times $$ 480 resolution, and the detection accuracy is about 89 %.

...read moreread less

Journal Article•DOI•

Identifying High-Number-Cluster Structures in RFID Ski Lift Gates Entrance Data

[...]

Boris Delibasic¹, Zoran Obradovic²•Institutions (2)

University of Belgrade¹, Temple University²

09 Jul 2015-Annals of Data Science

TL;DR: Three representative algorithms from three most widely used clustering algorithm families are utilized and produced to produce 40 algorithm settings for clustering skiing groups and showed to be very efficient in terms of finding the high-number-cluster structure (skiing groups) and for detecting models suitable for injury prevention.

...read moreread less

Abstract: In this paper we identify skier groups in data from RFID ski lift gates entrances. The ski lift gates' entrances are real-life data covering a 5-year period from the largest Serbian skiing resort with a 32,000 skier per hour ski lift capacity. We utilize three representative algorithms from three most widely used clustering algorithmfamilies(representative-based,hierarchical,anddensitybased)andproduce 40 algorithm settings for clustering skiing groups. Ski pass sales data was used to validate the produced clustering models. It was assumed that persons who bought ski tickets together are more likely to ski together. AMI and ARI clustering validation measures are reported for each model. In addition, the applicability of the proposed models was evaluated for ski injury prevention. Each clustering model was tested on whether skiing in groups increases risk of injury. Hierarchical clustering algorithms showedtobeveryefficientintermsoffindingthehigh-number-clusterstructure(skiing groups) and for detecting models suitable for injury prevention. Most of the tested clustering algorithms models supported the hypothesis that skiing in groups increases risk of injury.

...read moreread less

Journal Article•DOI•

Multilevel Modeling of the Progression of HIV/AIDS Disease Among Patients Under HAART Treatment

[...]

Awol Seid¹•Institutions (1)

Haramaya University¹

22 Oct 2015-Annals of Data Science

TL;DR: Gender, baseline clinical stage and functional status of the patient have a significant association with the progression of the disease and the selection of the random coefficients nonproportional odds model is chosen as the best model as it has the smallest DIC value.

...read moreread less

Abstract: Human immune deficiency virus results a noncurable disease acquired immuno deficiency syndrome (AIDS). After a person is infected with virus, the virus gradually destroys all the infection fighting cells called CD4 cells and makes the individual susceptible to opportunistic infections which cause severe or fatal health problems. The most effective treatment for the disease is the highly active antiretroviral therapy (HAART) which requires a lifelong commitment to adhere diligently to daily medications, dosing schedules and making frequent clinic visits. Several studies show that the CD4 cells count is the most determinant indicator of the effectiveness of the treatment or progression of the disease. The objective of this paper is to investigate the progression of the disease over time among patients under HAART treatment. Two main approaches of the generalized multilevel ordinal models; namely the proportional odds model and the nonproportional odds model have been applied to the HAART data. Also, the multilevel part of both models include random intercepts and random coefficients. In general, four models are explored in the analysis and then the models are compared using the deviance information criteria. Of these, the random coefficients nonproportional odds model is selected as the best model for the HAART data used as it has the smallest DIC value. This selected model shows that the progression of the disease increases as the time under the treatment increases. In addition it reveals that gender, baseline clinical stage and functional status of the patient have a significant association with the progression of the disease.

...read moreread less

Journal Article•DOI•

Goal-Programming-Based Procedure for Calculating Positive Multipliers Under a Multiple Criteria Data Envelopment Analysis Framework: An Application to UEFA EURO 2012

[...]

Ana Paula dos Santos Rubem¹, Luana Carneiro Brandão¹, João Carlos Correia Baptista Soares de Mello¹•Institutions (1)

Federal Fluminense University¹

14 Dec 2015-Annals of Data Science

TL;DR: This paper proposes an alternative procedure, based on goal programming, for calculating positive multipliers within a MCDEA framework, and results indicate that, to assure non-null multipliers, it is necessary a mild detachment of non-dominated solutions.

...read moreread less

Abstract: One of the motivations for the arise of the multiple criteria data envelopment analysis (MCDEA) model was the need to yield more reasonable input-output multipliers than those derived from standard data envelopment analysis (DEA), without using priori information. The problem of unreasonable multipliers occurs when some production units are efficient in standard DEA simply because the optimization problem allows those units to select few inputs/outputs to attach positive multipliers, discarding all of the others. Notwithstanding, MCDEA may fail in providing multipliers schemes free of non-null values. Therefore, in this paper, we propose an alternative procedure, based on goal programming, for calculating positive multipliers within a MCDEA framework. This procedure is applied to a previously reported problem, concerning the performance evaluation of national teams that participated in the 2012 UEFA European Football Championship (UEFA EURO 2012), where the MCDEA model has not succeeded in providing strictly positive multipliers schemes. The results derived by the proposed procedure indicate that, to assure non-null multipliers, it is necessary a mild detachment of non-dominated solutions.

...read moreread less