
Showing papers in "Progress in Artificial Intelligence in 2020"


Journal ArticleDOI
TL;DR: This paper mainly focuses on the application of deep learning architectures to three major applications, namely (i) wild animal detection, (ii) small arms detection and (iii) human being detection.
Abstract: Deep learning has developed as an effective machine learning method that takes in numerous layers of features or representations of the data and provides state-of-the-art results. The application of deep learning has shown impressive performance in various application areas, particularly in image classification, segmentation and object detection. Recent advances in deep learning techniques bring encouraging performance to fine-grained image classification, which aims to distinguish subordinate-level categories. This task is extremely challenging due to high intra-class and low inter-class variance. In this paper, we provide a detailed review of various deep architectures and models, highlighting the characteristics of each model. First, we describe the functioning of CNN architectures and their components, followed by a detailed description of various CNN models, from the classical LeNet model to AlexNet, ZFNet, GoogLeNet, VGGNet, ResNet, ResNeXt, SENet, DenseNet, Xception and PNAS/ENAS. We mainly focus on the application of deep learning architectures to three major applications, namely (i) wild animal detection, (ii) small arms detection and (iii) human being detection. A detailed review summary, including the systems, databases, applications and accuracy claimed, is also provided for each model to serve as a guideline for future work in the above application areas.

435 citations
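
As a companion to the CNN components this review covers, here is a minimal LeNet-style classifier sketch in PyTorch; the layer sizes, 32x32 grayscale input and 10-class output are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """Minimal LeNet-style CNN: conv -> pool -> conv -> pool -> fully connected."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of 8 grayscale 32x32 images.
logits = LeNetStyle()(torch.randn(8, 1, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```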


Journal ArticleDOI
TL;DR: A hybrid approach that combines the synthetic minority oversampling technique with ensemble methods is proposed; experiments show that it can serve as an efficient alternative for highly imbalanced datasets.
Abstract: Bankruptcy is one of the most critical financial problems that reflects a company’s failure. From a machine learning perspective, the problem of bankruptcy prediction is considered a challenging one, mainly because of the highly imbalanced distribution of the classes in the datasets. Therefore, developing an efficient prediction model that is able to detect the risky situation of a company is a challenging and complex task. To tackle this problem, in this paper, we propose a hybrid approach that combines the synthetic minority oversampling technique with ensemble methods. Moreover, we apply five different feature selection methods to find out which attributes are most dominant for bankruptcy prediction. The proposed approach is evaluated on a real dataset collected from Spanish companies. The conducted experiments show promising results, proving that the proposed approach can be used as an efficient alternative in the case of highly imbalanced datasets.

61 citations
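
A minimal sketch of the SMOTE-plus-ensemble idea using scikit-learn and imbalanced-learn; the synthetic data and the choice of a random forest as the ensemble are illustrative assumptions, not the authors' exact pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Toy, highly imbalanced "bankruptcy" dataset (about 5% positive class).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split, then fit the ensemble.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)

print(classification_report(y_te, clf.predict(X_te), digits=3))
```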


Journal ArticleDOI
TL;DR: A modification of the backpropagation algorithm for training sigmoid neurons is proposed; on most datasets, the modified derivative produces the same accuracy in fewer training steps.
Abstract: The vanishing gradient problem (VGP) is an important issue at training time in multilayer neural networks using the backpropagation algorithm. This problem is worse when sigmoid transfer functions are used in a network with many hidden layers. However, the sigmoid function is very important in several architectures, such as recurrent neural networks and autoencoders, where the VGP might also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid’s derivative, so that the proposed training direction differs slightly from the gradient while keeping the original sigmoid function in the network. Experiments suggest that the derivative’s modification produces the same accuracy in fewer training steps on most datasets. Moreover, due to the VGP, the original derivative does not converge when sigmoid functions are used in more than five hidden layers. However, the modification allows backpropagation to train two extra hidden layers in feedforward neural networks.

46 citations
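
The core idea is to add a small constant to the sigmoid's derivative during the backward pass while leaving the forward pass unchanged. Below is a minimal NumPy sketch of a single-layer gradient step; the constant 0.1 and the toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(a, epsilon=0.0):
    # Standard derivative a*(1-a); the proposed variant adds a small constant
    # so the training direction never vanishes when a saturates near 0 or 1.
    return a * (1.0 - a) + epsilon

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # batch of 32 inputs, 4 features
y = rng.integers(0, 2, size=(32, 1))  # binary targets
W, b = rng.normal(size=(4, 1)), np.zeros((1,))

for step in range(100):
    a = sigmoid(X @ W + b)                   # forward pass (unchanged sigmoid)
    delta = (a - y) * sigmoid_deriv(a, 0.1)  # modified derivative in the backward pass
    W -= 0.1 * X.T @ delta / len(X)
    b -= 0.1 * delta.mean(axis=0)

print("training loss:", float(np.mean((sigmoid(X @ W + b) - y) ** 2)))
```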


Journal ArticleDOI
TL;DR: The main motivation for this work is to automate the construction of similarity measures using machine learning while keeping training time as low as possible, and to investigate how to apply machine learning to effectively learn a similarity measure.
Abstract: Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, datasets are typically gathered as part of constructing a CBR or machine learning system. These datasets are assumed to contain the features that correctly identify the solution from the problem features; thus, they may also contain the knowledge to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning. Additionally, we would like to do this while keeping training time as low as possible. Working toward this, our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning, or for one-shot learning tasks. Recent work has advanced toward this goal, but relies either on very long training times or on manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs: the first uses a pre-trained classifier as the basis for a similarity measure, and the second uses as little modeling as possible while learning the similarity measure from data and keeping training time low. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state-of-the-art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state-of-the-art methods while keeping training time low.

34 citations
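
One plausible reading of the first design (a pre-trained classifier as the basis for a similarity measure) is to compare the class-probability vectors the classifier assigns to two cases; the sketch below follows that assumption and is not the authors' exact formulation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

def similarity(a, b):
    """Similarity of two cases as closeness of the classifier's probability outputs."""
    pa, pb = clf.predict_proba([a])[0], clf.predict_proba([b])[0]
    return 1.0 - 0.5 * np.abs(pa - pb).sum()   # 1 = identical class beliefs, 0 = disjoint

query = X[0]
# Retrieve the most similar stored case (excluding the query itself), CBR-style.
scores = [similarity(query, c) for c in X[1:]]
print("best match index:", 1 + int(np.argmax(scores)))
```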


Journal ArticleDOI
TL;DR: Cost-sensitive random forests outperformed other approaches in predicting bankruptcy, achieving a geometric mean of 90.7% and type I and type II errors of 0.094 and 0.088, respectively.
Abstract: Bankruptcy has been an issue of interest in the business world for decades. Predicting this phenomenon is a crucial endeavor for survival in periods of economic turmoil and recession. In fact, bankruptcy modeling is challenging due to the complexity of contributing factors and the highly imbalanced distribution of available data sets. This work aims at improving the prediction power of bankruptcy modeling by applying cost-sensitive ensemble methods to a real-world Spanish bankruptcy data set to generate prediction models. The performance of the prediction models is highly competitive in comparison with related research in the field. Cost-sensitive random forests outperformed other approaches in predicting bankruptcy, achieving a geometric mean of 90.7% and type I and type II errors of 0.094 and 0.088, respectively.

22 citations
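
A hedged sketch of the cost-sensitive random forest idea in scikit-learn, using class_weight to make errors on the minority (bankrupt) class more costly, together with the reported evaluation measures (type I/II errors and their geometricic mean counterpart, the G-mean). The data and weights are illustrative, not the Spanish data set used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=15, weights=[0.93, 0.07],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Cost-sensitive forest: errors on the rare "bankrupt" class (1) weigh 10x more.
clf = RandomForestClassifier(n_estimators=300, class_weight={0: 1, 1: 10},
                             random_state=1).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
# Here: type_i = false-alarm rate, type_ii = miss rate (naming conventions vary).
type_i, type_ii = fp / (fp + tn), fn / (fn + tp)
g_mean = np.sqrt((1 - type_i) * (1 - type_ii))   # geometric mean of the two class recalls
print(f"type I={type_i:.3f}  type II={type_ii:.3f}  G-mean={g_mean:.3f}")
```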


Journal ArticleDOI
TL;DR: This paper aims to provide an in-depth presentation of the contributions of multi-attribute value-based and outranking relations methods to a group of relevant financial applications in the period 2000–2018, putting the emphasis on the state-of-the-art developments and identifying open questions and critical challenges that deserve further research efforts.
Abstract: Over the last decades, the academic and professional communities have paid much attention toward the use of multi-criteria decision-making methods in a range of business and financial problems due to the variety and complexity of their decisions. Within this branch of operations research, the value-based and outranking relations approaches stand as two of the most powerful methodologies for decision-makers and analysts to produce accurate predictions and consistent evaluations in financial decision-making problems. This paper aims to provide an in-depth presentation of the contributions of multi-attribute value-based and outranking relations methods to a group of relevant financial applications in the period 2000–2018, putting the emphasis on the state-of-the-art developments and identifying open questions and critical challenges that deserve further research efforts.

21 citations


Journal ArticleDOI
TL;DR: Applying machine learning methods to the formation and adaptation of the EDMS interface makes it possible to automate its personalization to the user’s individual characteristics, increase the system’s flexibility and provide a good user experience at the first interaction with the EDMS, based on intelligent analysis of data about other users.
Abstract: A topical problem in the development of electronic document management systems (EDMS) is their adaptation and personalization to the individual characteristics of the user. This article discusses the development of an adaptation algorithm using machine learning methods for solving the problem of structural-parametric synthesis of an EDMS. Within the presented algorithm, approaches to the formalization of workflow processes, ways to adapt the interface to user parameters using artificial neural networks, and a comprehensive assessment of the system’s adaptability are considered. The scientific novelty of the approach consists in the algorithmic and software development for automating data collection, analysis and interface adaptation through the use and integration of neural networks in the information system. Applying machine learning methods to the formation and adaptation of the EDMS interface makes it possible to automate its personalization to the user’s individual characteristics, increase the system’s flexibility and provide a good user experience at the first interaction with the EDMS, based on intelligent analysis of data about other users. The main scientific results obtained in the article include: formalized criteria for adapting an EDMS; an algorithm for designing and adapting an EDMS; and software for adapting an EDMS, including a trained neural network and an API.

13 citations


Journal ArticleDOI
TL;DR: A Tag-based Query Semantic Reformulation process is proposed, which aims at reformulating tag-based users’ queries according to multiple semantic facets of the different images’ views, using a set of predefined ontological semantic rules.
Abstract: With the increasing popularity of social photograph-sharing Web sites, a huge mass of digital images, associated with sets of tags voluntarily introduced by amateur photographers, is hosted daily, and consequently the Tag-based social Image Retrieval technique has been widely adopted. However, tag-based queries are often too ambiguous and abstract to be considered an efficient solution for retrieving the most relevant images that meet users’ needs. As an alternative, the Semantic-based social Image Retrieval technique has emerged for the purpose of retrieving relevant images covering as many as possible of the topics that a given ambiguous query (q) may have. Diversification strategies are therefore a great challenge for researchers. In this context, we jointly investigate two processes at the ambiguous query preprocessing and postprocessing levels. On the one hand, we propose a Tag-based Query Semantic Reformulation process, which aims at reformulating tag-based users’ queries according to multiple semantic facets of the different images’ views, using a set of predefined ontological semantic rules. On the other hand, we propose a Multi-level Image Diversification process that first performs a two-level image clustering offline and, second, filters and re-ranks the image cluster retrieval results online according to their pertinence with respect to the reformulated query. The experimental results and statistical analysis performed on a collection of 25,000 socio-tagged images shared on Flickr demonstrate the effectiveness of the proposed technique, which is compared with a retrieval technique based on one-level image clustering, a tag-based image retrieval technique and recent CBIR techniques.

12 citations


Journal ArticleDOI
TL;DR: A protocol is proposed to describe the review process, including the search sources, inclusion and exclusion criteria of candidate papers, the data extraction procedure and the categorisation of primary studies, which gives a precise picture of the current research state of the community, trends and future challenges.
Abstract: Since its appearance in 2001, search-based software engineering has allowed software engineers to use optimisation techniques to automate distinctive human problems related to software management and development. The scientific community in Spain has not been alien to these advances. Their contributions cover both the optimisation of software engineering tasks and the proposal of new search algorithms. This review compiles the research efforts of this community in the area. With this aim, we propose a protocol to describe the review process, including the search sources, inclusion and exclusion criteria of candidate papers, the data extraction procedure and the categorisation of primary studies. After retrieving more than 3700 papers, 232 primary studies have been selected, whose analysis gives a precise picture of the current research state of the community, trends and future challenges. With 145 authors from 19 distinct institutions, results show that a diversity of tasks, including software planning, requirements, design and testing, and a large variety of techniques has been used, from exact search to evolutionary computation and swarm intelligence. Further, since 2015, specific scientific events have helped to bring together the community, improving collaborations, financial funding and internationalisation.

5 citations


Journal ArticleDOI
TL;DR: An empirical mode decomposition (EMD)-based replay spoofing detection system is presented; results show that the initial IMFs have the potential to carry replay attack patterns, and that processing them is sufficient rather than processing the entire signal.
Abstract: Automatic speaker verification (ASV) systems face their greatest threat from replay spoofing attacks. High-frequency regions of the underlying audio signal exhibit evidence of their presence. It is therefore useful to decompose the underlying audio signal into frequency bands or regions for analysis. In this paper, an empirical mode decomposition (EMD)-based replay spoofing detection system is presented. Using EMD, each signal is decomposed into several intrinsic mode functions (IMFs). The signal is reconstructed and represented using one or more subsets of these IMFs, evaluating different combinations for spoofing detection. Results on the ASVspoof 2017 version 2.0 and AVspoof benchmark replay attack datasets indicate that the initial IMFs have the potential to carry replay attack patterns, and that processing them is sufficient rather than processing the entire signal. The proposed approach can also serve as a preprocessing technique by employing a dimension-reduction strategy. Cross-corpus experiments on the systems indicate the limitations of ASV antispoofing systems under mismatched conditions.

4 citations
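
A minimal sketch of the IMF-subset idea using the PyEMD package (assuming its EMD class and call signature); the synthetic signal and the choice of keeping the first two IMFs are illustrative assumptions, not the configuration evaluated on ASVspoof/AVspoof.

```python
import numpy as np
from PyEMD import EMD  # the PyEMD package (PyPI name: EMD-signal)

# Synthetic stand-in for an utterance: two tones plus noise, 1 s at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
signal = (np.sin(2 * np.pi * 3000 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
          + 0.05 * np.random.default_rng(0).normal(size=fs))

imfs = EMD()(signal)                 # decompose into intrinsic mode functions
print("number of IMFs:", len(imfs))

# Keep only the first (highest-frequency) IMFs, where replay artifacts are
# reported to concentrate, and reconstruct a reduced signal from them.
reduced = imfs[:2].sum(axis=0)
# 'reduced' would then feed the usual feature extraction and classifier
# of the spoofing-detection pipeline.
```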


Journal ArticleDOI
TL;DR: Results showed the potential of the motor current signal for bearing fault diagnosis with high classification accuracy, and the possibility of providing a promising diagnostic model that can diagnose real bearing faults with different fault severities using MCS.
Abstract: This study aims to enhance the condition monitoring of external ball bearings using the raw data provided by Paderborn University, which includes sufficient data for the motor current signal (MCS). Three classes of bearings have been used: healthy bearings, bearings with an inner race defect, and bearings with an outer race defect. Online data at different operating conditions, with different bearings and fault extents of artificial and real damage, have been chosen to ensure the generalization and robustness of the model. After proper preprocessing of the raw vibration and MCS data, time, frequency, and time–frequency domain features have been extracted. Then, optimal features have been selected using a genetic algorithm. An artificial neural network with a structure optimized by a genetic algorithm has been implemented. A comparison between the performance of vibration and motor current signals is presented. Moreover, our results are compared to previous work using the same raw data. Results showed the potential of the motor current signal for bearing fault diagnosis with high classification accuracy. Moreover, the results showed the possibility of providing a promising diagnostic model that can diagnose real bearing faults with different fault severities using MCS.
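
As one small piece of the pipeline outlined above (preprocessing, feature extraction, GA feature selection, ANN), here is a hedged sketch of a few common time-domain features computed from a signal window; the specific features and the synthetic segment are assumptions, not the paper's exact feature set.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_domain_features(segment: np.ndarray) -> dict:
    """A handful of standard time-domain statistics for one signal window."""
    rms = np.sqrt(np.mean(segment ** 2))
    peak = np.max(np.abs(segment))
    return {
        "rms": rms,
        "peak": peak,
        "crest_factor": peak / rms,
        "kurtosis": kurtosis(segment),
        "skewness": skew(segment),
    }

# Example: a noisy 50 Hz current-like segment sampled at 64 kHz for 0.25 s.
fs, dur = 64000, 0.25
t = np.arange(int(fs * dur)) / fs
segment = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
print(time_domain_features(segment))
```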

Journal ArticleDOI
TL;DR: An algorithm is introduced, called Large Width (LW), that produces a multi-category classifier (defined on a distance space) with the property that the classifier has a large ‘sample width’ (width is a notion similar to classification margin).
Abstract: We introduce an algorithm, called Large Width (LW), that produces a multi-category classifier (defined on a distance space) with the property that the classifier has a large ‘sample width.’ (Width is a notion similar to classification margin.) LW is an incremental instance-based (also known as ‘lazy’) learning algorithm. Given a sample of labeled and unlabeled examples, it iteratively picks the next unlabeled example and classifies it while maintaining a large distance between each labeled example and its nearest-unlike prototype. (A prototype is either a labeled example or an unlabeled example which has already been classified.) Thus, LW gives a higher priority to unlabeled points whose classification decision ‘interferes’ less with the labeled sample. On a collection of UCI benchmark datasets, the LW algorithm ranks at the top when compared to 11 instance-based learning algorithms (or configurations). When compared to the best candidates among instance-based learners, MLP, SVM, a decision tree learner (C4.5) and Naive Bayes, LW ranks second, behind only MLP, which takes first place by a single extra win against LW. The LW algorithm can be implemented with parallel distributed processing to yield a high speedup factor and is suitable for any distance space, with a distance function which need not necessarily satisfy the conditions of a metric.
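
A much-simplified, hedged sketch of the greedy step described above: each unlabeled point is assigned the label that keeps the minimum distance from labeled examples to their nearest-unlike prototype as large as possible. Tie-breaking and the exact width definition are assumptions, not the paper's specification.

```python
import numpy as np

def lw_classify(X_lab, y_lab, X_unl, metric=lambda a, b: np.linalg.norm(a - b)):
    """Greedy Large-Width-style labeling of unlabeled points (simplified sketch)."""
    protos = [(x, y) for x, y in zip(X_lab, y_lab)]   # prototypes: labeled + already classified
    pending = list(range(len(X_unl)))
    labels = {}
    classes = sorted(set(y_lab))

    def width(extra=None):
        # Minimum distance from any labeled example to its nearest unlike prototype.
        pool = protos + ([extra] if extra is not None else [])
        return min(
            min(metric(xl, xp) for xp, yp in pool if yp != yl)
            for xl, yl in zip(X_lab, y_lab)
        )

    while pending:
        # Choose the (point, label) pair that interferes least with the labeled sample.
        i, c = max(((i, c) for i in pending for c in classes),
                   key=lambda ic: width((X_unl[ic[0]], ic[1])))
        labels[i] = c
        protos.append((X_unl[i], c))
        pending.remove(i)
    return [int(labels[i]) for i in range(len(X_unl))]

# Tiny 1-D example with two labeled points per class.
X_lab = np.array([[0.0], [1.0], [10.0], [11.0]])
y_lab = np.array([0, 0, 1, 1])
X_unl = np.array([[2.0], [9.0], [5.0]])
print(lw_classify(X_lab, y_lab, X_unl))   # [0, 1, 0]
```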

Journal ArticleDOI
TL;DR: This work presents the implementation and evaluation of a clustering algorithm based on a multi-agent system, which automatically detects the number of groups and the group labels for a given dataset.
Abstract: Clustering algorithms aim to detect groups based on similarity, from a given set of objects. Many clustering techniques have been proposed, most requiring the user to set critical parameters, such as the number of groups. This work presents the implementation and evaluation of a clustering algorithm based on a multi-agent system, which automatically detects the number of groups and the group labels for a given dataset. Groups formed during the clustering process emerge as patterns from the interaction among agents. The proposed algorithm is experimentally validated over benchmark datasets from the literature. The quality of clustering results is computed using seven internal indexes and one external index. Under this methodology, the proposed algorithm is compared to K-means and DBSCAN (density-based spatial clustering of applications with noise).

Journal ArticleDOI
TL;DR: SMPSO/RPD is proposed, an algorithm that provides the search capabilities of SMPSO, incorporates an interactive preference articulation mechanism based on defining one or more reference points, and is able to deal with dynamic problems.
Abstract: Multi-objective optimization deals with problems having two or more conflicting objectives that have to be optimized simultaneously. When the objectives change somehow with time, the problems become dynamic, and if the decision maker indicates preferences at runtime, then the algorithms to solve them become interactive. In this paper, we propose the integration of SMPSO/RP, an interactive multi-objective particle swarm optimizer based on SMPSO, with InDM2, an algorithmic template for dynamic interactive optimization with metaheuristics. The result is SMPSO/RPD, an algorithm that provides the search capabilities of SMPSO, incorporates an interactive preference articulation mechanism based on defining one or more reference points, and is able to deal with dynamic problems. We conduct a qualitative study showing the working of SMPSO/RPD on three benchmark problems, leaving a quantitative analysis as an open line of future research.

Journal ArticleDOI
TL;DR: A combination of triplet loss manifold regularization with a novel denoising regularizer is injected into the objective function to generate features which are robust against perpendicular perturbations around the data manifold and sensitive enough to variations along the manifold.
Abstract: Although regularized over-complete auto-encoders have shown great ability to extract meaningful representations from data and reveal their underlying manifold, their unsupervised learning nature prevents the consideration of class distinctions in the representations. The present study aims to learn sparse, robust, and discriminative features through supervised manifold-regularized auto-encoders, by preserving locality along the manifold directions around each data point and enhancing between-class discrimination. A combination of triplet loss manifold regularization with a novel denoising regularizer is injected into the objective function to generate features which are robust against perpendicular perturbations around the data manifold and sensitive enough to variations along the manifold. Also, the sparsity ratio of the obtained representation is adaptive based on the data distribution. The experimental results on 12 real-world classification problems show that the proposed method has better classification performance in comparison with several recently proposed relevant models.
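
A hedged PyTorch sketch of the kind of objective described above: a reconstruction term, a triplet loss on the encoded representation, and a denoising term encouraging codes that are stable under input perturbations. The architecture, margin and weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedAE(nn.Module):
    def __init__(self, d_in=64, d_hid=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
        self.dec = nn.Linear(d_hid, d_in)

    def forward(self, x):
        h = self.enc(x)
        return h, self.dec(h)

def loss_fn(model, anchor, positive, negative, noise_std=0.1, alpha=1.0, beta=1.0):
    h_a, rec = model(anchor)
    h_p, _ = model(positive)
    h_n, _ = model(negative)
    recon = F.mse_loss(rec, anchor)                                   # reconstruction
    triplet = F.triplet_margin_loss(h_a, h_p, h_n, margin=1.0)        # class discrimination
    h_noisy, _ = model(anchor + noise_std * torch.randn_like(anchor)) # denoising regularizer:
    denoise = F.mse_loss(h_noisy, h_a)                                # stable codes off-manifold
    return recon + alpha * triplet + beta * denoise

model = RegularizedAE()
a, p, n = (torch.randn(32, 64) for _ in range(3))
print(float(loss_fn(model, a, p, n)))
```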

Journal ArticleDOI
TL;DR: Comparison experiments show that the proposed unsupervised keyphrase extraction model achieves the best results on long documents and a competitive result on the short document, indicating that the model is effective and superior to state-of-the-art unsupervised models.
Abstract: We propose an unsupervised keyphrase extraction model that incorporates both the structural information and the semantic information of a document. The structural information refers to a directed graph composed of keyphrase candidates and topics. The weight between two candidates is computed from their relative distance in the document and the positions of the corresponding sentences. A graph ranking algorithm is then applied to obtain the structural scores of the candidates. The semantic score is obtained from the similarity between a candidate and all sentences. The final score of a candidate is the sum of the structural score and the semantic score. The top N candidates with the highest scores are selected as the recommended keyphrases. Comparison experiments on three widely used datasets show that our model achieves the best results on the long documents and a competitive result on the short document, indicating that our model is effective and superior to state-of-the-art unsupervised models.
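
A hedged toy sketch of combining a structural score with a semantic score: networkx PageRank over a small candidate graph plus mean TF-IDF cosine similarity between each candidate and the sentences. The example text, graph construction and equal weighting are illustrative assumptions, not the paper's model.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "graph ranking scores keyphrase candidates",
    "semantic similarity compares candidates with sentences",
    "the final score sums the structural and semantic scores",
]
candidates = ["graph ranking", "keyphrase candidates", "semantic similarity", "final score"]

# Structural score: PageRank over a toy co-occurrence graph of candidates.
G = nx.Graph()
G.add_weighted_edges_from([
    ("graph ranking", "keyphrase candidates", 1.0),
    ("semantic similarity", "keyphrase candidates", 1.0),
    ("final score", "semantic similarity", 0.5),
])
structural = nx.pagerank(G, weight="weight")

# Semantic score: mean TF-IDF cosine similarity between a candidate and all sentences.
vec = TfidfVectorizer().fit(sentences + candidates)
S = vec.transform(sentences)
semantic = {c: float(cosine_similarity(vec.transform([c]), S).mean()) for c in candidates}

# Final score: sum of the structural and semantic scores; keep the top N.
final = {c: structural.get(c, 0.0) + semantic[c] for c in candidates}
print(sorted(final, key=final.get, reverse=True)[:2])
```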

Journal ArticleDOI
TL;DR: GDTM, a single-pass graph-based DTM algorithm, is presented; it combines a context-rich and incremental feature representation method with graph partitioning to address scalability and dynamicity, and uses a rich language model to account for sparsity.
Abstract: Dynamic Topic Modeling (DTM) is the ultimate solution for extracting topics from short texts generated in Online Social Networks (OSNs) like Twitter. It is required to be scalable and to account for the sparsity and dynamicity of short texts. Current solutions combine probabilistic mixture models like the Dirichlet Multinomial or Pitman-Yor Process with approximate inference approaches like Gibbs Sampling and Stochastic Variational Inference to account, respectively, for the dynamicity and scalability of DTM. However, these methods basically rely on weak probabilistic language models, which do not account for sparsity in short texts. In addition, their inference is based on iterative optimizations, which have scalability issues when it comes to DTM. We present GDTM, a single-pass graph-based DTM algorithm, to solve the problem. GDTM combines a context-rich and incremental feature representation method with graph partitioning to address scalability and dynamicity, and uses a rich language model to account for sparsity. We run multiple experiments over a large-scale Twitter dataset to analyze the accuracy and scalability of GDTM and compare the results with four state-of-the-art models. As a result, GDTM outperforms the best model by 11% on accuracy and runs an order of magnitude faster, while producing four times better topic quality on standard evaluation metrics.

Journal ArticleDOI
TL;DR: Several encoding strategies based on neural networks are analyzed and applied; the results show that the order in which the musical pieces were listened to is relevant for the codification of items (songs), and that the encoding of user profiles should use a different amount of historical data depending on the learning task to be solved.
Abstract: The aim of Recommender Systems is to suggest items (products) to satisfy each user’s particular taste. Representation strategies play a very important role in these systems, as an adequate codification of users and items is expected to ease the induction of a model which synthesizes their tastes and make better recommendations. However, in addition to gathering information about users’ tastes, there is an additional aspect that can be relevant for a proper codification strategy, namely the order in which the user interacted with the items. In this paper, several encoding strategies based on neural networks are analyzed and applied to solve two different recommendation tasks in the context of music playlists. The results show that the order in which the musical pieces were listened to is relevant for the codification of items (songs). We also find that the encoding of user profiles should use a different amount of historical data depending on the learning task to be solved. In other words, we do not always have to use all the available data; sometimes, it is better to discard old information, as tastes change over time.

Journal ArticleDOI
TL;DR: This paper deals with a subtask arising within the picking task in a warehouse, when the picking policy follows the order batching strategy and orders are received online.
Abstract: Warehousing includes many different regular activities such as receiving, batching, picking, packaging, and shipping goods. Several authors indicate that the picking operation might consume up to 55% of the total operational costs. In this paper, we deal with a subtask arising within the picking task in a warehouse, when the picking policy follows the order batching strategy (i.e., orders are grouped into batches before being collected) and orders are received online. In particular, once the batches have been compiled it is necessary to determine the moment in time when the picker starts collecting each batch. The waiting time of the picker before starting to collect the next available batch is usually known as the time window. In this paper, we compare the performance of two different time window strategies: Fixed Time Window and Variable Time Window. Since those strategies cannot be tested in isolation, we have considered two different batching algorithms (First Come First Served and a greedy algorithm based on weight), one routing algorithm (S-Shape), and a greedy selection algorithm for choosing the next batch to collect based on weight.

Journal ArticleDOI
TL;DR: The experimental results on synthetic and real datasets, using the well-known neighborhood-based clustering (NBC) algorithm and the DBSCAN (density-based spatial clustering of applications with noise) algorithm, illustrate the superiority of the proposed index over some classical and recent indices and show its effectiveness for the evaluation of clustering algorithms and the selection of their appropriate parameters.
Abstract: Clustering has an important role in the data mining field. However, there is a large variety of clustering algorithms and each can generate quite different results depending on its input parameters. In the research literature, several cluster validity indices have been proposed to evaluate clustering results and find the partition that best fits the input dataset. However, these validity indices may fail to achieve satisfactory results, especially in the case of clusters with arbitrary shapes. In this paper, we propose a new cluster validity index for density-based, arbitrarily shaped clusters. Our new index is based on the density and connectivity relations extracted among the data points, using a proximity graph, the Gabriel graph. The incorporation of the connectivity and density relations allows achieving the best clustering results in the case of clusters of any shape, size or density. The experimental results on synthetic and real datasets, using the well-known neighborhood-based clustering (NBC) algorithm and the DBSCAN (density-based spatial clustering of applications with noise) algorithm, illustrate the superiority of the proposed index over some classical and recent indices and show its effectiveness for the evaluation of clustering algorithms and the selection of their appropriate parameters.
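
Since the index builds on the Gabriel graph, here is a small NumPy sketch of its construction: points p and q are connected iff no third point r satisfies d(p,q)^2 > d(p,r)^2 + d(q,r)^2, i.e. no other point lies inside the circle having pq as diameter. How the resulting edges feed the density and connectivity relations of the index is not shown; the random data are illustrative.

```python
import numpy as np

def gabriel_graph(points: np.ndarray):
    """Return the edge list of the Gabriel graph of a 2-D point set (O(n^3) sketch)."""
    n = len(points)
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) is a Gabriel edge iff no other point k falls inside the
            # circle whose diameter is the segment ij.
            others = [k for k in range(n) if k not in (i, j)]
            if all(d2[i, j] <= d2[i, k] + d2[j, k] for k in others):
                edges.append((i, j))
    return edges

pts = np.random.default_rng(0).random((12, 2))
print(gabriel_graph(pts))
```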

Journal ArticleDOI
TL;DR: This paper analyzes some open-source technological frameworks available for data streams, detailing their main characteristics, and makes a performance and latency comparison between Spark Streaming, Spark Structured Streaming, Storm, Flink and Samza following the Yahoo Streaming Benchmark methodology.
Abstract: Real-time data analysis is becoming increasingly important in Big Data environments for addressing data stream issues. To this end, several technological frameworks have been developed, both open-source and proprietary, for the analysis of streaming data. This paper analyzes some open-source technological frameworks available for data streams, detailing their main characteristics. The objective is to facilitate decisions on which framework to use, meeting the needs of data mining methods for data streams. In this sense, there are important factors affecting the choice about which framework is most suitable for this purpose. Some of these factors are the existence of data mining libraries, the available documentation, the maturity of the platform, fault tolerance and processing guarantees, among others. Another decisive factor when choosing a data stream framework is its performance. For this reason, two comparisons have been made: a performance and latency comparison between Spark Streaming, Spark Structured Streaming, Storm, Flink and Samza following the Yahoo Streaming Benchmark methodology, and a comparison between Spark Streaming and Flink with a clustering algorithm for data streaming called streaming K-means.

Journal ArticleDOI
TL;DR: This paper aims to compute similarity between the words using their context information, syntactic information and occurrence statistics in external corpora through a kernel function that combines two sub-kernels.
Abstract: The performance of word sequential labelling tasks like named entity recognition and parts-of-speech tagging largely depends on the features chosen for the task. In general, however, representing a word and capturing its characteristics properly through a set of features is quite difficult. Moreover, external resources often become essential in order to build a high-performance system, but acquiring the required knowledge demands domain-specific processing and feature engineering. Kernel functions along with support vector machines may offer an alternative way to capture similarity between words more efficiently, using both the local context and external corpora. In this paper, we aim to compute similarity between words using their context information, syntactic information and occurrence statistics in external corpora. This similarity value is computed through a kernel function. The proposed kernel function combines two sub-kernels. One of these captures global information through word co-occurrence statistics accumulated from a large corpus. The second kernel captures local semantic information of the words through word-specific parse tree fragmentation. We test the proposed kernel on the JNLPBA 2004 Biomedical Named Entity Recognition and BioCreative II 2006 Gene Mention Recognition task datasets. In our experiments, we observe that the proposed method is effective on both datasets.
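
A hedged sketch of combining two sub-kernels inside an SVM, using scikit-learn's support for callable kernels; the placeholder "global" and "local" feature blocks and their simple sum stand in for the paper's far richer co-occurrence and parse-tree sub-kernels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)
# Placeholder word representations: the first 10 dims stand in for "global"
# co-occurrence features, the last 5 dims for "local" context features.
X = rng.normal(size=(200, 15))
y = (X[:, 0] + X[:, 10] > 0).astype(int)   # toy labels

def combined_kernel(A, B):
    # Sum of two sub-kernels: a linear kernel on the global part and an RBF
    # kernel on the local part (valid, since sums of kernels are kernels).
    return linear_kernel(A[:, :10], B[:, :10]) + rbf_kernel(A[:, 10:], B[:, 10:])

clf = SVC(kernel=combined_kernel).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```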

Journal ArticleDOI
TL;DR: This paper applies a class of Bayesian models that have been successfully used in streaming data contexts to the problem of comparing multinomial populations, and shows how it is possible, by means of a relevant parameter, to decide whether two populations are different or not.
Abstract: Two-sample statistical tests are commonly used when deciding whether two samples can be considered to be drawn from the same population. However, statistical tests face problems when confronted with situations involving extremely large volumes of data, in which case the power of the test is so high that it rejects the null hypothesis even if the differences found in the data are minimal. Furthermore, the fact that they may require exploring the whole sample each time they are applied is a serious limitation, for instance, in streaming data contexts. In this paper, we apply a class of Bayesian models that have been successfully used in streaming data contexts to the problem of comparing multinomial populations. The underlying tool is latent variable models with hierarchical power priors. We show how it is possible, by means of a relevant parameter, to decide whether two populations are different or not.

Journal ArticleDOI
TL;DR: The proposed “coaching” approach focuses on accelerating learning in systems with a sparse environmental reward setting; it works well with linear epsilon-greedy Q-learning with eligibility traces, and experiments show that the method can speed up the learning process of an agent in all tasks.
Abstract: The learning process in reinforcement learning is time-consuming because in early episodes the agent relies too much on exploration. The proposed “coaching” approach focuses on accelerating learning in systems with a sparse environmental reward setting. This approach works well with linear epsilon-greedy Q-learning with eligibility traces. To coach an agent, an intermediate target is given by a human coach as a sub-goal for the agent to pursue. This sub-goal provides an additional clue that guides the agent toward the actual terminal state. In the coaching phase, the agent pursues the intermediate target with an aggressive policy. The aggressive reward from this intermediate target is not used to update the state-action value directly; only the environmental reward is used. After a small number of coaching episodes, learning proceeds normally with an ε-greedy policy. In this way, the agent ends up with an optimal policy which is not under the influence or supervision of a human coach. The proposed method has been tested on three experimental tasks: mountain car, ball following, and obstacle avoidance. Even with human coaches of various skill levels, the experimental results show that this method can speed up the learning process of an agent in all tasks.
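
A hedged tabular sketch of the coaching idea on a toy corridor task: during a few coaching episodes the agent greedily pursues a chosen sub-goal state, but Q-values are updated only with the environmental reward; afterwards learning continues with an ordinary epsilon-greedy policy. The environment, the sub-goal and the use of plain one-step Q-learning (no eligibility traces, unlike the paper) are assumptions.

```python
import numpy as np

N, GOAL, SUBGOAL = 10, 9, 5         # short corridor; sparse reward only at the terminal GOAL
ACTIONS = (-1, +1)                   # move left / move right
Q = np.zeros((N, len(ACTIONS)))
rng = np.random.default_rng(0)

def step(s, a_idx):
    s2 = int(np.clip(s + ACTIONS[a_idx], 0, N - 1))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL   # environmental reward only

def run_episode(coaching, alpha=0.5, gamma=0.99, eps=0.1, max_steps=100):
    s = 0
    for _ in range(max_steps):
        if coaching and s < SUBGOAL:
            a = 1                                           # aggressively pursue the coach's sub-goal
        elif rng.random() < eps:
            a = int(rng.integers(len(ACTIONS)))             # epsilon-greedy exploration
        else:
            a = int(np.argmax(Q[s] + 1e-9 * rng.random(len(ACTIONS))))  # greedy, random tie-break
        s2, r, done = step(s, a)
        # The Q-update always uses the environmental reward, never a sub-goal bonus.
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
        if done:
            return True
    return False

for ep in range(300):
    run_episode(coaching=ep < 15)    # a few coached episodes first, then normal learning
# After learning, most states should prefer action index 1 (move right toward the goal).
print("greedy action per state:", [int(np.argmax(Q[s])) for s in range(N)])
```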