
Showing papers presented at "International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management in 2015"


Proceedings ArticleDOI
12 Nov 2015
TL;DR: This paper addresses the issue of finding an efficient ontology evaluation method by presenting the existing ontology evaluation techniques and discussing their advantages and drawbacks.
Abstract: Ontologies have become widely used for knowledge representation and are considered a foundation for the Semantic Web. However, with their widespread usage, the question of their evaluation has become ever more pressing. This paper addresses the issue of finding an efficient ontology evaluation method by presenting the existing ontology evaluation techniques, while discussing their advantages and drawbacks. The presented ontology evaluation techniques can be grouped into four categories: gold standard-based, corpus-based, task-based and criteria-based approaches.

110 citations


Proceedings ArticleDOI
01 Jan 2015
TL;DR: The findings suggest that universities generally encourage and facilitate the transfer of tacit knowledge; however, there are some areas that require improvement.
Abstract: The purpose of this paper is to explore whether Australian universities encourage tacit knowledge transfer. In doing so, the paper also explores the role of managers (academics' supervisors) in promoting or hampering tacit knowledge transfer and the value given to new ideas and innovation. This study collected data by interviewing academics in four universities, and a qualitative narrative analysis was carried out. The findings suggest that universities generally encourage and facilitate the transfer of tacit knowledge; however, there are some areas that require improvement. Avenues for improving tacit knowledge transfer call for open communication, peer trust and unrestricted sharing of knowledge by managers. The study was conducted in only four universities, which limits the generalisability of the findings. This paper will contribute to further research in the discipline of tacit knowledge, provide understanding and guide universities in their tacit knowledge transfer efforts and, in particular, encourage the transfer of tacit knowledge.

70 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: The SCUT hybrid sampling method is proposed to balance the number of training examples in a multi-class setting; when SCUT is used to pre-process the data before classification, it yields highly accurate models that compare favourably to the state of the art.
Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and incorrectly classify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracy. Further, there is a need to handle both the imbalance between classes and the selection of examples within a class (the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In addition, it handles both within-class and between-class imbalance. Our experimental results against a number of multi-class problems show that, when the SCUT method is used for pre-processing the data before classification, we obtain highly accurate models that compare favourably to the state of the art.
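The abstract only outlines SCUT at a high level (synthetic oversampling of minority classes plus cluster-based undersampling of majority classes). The sketch below is a rough approximation of that idea using off-the-shelf components (imbalanced-learn's SMOTE and scikit-learn's KMeans); the mean-class-size target, the use of cluster centroids and the helper name scut_like_resample are assumptions for illustration, not the authors' reference implementation.

```python
# Illustrative SCUT-style resampler: majority classes are undersampled via
# k-means centroids, minority classes are oversampled with SMOTE, all toward
# the mean class size. X and y are assumed to be NumPy arrays.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def scut_like_resample(X, y, random_state=0):
    counts = Counter(y)
    target = int(np.mean(list(counts.values())))  # balance every class to the mean size

    X_parts, y_parts = [], []
    for label, n in counts.items():
        X_c = X[y == label]
        if n > target:
            # Undersample: represent the class by k-means cluster centroids.
            km = KMeans(n_clusters=target, n_init=10, random_state=random_state).fit(X_c)
            X_parts.append(km.cluster_centers_)
        else:
            X_parts.append(X_c)
        y_parts.append(np.full(len(X_parts[-1]), label))

    X_b, y_b = np.vstack(X_parts), np.concatenate(y_parts)

    # Oversample the remaining (minority) classes up to the target size.
    # Note: SMOTE needs more samples per class than its k_neighbors setting.
    strategy = {c: target for c, n in Counter(y_b).items() if n < target}
    if strategy:
        X_b, y_b = SMOTE(sampling_strategy=strategy,
                         random_state=random_state).fit_resample(X_b, y_b)
    return X_b, y_b
```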

70 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: It is hypothesized that soft exponential has the potential to improve neural network learning, as it can exactly calculate many natural operations that typical neural networks can only approximate, including addition, multiplication, inner product, distance, and sinusoids.
Abstract: We present the soft exponential activation function for artificial neural networks that continuously interpolates between logarithmic, linear, and exponential functions. This activation function is simple, differentiable, and parameterized so that it can be trained as the rest of the network is trained. We hypothesize that soft exponential has the potential to improve neural network learning, as it can exactly calculate many natural operations that typical neural networks can only approximate, including addition, multiplication, inner product, distance, and sinusoids.
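For concreteness, here is a small sketch of a soft-exponential style activation with a single trainable parameter alpha, following the piecewise log/linear/exp interpolation the abstract describes; the exact constants should be checked against the paper, so treat this as an illustration rather than the definitive definition.

```python
import numpy as np

def soft_exponential(alpha, x):
    """Interpolates between logarithmic (alpha < 0), linear (alpha == 0)
    and exponential (alpha > 0) behaviour with one trainable parameter."""
    if alpha < 0:
        # Log-like branch; only defined where 1 - alpha * (x + alpha) > 0.
        return -np.log(1.0 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x  # identity
    return (np.exp(alpha * x) - 1.0) / alpha + alpha  # exp-like branch

x = np.linspace(-1.0, 1.0, 5)
print(soft_exponential(-0.5, x), soft_exponential(0, x), soft_exponential(0.5, x))
```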

38 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: A model-based approach is proposed which comprises an initial pre-processing step followed by selection of imperative attributes, with classification achieved using association rule mining; the approach has the potential to distinguish AD from healthy controls.
Abstract: Alzheimer's disease (AD), an irreversible brain disease, impairs thinking and memory while overall brain volume shrinks, ultimately leading to death. Early diagnosis of AD is essential for the development of more effective treatments. Machine learning (ML), a branch of artificial intelligence, employs a variety of probabilistic and optimization techniques that permit computers to learn from vast and complex datasets. As a result, researchers increasingly apply machine learning to the diagnosis of early-stage AD. This paper presents a review, analysis and critical evaluation of the recent work done for the early detection of AD using ML techniques. Several methods achieved promising prediction accuracies; however, they were evaluated on different, pathologically unproven datasets from different imaging modalities, making a fair comparison among them difficult. Moreover, other factors such as pre-processing, the number of attributes retained for feature selection and class imbalance also affect the assessment of prediction accuracy. To overcome these limitations, a model is proposed which comprises an initial pre-processing step, followed by selection of imperative attributes and classification using association rule mining. This model-based approach points research on early AD diagnosis in the right direction and has the potential to distinguish AD from healthy controls.
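The proposed pipeline (pre-processing, attribute selection, classification via association rule mining) is described only in outline, so the snippet below merely illustrates a generic association-rule step over a one-hot encoded attribute table using the mlxtend library; the column names, thresholds and the idea of keeping only rules whose consequent is the diagnosis label are assumptions, not the authors' model.

```python
# Illustrative association-rule step: mine rules whose consequent is the
# diagnosis label from a one-hot encoded (boolean) attribute table.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical boolean table: one row per subject, one column per discretised
# attribute, plus a "diagnosis=AD" indicator column (toy values).
records = pd.DataFrame({
    "hippocampus=small": [1, 1, 0, 1, 0, 0],
    "mmse=low":          [1, 1, 0, 1, 0, 1],
    "age>75":            [1, 0, 1, 1, 0, 0],
    "diagnosis=AD":      [1, 1, 0, 1, 0, 0],
}).astype(bool)

itemsets = apriori(records, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)

# Keep only rules that predict the class label, i.e. candidate classification rules.
class_rules = rules[rules["consequents"].apply(lambda c: "diagnosis=AD" in c)]
print(class_rules[["antecedents", "support", "confidence"]])
```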

34 citations


Proceedings ArticleDOI
01 Nov 2015
TL;DR: New ways and methods are explored to improve and maximize classification performance, especially to enhance precision and reduce false positives, through examination and handling of class imbalance issues and through incorporation of LDA topic models.
Abstract: Today, the presence of harmful and inappropriate content on the web still remains one of the most primary concerns for web users. Web classification models in the early days were limited by the methods and data available. In our research we revisit the web classification problem with the application of new methods and techniques for text content analysis. Our recent studies have indicated the promising potential of combining topic analysis and sentiment analysis in web content classification. In this paper we further explore new ways and methods to improve and maximize classification performance, especially to enhance precision and reduce false positives, through examination and handling of class imbalance issues and through incorporation of LDA topic models.
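A minimal sketch of the direction described (topic features from LDA combined with explicit handling of class imbalance) is shown below; the pipeline, the feature combination and the class_weight="balanced" choice are assumptions about one reasonable setup, not the authors' exact configuration, and the two documents are placeholders for a real corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["placeholder web page text", "another placeholder page"]  # real corpus needed
labels = [0, 1]                                                    # 1 = harmful/inappropriate

counts = CountVectorizer(max_features=5000).fit_transform(docs)
topics = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(counts)
tfidf = TfidfVectorizer(max_features=5000).fit_transform(docs)

# Concatenate lexical and topic features; weight classes to counter imbalance.
X = np.hstack([tfidf.toarray(), topics])
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, labels)
```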

32 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: Analysing Arabic (Saudi dialect) tweets to extract sentiment toward a specific topic, using a dataset of 3000 tweets collected in three domains, confirms the superiority of the hybrid learning approach over the supervised and unsupervised approaches.
Abstract: Harvesting meaning out of massively increasing data could be of great value for organizations. Twitter is one of the biggest public and freely available data sources. This paper presents a hybrid learning approach to sentiment analysis that combines lexicon-based and supervised approaches, analysing Arabic (Saudi dialect) tweets to extract sentiment toward a specific topic. This was done using a dataset of 3000 tweets collected in three domains. The obtained results confirm the superiority of the hybrid learning approach over the supervised and unsupervised approaches.

27 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: This research presents an implementation of a sentiment analyser for Twitter, one of the biggest public and freely available big data sources, and analyses Arabic (Saudi dialect) tweets to extract sentiment toward a specific topic.
Abstract: Data has become the currency of this era and it continues to increase massively in size and generation rate. Large data generated from organisations' e-transactions or from individuals through social networks can be of great value when analysed properly. This research presents an implementation of a sentiment analyser for Twitter, one of the biggest public and freely available big data sources. It analyses Arabic (Saudi dialect) tweets to extract sentiment toward a specific topic, using a dataset of 3000 tweets collected from Twitter. The collected tweets were analysed using two machine learning approaches: a supervised approach trained on the collected dataset, and the proposed hybrid learning approach trained on a single-word dictionary. Two algorithms are used: Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). The results obtained by cross-validation on the same dataset clearly confirm the superiority of the hybrid learning approach over the supervised approach.
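As a rough sketch of the supervised branch described here (SVM and KNN compared by cross-validation on a labelled tweet set), the snippet below uses TF-IDF features and default-ish parameters; these choices and the tiny repeated placeholder dataset are assumptions, and no Arabic-specific preprocessing (normalisation, stemming) is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Placeholder labelled tweets, repeated only so cross-validation can run;
# the study uses ~3000 real tweets.
tweets = ["تجربة رائعة جدا", "خدمة سيئة للغاية"] * 50
labels = ["positive", "negative"] * 50

X = TfidfVectorizer().fit_transform(tweets)

for name, clf in [("SVM", SVC(kernel="linear")),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(clf, X, labels, cv=10)  # 10-fold cross-validation
    print(name, scores.mean())
```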

22 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: This work studied the relationship between Bitcoin trading volumes and Google search query volumes, achieving significant cross-correlation values and demonstrating the power of search volumes to anticipate Bitcoin trading volumes.
Abstract: In the last decade, Web 2.0 services have been widely used as communication media. Due to the huge amount of available information, searching has become dominant in the use of the Internet. Millions of users interact daily with search engines, producing valuable sources of interesting data regarding several aspects of the world. Search queries prove to be a useful source of information in financial applications, where the frequency of searches for terms related to a digital currency can be a good measure of interest in it. Bitcoin, a decentralized electronic currency, represents a radical change in financial systems, attracting a large number of users and a lot of media attention. In this work we studied the relationship between Bitcoin trading volumes and Google search query volumes. We achieved significant cross-correlation values, demonstrating the power of search volumes to anticipate Bitcoin trading volumes.
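The core analysis is a cross-correlation between two time series (search volume and trading volume) at different lags. A minimal sketch of that computation is shown below; the series are synthetic placeholders, not the study's data, and the helper name lagged_correlation is an invention for illustration.

```python
import numpy as np

def lagged_correlation(search, trading, lag):
    """Pearson correlation between search volume and trading volume shifted by
    `lag` steps (positive lag: search volume leads trading volume)."""
    if lag > 0:
        s, t = search[:-lag], trading[lag:]
    elif lag < 0:
        s, t = search[-lag:], trading[:lag]
    else:
        s, t = search, trading
    return np.corrcoef(s, t)[0, 1]

rng = np.random.default_rng(0)
search = rng.random(200)
trading = np.roll(search, 3) + 0.1 * rng.random(200)  # synthetic: trading lags search by 3 steps

for lag in range(-5, 6):
    print(lag, round(lagged_correlation(search, trading, lag), 3))
```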

22 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: An experience of importing, querying and visualizing a graph database is described, with the WordNet database as a case study, using Neo4J and Cytoscape to simplify the large-scale visualization of WordNet.
Abstract: In the Big Data era, the visualization of large data sets is becoming an increasingly relevant task due to the great impact that data have from a human perspective. Since visualization is the phase of the data life cycle closest to the users, there is no doubt that an effective, efficient and impressive representation of the analyzed data may be as important as the analytic process itself. This paper presents an experience of importing, querying and visualizing a graph database; in particular, we describe the WordNet database as a case study, using Neo4J and Cytoscape. We describe each step in this study, focusing on the strategies used to overcome the various problems, which are mainly due to the intricate nature of the case study. Finally, an attempt is made to define some criteria to simplify the large-scale visualization of WordNet, providing some examples and considerations which have arisen.
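As a hedged illustration of the import step, the snippet below exports WordNet synsets and hypernym links (via NLTK) to two CSV files that Neo4j can consume; the file names, property names and the restriction to hypernym edges are simplifying assumptions for brevity, not the paper's exact graph model. The resulting files can then be loaded with Neo4j's LOAD CSV (one node per synset, one relationship per edge) and explored in Cytoscape.

```python
# Export WordNet synsets and hypernym edges to CSV for loading into Neo4j.
import csv
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

with open("synsets.csv", "w", newline="", encoding="utf-8") as nodes, \
     open("hypernyms.csv", "w", newline="", encoding="utf-8") as edges:
    node_writer, edge_writer = csv.writer(nodes), csv.writer(edges)
    node_writer.writerow(["id", "pos", "definition"])
    edge_writer.writerow(["child", "parent"])
    for synset in wn.all_synsets():
        node_writer.writerow([synset.name(), synset.pos(), synset.definition()])
        for hyper in synset.hypernyms():
            edge_writer.writerow([synset.name(), hyper.name()])
```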

20 citations


Proceedings ArticleDOI
12 Nov 2015
TL;DR: This paper evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts, and showed that most of the measures performed equally on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed worse.
Abstract: Many Information Retrieval (IR) and Natural Language Processing (NLP) systems require textual similarity measurement in order to function, and do so with the help of similarity measures. Similarity measures behave differently: some measures that work well on highly similar texts do not do as well on highly dissimilar texts. In this paper, we evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts. The evaluation was carried out in the context of candidate selection for plagiarism detection. Performance was measured in terms of recall, and the best-performing similarity measure(s) for each degree of textual similarity were identified. Results from our experiments show that the performances of most of the measures were equal on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed worse. Cosine similarity and the Bhattacharyya coefficient performed best on lightly reviewed text, and on heavily reviewed texts, cosine similarity and Pearson correlation performed best and next best respectively. Pearson correlation had the best performance on highly dissimilar texts. The results also identify the term weighting methods and n-gram document representations that best optimise the performance of each similarity measure at a particular level of intertextual similarity.
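To make the comparison concrete, here is a small sketch of how several of the listed measures can be computed over term-frequency vectors with NumPy/SciPy; the vectors and the uniform normalisation are placeholder choices, and the paper's own term weighting and n-gram settings may differ.

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jensenshannon
from scipy.stats import pearsonr

a = np.array([3.0, 0.0, 1.0, 2.0])  # placeholder term-frequency vectors
b = np.array([2.0, 1.0, 0.0, 2.0])

print("cosine similarity:   ", 1.0 - cosine(a, b))
print("euclidean distance:  ", euclidean(a, b))
print("pearson correlation: ", pearsonr(a, b)[0])

# Jensen-Shannon and Bhattacharyya operate on probability distributions.
p, q = a / a.sum(), b / b.sum()
print("jensen-shannon div.: ", jensenshannon(p, q) ** 2)
print("bhattacharyya coeff.:", np.sum(np.sqrt(p * q)))
```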

Proceedings ArticleDOI
12 Nov 2015
TL;DR: The proposed method was evaluated on benchmark datasets of biomedical questions, and the experimental results show that it can classify biomedical questions effectively with high accuracy.
Abstract: Biomedical Question Type (QT) classification is an important component of biomedical question answering systems and has attracted a notable amount of research over the past decade. Biomedical QT classification is the task of determining the QT of a given biomedical question, assigning it to one of several question types. The question type, in turn, helps determine the appropriate answer extraction algorithm. In this paper, we propose an effective and efficient method for biomedical QT classification. We classify biomedical questions into three broad categories and define syntactic patterns for each category. Using these question patterns, we propose an algorithm for classifying a question into a particular category. The proposed method was evaluated on benchmark datasets of biomedical questions. The experimental results show that it can classify biomedical questions effectively with high accuracy.
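The abstract describes classifying questions into broad categories via syntactic patterns. A toy sketch of this idea with hand-written regular expressions follows; the three category names and the patterns themselves are invented for illustration and are not the paper's actual pattern set.

```python
import re

# Hypothetical pattern -> broad question-type category mapping (illustrative only).
PATTERNS = [
    (re.compile(r"^(is|are|does|do|can|did)\b", re.I), "yes/no"),
    (re.compile(r"^(what|which|who|where|when)\b", re.I), "factoid"),
    (re.compile(r"^(how|why|describe|explain)\b", re.I), "summary"),
]

def classify_question(question):
    for pattern, category in PATTERNS:
        if pattern.search(question.strip()):
            return category
    return "unknown"

print(classify_question("Is aspirin effective against headaches?"))   # yes/no
print(classify_question("What genes are associated with diabetes?"))  # factoid
```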

Proceedings ArticleDOI
12 Nov 2015
TL;DR: This paper reviews what is currently known about Information Systems and Systems of Systems, and proceeds to suggest an architecture of a System of Information Systems that integrates several Information Systems and allows information to be transferred with ease between those different components.
Abstract: Information Systems are viewed as a set of services creating a workflow of information directed to specific groups and members. This allows individuals to share ideas and their talents with other members, so that tasks can be carried out both efficiently and effectively. Since Information Systems revolve around creating information useful to users, and in some higher forms around creating knowledge, management of information and/or knowledge is part of their functionality. In this paper we aim to study the placement of Information Systems as part of a System of Systems (SoS), as these large systems offer significant technical improvements in terms of information interoperability that overcome conceptual and technical barriers. Therefore, we move towards defining and modeling the System of Information Systems (SoIS). This paper reviews what is currently known about Information Systems and Systems of Systems, and proceeds to suggest an architecture of a System of Information Systems that integrates several Information Systems and allows information to be transferred with ease between those different components.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: This paper introduces a simpler method based on the Markov chain theory to accomplish both transfer learning and sentiment classification tasks, which requires a lower parameter calibration effort.
Abstract: Sentiment classification of textual opinions into positive, negative or neutral polarity is a method to understand people's thoughts about products, services, persons, organisations, and so on. Interpreting and appropriately labelling the polarity of text data is a costly activity if performed by human experts. To cut this labelling cost, new cross-domain approaches have been developed, where the goal is to automatically classify the polarity of an unlabelled target text set of a given domain, for example movie reviews, from a labelled source text set of another domain, such as book reviews. Language heterogeneity between source and target domain is the trickiest issue in the cross-domain setting, so a preliminary transfer learning phase is generally required. The best-performing techniques addressing this point are generally complex and require onerous parameter tuning each time a new source-target pair is involved. This paper introduces a simpler method based on Markov chain theory to accomplish both the transfer learning and sentiment classification tasks. In fact, this straightforward technique requires a lower parameter calibration effort. Experiments on popular text sets show that our approach achieves performance comparable with other works.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: A new strategy of merging multiple search engine results using only the user query as a relevance criterion is presented, and a new score function is proposed that combines the similarity between the user query and the retrieved results with users' satisfaction toward the search engines used.
Abstract: Meta search engines are search tools that improve search performance by submitting user queries to multiple search engines and combining the different search results into a unified ranked list. The effectiveness of a meta search engine is closely related to the result merging strategy it employs, which remains the main issue in the design of such systems: with only the user query available as evidence of the user's information needs, it is hard to determine the best ranking of the merged results. In this paper we present a new strategy for merging multiple search engine results using only the user query as a relevance criterion. We propose a new score function combining the similarity between the user query and the retrieved results with users' satisfaction toward the search engines used. The proposed meta search engine can be used to merge the search results of any set of search engines.
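A sketch of the kind of merging score the abstract describes, combining query-result similarity with a per-engine satisfaction weight, is given below; the linear combination, the lam parameter, the similarity function and the satisfaction values are assumptions, since the paper's exact function is not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical per-engine user-satisfaction weights in [0, 1].
SATISFACTION = {"engineA": 0.9, "engineB": 0.6}

def merge_score(query, result_text, engine, lam=0.7):
    """Combine query/result similarity with satisfaction toward the source engine."""
    similarity = SequenceMatcher(None, query.lower(), result_text.lower()).ratio()
    return lam * similarity + (1.0 - lam) * SATISFACTION.get(engine, 0.5)

results = [("engineA", "cheap flights to rome"), ("engineB", "rome flight deals")]
ranked = sorted(results, key=lambda r: merge_score("flights to rome", r[1], r[0]),
                reverse=True)
print(ranked)
```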

Book ChapterDOI
12 Nov 2015
TL;DR: This paper presents the authors' experiences in importing, querying and visualizing graph databases, taking one of the most widespread lexical databases, WordNet, as a case study, and suggests a new visualization strategy for WordNet synonym rings by exploiting the features and concepts behind tag clouds.
Abstract: Data and information visualization is becoming strategic for the exploration and explanation of large data sets due to the great impact that data have from a human perspective. Visualization is the phase of the data life cycle closest to the users; thus, an effective, efficient and impressive representation of the analyzed data may be as important as the analytic process itself. In this paper, we present our experiences in importing, querying and visualizing graph databases, taking one of the most widespread lexical databases as a case study: WordNet. After defining a meta-model to translate WordNet entities into nodes and arcs inside a labeled oriented graph, we try to define some criteria to simplify the large-scale visualization of the WordNet graph, providing some examples and considerations which arise. Eventually, we suggest a new visualization strategy for WordNet synonym rings by exploiting the features and concepts behind tag clouds.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: An automatic ontology matching approach which brings a final alignment by combining three kinds of different similarity measures: lexical-based, structure- based, and semantic-based techniques as well as using information in ontologies including names, labels, comments, relations and positions of concepts in the hierarchy and integrating WordNet dictionary is presented.
Abstract: This paper presents an automatic ontology matching approach (called LSSOM - Lexical Structural Semantic-based Ontology Matching method) which brings a final alignment by combining three kinds of different similarity measures: lexical-based, structure-based, and semantic-based techniques as well as using information in ontologies including names, labels, comments, relations and positions of concepts in the hierarchy and integrating WordNet dictionary. Firstly, two ontologies are matched sequentially by using the lexical-based and structure-based similarity measures to find structural correspondences among the concepts. Secondly, the semantic similarity based on WordNet dictionary is applied to these concepts in given ontologies. After the semantic and structural similarities are obtained, they are combined in the parallel phase by using weighted sum method to yield the final similarities. Our system is implemented and evaluated based on the OAEI 2008 benchmark dataset. The experimental results show that our approach obtains good F-measure values and outperforms other automatic ontology matching systems which do not use instances information.
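The final combination step (a weighted sum of lexical, structural and WordNet-based semantic similarities) can be sketched roughly as below; the individual similarity functions shown (edit-ratio for the lexical part, WordNet path similarity for the semantic part), the weights and the structural score passed in are stand-ins for the paper's actual measures, not the LSSOM implementation.

```python
from difflib import SequenceMatcher
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def lexical_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def semantic_sim(a, b):
    """WordNet path similarity between the first noun senses, if any."""
    sa, sb = wn.synsets(a.lower(), pos=wn.NOUN), wn.synsets(b.lower(), pos=wn.NOUN)
    if not sa or not sb:
        return 0.0
    return sa[0].path_similarity(sb[0]) or 0.0

def combined_sim(a, b, structural, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of lexical, structural and semantic similarities."""
    w_lex, w_str, w_sem = weights
    return w_lex * lexical_sim(a, b) + w_str * structural + w_sem * semantic_sim(a, b)

# e.g. concepts "Author" and "Writer" with an assumed structural similarity of 0.5
print(combined_sim("Author", "Writer", structural=0.5))
```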

Proceedings ArticleDOI
12 Nov 2015
TL;DR: A novel Non-negative Matrix Factorization based on the logistic link function is proposed for the decomposition of binary data; choosing the number of components is found to be an essential, still unresolved part of the modelling and interpretation.
Abstract: We propose the Logistic Non-negative Matrix Factorization for decomposition of binary data. Binary data are frequently generated in, e.g., text analysis, sensory data and market basket data. A common method for analysing non-negative data is Non-negative Matrix Factorization, though this is in theory not appropriate for binary data, and thus we propose a novel Non-negative Matrix Factorization based on the logistic link function. Furthermore, we generalize the method to handle missing data. The formulation of the method is compared to a previously proposed logistic matrix factorization without a non-negativity constraint on the features. We compare the performance of the Logistic Non-negative Matrix Factorization to Least Squares Non-negative Matrix Factorization and Kullback-Leibler (KL) Non-negative Matrix Factorization on sets of binary data: a synthetic dataset, a set of student comments on their professors collected in a binary term-document matrix, and a sensory dataset. We find that choosing the number of components is an essential, still unresolved part of the modelling and interpretation.
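As a rough numerical illustration of the model (binary X approximated by sigmoid(W @ H) with non-negative W and H), here is a projected-gradient sketch minimising the Bernoulli negative log-likelihood; the paper's actual update rules and missing-data handling are not reproduced, and the learning rate, iteration count and initialisation are arbitrary.

```python
import numpy as np

def logistic_nmf(X, k, iters=3000, lr=0.01, seed=0):
    """Fit X (binary, n x m) ~ sigmoid(W @ H) with W, H >= 0 via projected gradient."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) * 0.1
    H = rng.random((k, m)) * 0.1
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(W @ H)))        # predicted probabilities
        R = P - X                                  # gradient of the logistic loss w.r.t. W @ H
        W = np.maximum(W - lr * (R @ H.T), 0.0)    # gradient step + projection onto >= 0
        H = np.maximum(H - lr * (W.T @ R), 0.0)
    return W, H

X = (np.random.default_rng(1).random((20, 12)) > 0.5).astype(float)
W, H = logistic_nmf(X, k=3)
recon = 1.0 / (1.0 + np.exp(-(W @ H))) > 0.5
print("reconstruction accuracy:", (recon == X.astype(bool)).mean())
```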

Proceedings ArticleDOI
12 Nov 2015
TL;DR: This article presents a bridge concept for connecting engineering data with an OWL-based ontology, and uses an example ontology containing security knowledge of automation systems as an example.
Abstract: Ontologies provide an effective way for describing and using knowledge of a specific domain. In engineering workflows the reusability and quick adoption of knowledge is needed for solving several tasks in efficient ways. Engineering data is mostly structured in hierarchical documents and exchange formats and not represented in ontologies. Therefore a connection between engineering data and the knowledge in ontologies is needed. In this article we present a bridge concept for connecting engineering data with an OWL-based ontology. For this we use an example ontology containing security knowledge of automation systems.

Book ChapterDOI
12 Nov 2015
TL;DR: This work studied whether Bitcoin’s trading volume is related to the web search and social volumes about Bitcoin, and investigated whether public sentiment, expressed in large-scale collections of daily Twitter posts, can be used to predict the Bitcoin market too.
Abstract: In recent years, Internet has completely changed the way real life works. In particular, it has been possible to witness the online emergence of web 2.0 services that have been widely used as communication media. On one hand, services such as blogs, tweets, forums, chats, email have gained wide popularity. On the other hand, due to the huge amount of available information, searching has become dominant in the use of Internet. Millions of users daily interact with search engines, producing valuable sources of interesting data regarding several aspects of the world. Bitcoin, a decentralized electronic currency, represents a radical change in financial systems, attracting a large number of users and a lot of media attention. In this work we studied whether Bitcoin’s trading volume is related to the web search and social volumes about Bitcoin. We investigated whether public sentiment, expressed in large-scale collections of daily Twitter posts, can be used to predict the Bitcoin market too. We achieved significant cross correlation outcomes, demonstrating the search and social volumes power to anticipate trading volumes of Bitcoin currency.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: A state-based hazard analysis process can be used to automate the analysis of preliminary hazard worksheets with the aims of making them more precise, disambiguating causal relationships, and supporting the proper definition of system boundaries.
Abstract: Hazard identification and hazard analysis are difficult and essential parts of safety engineering. These activities are very demanding and mostly manual. There is an increasing need for improved analysis tools and techniques. In this paper we report research that focuses on supporting the early stages of hazard identification. A state-based hazard analysis process is presented to explore dependencies between causes and consequences of hazards. The process can be used to automate the analysis of preliminary hazard worksheets with the aims of making them more precise, disambiguating causal relationships, and supporting the proper definition of system boundaries. An application example is presented for a railway system.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: The hashtags entered by users are analyzed to generate a folksonomy, a user-driven classification of information, which is mapped to a complex network so that all the typical analyses and evaluations of such a mathematical model can be applied.
Abstract: Instagram is a social network for smartphones, created in 2010 and acquired by Facebook in 2012. It currently has more than 300 million registered users and allows for the immediate upload of images (square, inspired by Polaroid), to which users can associate hashtags and comments. Moreover, connections can be created between users that share the same interests. In our work, we intend to analyze the hashtags entered by users: the use of such hashtags, as in other social networks such as Twitter, generates a folksonomy, that is, a user-driven classification of information. We intend to map that folksonomy to a complex network to which we can apply all the typical analyses and evaluations of such a mathematical model. Our purpose is to use the resulting complex network as a marketing tool, in order to improve brand or product awareness.
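A small sketch of how such a folksonomy can be mapped to a complex network: hashtags become nodes and co-occurrence within the same post becomes a weighted edge, after which standard network measures apply (here with networkx). The post data and the measures chosen are illustrative placeholders, not the paper's dataset or analysis.

```python
from itertools import combinations
import networkx as nx

posts = [                                   # placeholder posts: lists of hashtags
    ["#coffee", "#barista", "#milan"],
    ["#coffee", "#espresso"],
    ["#milan", "#travel", "#coffee"],
]

G = nx.Graph()
for tags in posts:
    for a, b in combinations(sorted(set(tags)), 2):
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)          # weight = number of co-occurrences

# Typical complex-network measures over the folksonomy graph.
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G, weight="weight"))
```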

Proceedings ArticleDOI
12 Nov 2015
TL;DR: It is demonstrated that the success of young researchers can be predicted more accurately based on their professional network than their established track records.
Abstract: Finding rising stars in academia early in their careers has many implications when hiring new faculty, applying for promotion, and/or requesting grants. Typically, the impact and productivity of a researcher are assessed by a popular measurement called the h-index that grows linearly with the academic age of a researcher. Therefore, h-indices of researchers in the early stages of their careers are almost uniformly low, making it difficult to identify those who will, in future, emerge as influential leaders in their field. To overcome this problem, we make use of social network analysis to identify young researchers most likely to become successful as measured by their h-index. We assume that the co-authorship graph reveals a great deal of information about the potential of young researchers. We built a social network of 62,886 researchers using the data available in CiteSeerx. We then designed and trained a linear SVM classifier to identify emerging authors based on their personal attributes and/or their networks of co-authors. We evaluated our classifier's ability to predict the future research impact of a set of 26,170 young researchers, those with an h-index of less than or equal to two in 2005. By examining their actual impact six years later, we demonstrate that the success of young researchers can be predicted more accurately based on their professional network than their established track records.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: It is argued that extracting information from this data is better guided by domain knowledge of the targeted use-case and the integration of a knowledge-driven approach with Machine Learning techniques is investigated in order to improve the quality of the Relation Extraction process.
Abstract: The increasing accessibility and availability of online data provides a valuable knowledge source for information analysis and decision-making processes. In this paper we argue that extracting information from this data is better guided by domain knowledge of the targeted use-case, and we investigate the integration of a knowledge-driven approach with Machine Learning techniques in order to improve the quality of the Relation Extraction process. Targeting the financial domain, we use Semantic Web Technologies to build the domain Knowledgebase, which is in turn exploited to collect distant supervision training data from semantic linked datasets such as DBPedia and Freebase. We conducted a series of experiments utilising a number of Machine Learning algorithms to report on the favourable implementations/configurations for successful Information Extraction in our targeted domain.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: This work calls statements including advice or requests proposed at previous meetings “task statements” and proposes a method for automatically extracting them using the maximum entropy method, and creates a probabilistic model using this method.
Abstract: We previously developed a discussion mining system that records face-to-face meetings in detail, analyzes their content, and conducts knowledge discovery. Looking back on past discussion content by browsing documents, such as minutes, is an effective means for conducting future activities. In meetings at which some research topics are regularly discussed, such as seminars in laboratories, the presenters are required to discuss future issues by checking urgent matters from the discussion records. We call statements including advice or requests proposed at previous meetings "task statements" and propose a method for automatically extracting them. With this method, based on certain semantic attributes and linguistic characteristics of statements, a probabilistic model is created using the maximum entropy method. A statement is judged to be a task statement or not according to this probability. A seminar-based experiment validated the effectiveness of the proposed extraction method.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: A novel Arabic relation extraction method that leverages linguistic features of the Arabic language in Web data to infer relations between entities, and builds a relation classifier using this data which predicts the relation type of new instances.
Abstract: Relation Extraction is an important preprocessing task for a number of text mining applications, including Information Retrieval, Question Answering and Ontology building, among others. In this paper, we propose a novel Arabic relation extraction method that leverages linguistic features of the Arabic language in Web data to infer relations between entities. Due to the lack of labeled Arabic corpora, we adopt the idea of distant supervision, where DBpedia, a large database of semantic relations extracted from Wikipedia, is used along with a large unlabeled text corpus to build the training data. We extract the sentences from the unlabeled text corpus and tag them using the corresponding DBpedia relations. Finally, we build a relation classifier using this data, which predicts the relation type of new instances. Our experimental results show that the system reaches an F-measure of 70% in detecting relations.
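The distant-supervision step (tagging sentences that contain an entity pair known from DBpedia with that pair's relation) can be sketched as below; the knowledge-base triples and sentences are placeholders, and real use would pull triples from DBpedia and apply Arabic-specific tokenisation and normalisation before feeding the tagged sentences to a relation classifier.

```python
# Distant supervision sketch: label sentences by matching known (subject, object) pairs.
kb_triples = [                       # placeholder triples (in practice: from DBpedia)
    ("الرياض", "capitalOf", "السعودية"),
    ("القاهرة", "capitalOf", "مصر"),
]

sentences = [
    "تعد الرياض عاصمة السعودية وأكبر مدنها",
    "زار الوفد القاهرة الأسبوع الماضي",
]

training_data = []
for sentence in sentences:
    for subj, relation, obj in kb_triples:
        if subj in sentence and obj in sentence:
            # The sentence becomes a (weakly labelled) training example for `relation`.
            training_data.append((sentence, subj, obj, relation))

print(training_data)
```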

Proceedings ArticleDOI
12 Nov 2015
TL;DR: This work follows a Divide and Conquer strategy, by defining multiple models (user behavioral patterns), which are exploited to evaluate a new transaction, in order to detect potential attempts of fraud.
Abstract: The rapid, exponential growth of e-commerce, driven both by the new opportunities offered by the Internet and by the spread of debit and credit card use in online purchases, has strongly increased the number of frauds, causing large economic losses to the businesses involved. The design of effective strategies to face this problem is, however, particularly challenging due to several factors, such as the heterogeneity and non-stationary distribution of the data stream, as well as the imbalanced class distribution. The problem is further complicated by the scarcity of public datasets, due to confidentiality issues, which prevents researchers from verifying new strategies in many data contexts. Unlike canonical state-of-the-art strategies, instead of defining a single model based on users' past transactions, we follow a divide-and-conquer strategy, defining multiple models (user behavioral patterns) which we exploit to evaluate a new transaction in order to detect potential fraud attempts. We can act on some parameters of this process in order to adapt the models' sensitivity to the operating environment. Since our models are trained only on a user's legitimate transactions, rather than on both legitimate and fraudulent ones, we can operate proactively, detecting fraudulent transactions that have never occurred in the past. This approach also overcomes the data imbalance problem that afflicts machine learning approaches. The proposed approach is evaluated by comparing it with one of the best-performing state-of-the-art approaches, Random Forests, using a real-world credit card dataset.
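As a loose illustration of the divide-and-conquer idea (one behavioural model per user, trained only on that user's legitimate transactions), the sketch below fits a per-user IsolationForest and flags transactions that fall outside the learned pattern; the feature set, the choice of anomaly detector and the decision threshold are stand-ins, not the paper's behavioural-pattern models.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder legitimate transaction histories: user -> array of [amount, hour].
histories = {
    "user_1": np.array([[12.5, 9], [8.0, 10], [15.0, 19], [9.9, 12],
                        [11.0, 9], [14.0, 20], [10.5, 11], [13.0, 18]]),
}

# One model per user, trained on legitimate transactions only (no fraud labels needed).
models = {u: IsolationForest(random_state=0).fit(X) for u, X in histories.items()}

def looks_fraudulent(user, transaction):
    """Flag a transaction that deviates from the user's learned legitimate pattern."""
    return models[user].predict(np.array([transaction]))[0] == -1

print(looks_fraudulent("user_1", [11.0, 10]))  # close to past behaviour
print(looks_fraudulent("user_1", [950.0, 3]))  # large night-time purchase
```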

Proceedings ArticleDOI
12 Nov 2015
TL;DR: In this paper, the authors present an approach for evaluation of application ontologies using crowdsourcing, involving application users, which can efficiently help in the improvement of an application ontology all along the ontology lifecycle.
Abstract: This paper presents the basis of our approach for evaluation of application ontologies. Adapting an existing task-based evaluation, this approach explains how crowdsourcing, involving application users, can efficiently help in the improvement of an application ontology all along the ontology lifecycle. A real case experiment on an application ontology designed for the semantic annotation of geobusiness user data illustrates the proposal.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: A generic knowledge-based framework for problem solving in Engineering, in a broad sense, and the KREM (Knowledge, Rules, Experience, Meta-Knowledge) architecture is presented, which improves the efficiency of classic KBSs.
Abstract: This article presents a generic knowledge-based framework for problem solving in Engineering, in a broad sense. After a discussion of the drawbacks of the traditional architecture used for deploying knowledge-based systems (KBS), the KREM (Knowledge, Rules, Experience, Meta-Knowledge) architecture is presented. The novelty of the proposal comes from the inclusion of experience capitalization and of meta-knowledge use into the previously discussed traditional architecture. KREM improves the efficiency of classic KBSs, as it permits dealing with incomplete expert knowledge models by progressively completing them, learning from experience. Also, the use of meta-knowledge can steer their execution more efficiently. This framework has been successfully used in different projects. Here, the architecture of the KREM model is presented along with some implementation issues, and three case studies are discussed.

Proceedings ArticleDOI
12 Nov 2015
TL;DR: It is shown that a system built on top of Solr and Hadoop has the best stability and manageability; while systems based on NoSQL databases present an attractive alternative in terms of performance.
Abstract: The usage of search engines is nowadays extended to intelligent analytics over petabytes of data. With Lucene being at the heart of the vast majority of information retrieval systems, several attempts have been made to bring it to the cloud in order to scale to big data. Efforts include implementing scalable distribution of the search indices over the file system, storing them in NoSQL databases, and porting them to inherently distributed ecosystems, such as Hadoop. We evaluate the existing efforts in terms of distribution, high availability, fault tolerance, manageability, and high performance. We believe that the key to supporting search indexing capabilities for big data can only be achieved through the use of common open-source technology deployed on standard cloud platforms such as Amazon EC2, Microsoft Azure, etc. For each approach, we build a benchmarking system by indexing the whole Wikipedia content and submitting hundreds of simultaneous search requests. We measure the performance of both indexing and searching operations. We simulate node failures and monitor the recoverability of the system. We show that a system built on top of Solr and Hadoop has the best stability and manageability, while systems based on NoSQL databases present an attractive alternative in terms of performance.
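The benchmarking setup (hundreds of simultaneous search requests against an index built from Wikipedia content) can be sketched generically with a thread pool issuing queries against a Solr select endpoint, as below; the host, core name, query mix and reported percentiles are placeholders, and the paper's actual harness and metrics are more elaborate.

```python
import time
from concurrent.futures import ThreadPoolExecutor
import requests

SOLR_SELECT = "http://localhost:8983/solr/wikipedia/select"  # placeholder core/host
QUERIES = ["title:rome", "text:bitcoin", 'text:"machine learning"'] * 100

def timed_search(q):
    start = time.perf_counter()
    requests.get(SOLR_SELECT, params={"q": q, "rows": 10}, timeout=30)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=50) as pool:   # simultaneous requests
    latencies = list(pool.map(timed_search, QUERIES))

latencies.sort()
print("mean latency (s):", sum(latencies) / len(latencies))
print("p95 latency (s): ", latencies[int(0.95 * len(latencies))])
```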