
Showing papers in "Information Processing and Management in 2015"


Journal ArticleDOI
TL;DR: This work describes a new Twitter entity disambiguation dataset, and conducts an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.
Abstract: Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

352 citations


Journal ArticleDOI
TL;DR: This work automatically annotates a set of 2012 US presidential election tweets for a number of attributes pertaining to sentiment, emotion, purpose, and style by crowdsourcing, and shows that the tweets convey negative emotions twice as often as positive.
Highlights: We automatically compile a dataset of 2012 US presidential election tweets. We annotate the tweets for sentiment, emotion, style, and purpose. We show that the tweets convey negative emotions twice as often as positive. We describe two automatic systems that predict emotion and purpose in tweets.

Abstract: Social media is playing a growing role in elections world-wide. Thus, automatically analyzing electoral tweets has applications in understanding how public sentiment is shaped, tracking public sentiment and polarization with respect to candidates and issues, understanding the impact of tweets from various entities, etc. Here, for the first time, we automatically annotate a set of 2012 US presidential election tweets for a number of attributes pertaining to sentiment, emotion, purpose, and style by crowdsourcing. Overall, more than 100,000 crowdsourced responses were obtained for 13 questions on emotions, style, and purpose. Additionally, we show through an analysis of these annotations that purpose, even though correlated with emotions, is significantly different. Finally, we describe how we developed automatic classifiers, using features from state-of-the-art sentiment analysis systems, to predict emotion and purpose labels, respectively, in new unseen tweets. These experiments establish baseline results for automatic systems on this new data.

234 citations


Journal ArticleDOI
TL;DR: This paper proposes a semantic approach to QA based on Natural Language Processing techniques, which allow a deep analysis of medical questions and documents, and on semantic Web technologies at both representation and interrogation levels; it also proposes a method for “Answer Search” based on semantic search and query relaxation.
Abstract: The Question Answering (QA) task aims to provide precise and quick answers to user questions from a collection of documents or a database. This kind of IR system is sorely needed with the dramatic growth of digital information. In this paper, we address the problem of QA in the medical domain where several specific conditions are met. We propose a semantic approach to QA based on (i) Natural Language Processing techniques, which allow a deep analysis of medical questions and documents and (ii) semantic Web technologies at both representation and interrogation levels. We present our Semantic Question-Answering System, called MEANS and our proposed method for “Answer Search” based on semantic search and query relaxation. We evaluate the overall system performance on real questions and answers extracted from MEDLINE articles. Our experiments show promising results and suggest that a query-relaxation strategy can further improve the overall performance.

175 citations


Journal ArticleDOI
TL;DR: A novel PU-learning method is proposed that is much more conservative than its original version when selecting the negative examples from the unlabeled ones, and that consistently outperforms the original PU-learning approach in the detection of both positive and negative deceptive opinions.
Highlights: Detection of negative deceptive opinion spam. Improved PU-learning approach. Compares the performance of the proposed approach and the original PU-learning method. The role of opinions' polarity in the detection of deception. Reports experimental results on a set of negative deceptive opinions.

Abstract: Nowadays a large number of opinion reviews are posted on the Web. Such reviews are a very important source of information for customers and companies. The former rely more than ever on online reviews to make their purchase decisions, and the latter to respond promptly to their clients' expectations. Unfortunately, due to the business interests behind them, there is an increasing number of deceptive opinions, that is, fictitious opinions that have been deliberately written to sound authentic, in order to deceive consumers by promoting a low quality product (positive deceptive opinions) or criticizing a potentially good quality one (negative deceptive opinions). In this paper we focus on the detection of both types of deceptive opinions, positive and negative. Due to the scarcity of examples of deceptive opinions, we propose to approach the problem of detecting deceptive opinions employing PU-learning. PU-learning is a semi-supervised technique for building a binary classifier on the basis of positive (i.e., deceptive opinions) and unlabeled examples only. Concretely, we propose a novel method that, with respect to its original version, is much more conservative when selecting the negative examples (i.e., not deceptive opinions) from the unlabeled ones. The obtained results show that the proposed PU-learning method consistently outperformed the original PU-learning approach. In particular, results show an average improvement of 8.2% and 1.6% over the original approach in the detection of positive and negative deceptive opinions respectively.
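The conservative negative-selection step lends itself to a short sketch. The following is a minimal, hypothetical illustration of one PU-learning iteration, not the authors' exact procedure: a classifier is trained on positive (deceptive) versus unlabeled reviews, and only the unlabeled reviews it scores as least likely to be deceptive are kept as reliable negatives. The features, scorer, and quantile threshold are assumptions.

```python
# Hedged sketch of one PU-learning iteration for deception detection.
# Assumptions: TF-IDF features, a Naive Bayes scorer, and a quantile
# cutoff; the paper's exact selection criterion may differ.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def select_reliable_negatives(positive_texts, unlabeled_texts, quantile=0.25):
    vec = TfidfVectorizer(min_df=2)
    X = vec.fit_transform(list(positive_texts) + list(unlabeled_texts))
    y = np.array([1] * len(positive_texts) + [0] * len(unlabeled_texts))
    clf = MultinomialNB().fit(X, y)
    # Probability of being deceptive for each unlabeled review.
    p_pos = clf.predict_proba(X[len(positive_texts):])[:, 1]
    # Conservative choice: only the lowest-scoring quantile becomes negative.
    cutoff = np.quantile(p_pos, quantile)
    return [t for t, p in zip(unlabeled_texts, p_pos) if p <= cutoff]
```

The selected reliable negatives would then be paired with the positives to train a final binary classifier, with the quantile controlling how conservative the selection is.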

148 citations


Journal ArticleDOI
TL;DR: The concept of coupling learning is discussed, focusing on the involvement of coupling relationships in learning systems, with great potential for building a deep understanding of the essence of business problems and handling challenges that have not been addressed well by existing learning theories and tools.
Abstract: Complex applications such as big data analytics involve different forms of coupling relationships that reflect interactions between factors related to technical, business (domain-specific) and environmental (including socio-cultural and economic) aspects. There are diverse forms of couplings embedded in poorly structured and ill-structured data. Such couplings are ubiquitous, implicit and/or explicit, objective and/or subjective, heterogeneous and/or homogeneous, and they go beyond the typical dependency, association and correlation relationships handled by existing learning systems in statistics, mathematics and computer science, presenting those systems with new complexities. Modeling and learning such couplings is thus fundamental but challenging. This paper discusses the concept of coupling learning, focusing on the involvement of coupling relationships in learning systems. Coupling learning has great potential for building a deep understanding of the essence of business problems and handling challenges that have not been addressed well by existing learning theories and tools. This argument is verified by several case studies on coupling learning, including handling coupling in recommender systems, incorporating couplings into coupled clustering, coupling document clustering, coupled recommender algorithms and coupled behavior analysis for groups.

136 citations


Journal ArticleDOI
TL;DR: Experimental results on three different cities show that the proposed unsupervised framework for planning personalized sightseeing tours in cities is effective, efficient and outperforms competitive baselines.
Abstract: We propose TripBuilder, an unsupervised framework for planning personalized sightseeing tours in cities. We collect categorized Points of Interest (PoIs) from Wikipedia and albums of geo-referenced photos from Flickr. By considering the photos as traces revealing the behaviors of tourists during their sightseeing tours, we extract from photo albums spatio-temporal information about the itineraries made by tourists, and we match these itineraries to the Points of Interest (PoIs) of the city. The task of recommending a personalized sightseeing tour is modeled as an instance of the Generalized Maximum Coverage (GMC) problem, where a measure of personal interest for the user given her preferences and visiting time-budget is maximized. The set of actual trajectories resulting from the GMC solution is scheduled on the tourist’s agenda by exploiting a particular instance of the Traveling Salesman Problem (TSP). Experimental results on three different cities show that our approach is effective, efficient and outperforms competitive baselines.
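As a rough illustration of the coverage step, the sketch below implements a generic greedy heuristic for a budgeted maximum-coverage problem: trajectories are chosen by interest gained per unit of visiting time until the budget is exhausted. The data shapes and the cost-benefit rule are assumptions; the paper solves a Generalized Maximum Coverage instance and then schedules the result with a TSP instance, which is omitted here.

```python
# Illustrative greedy heuristic for a budgeted coverage problem, in the
# spirit of (but not identical to) the GMC step described above.
# trajectories: list of (visiting_cost, set_of_poi_categories)
# interest: dict mapping category -> user preference weight
# budget: total visiting time available

def greedy_tour(trajectories, interest, budget):
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_gain = None, 0.0
        for cost, cats in trajectories:
            if (cost, cats) in chosen or spent + cost > budget:
                continue
            # Interest gained per unit of time, counting only new categories.
            gain = sum(interest.get(c, 0.0) for c in cats - covered) / cost
            if gain > best_gain:
                best, best_gain = (cost, cats), gain
        if best is None:
            break
        chosen.append(best)
        covered |= best[1]
        spent += best[0]
    return chosen
```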

105 citations


Journal ArticleDOI
TL;DR: A systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics is provided, that is, how to effectively employ learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
Abstract: Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings called bilingual LDA (BiLDA). We provide a thorough overview of this representative multilingual model from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation by means of output sets of (i) per-topic word distributions and (ii) per-document topic distributions coming from a multilingual probabilistic topic model in various real-life cross-lingual tasks involving different languages, without any external language pair dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications present in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language pair independent data representations by means of MuPTM. We provide clear directions for future research in the field by providing a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
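To make the "shared topic space" idea concrete, here is a minimal sketch with made-up distributions: once a multilingual topic model such as BiLDA assigns every document, whatever its language, a distribution over the same K cross-lingual topics, a distributional distance such as Jensen-Shannon divergence yields a translation-free cross-lingual similarity.

```python
# Minimal sketch of translation-free cross-lingual similarity from
# per-document topic distributions; the distributions are placeholders,
# and training the multilingual topic model itself is out of scope.
import numpy as np

def jensen_shannon(p, q):
    """JS divergence with base-2 logs, so the value lies in [0, 1]."""
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

theta_en = np.array([0.70, 0.20, 0.10])  # English doc over K=3 shared topics
theta_nl = np.array([0.65, 0.25, 0.10])  # Dutch doc over the same topics
print("cross-lingual similarity:", 1.0 - jensen_shannon(theta_en, theta_nl))
```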

102 citations


Journal ArticleDOI
TL;DR: The study results show that there are several important themes with a high correlation in Chinese RecSys research, which is considered to be relatively focused, mature, and well-developed overall.
Highlights: We examine RecSys studies in China quantitatively, empirically, and longitudinally. Research on RecSys is relatively mature and well-developed overall. Twelve theme-clusters and six larger branches are identified. Some undeveloped or immature research themes continue to persist. Emerging themes with great potential for development are also identified.

Abstract: This paper examines the research patterns and trends of Recommendation System (RecSys) in China during the period of 2004-2013. Data (keywords in articles) was collected from the China Academic Journal Network Publishing Database (CAJD) and the China Science Periodical Database (CSPD). A co-word analysis was conducted to measure correlation among the extracted keywords. The cluster analysis and social network analysis revealed 12 theme-clusters, network characteristics (centrality and density) of the clusters, the strategic diagram, and the correlation network. The study results show that there are several important themes with a high correlation in Chinese RecSys research, which is considered to be relatively focused, mature, and well-developed overall. Some research themes have developed on a considerable scale, while others remain isolated and undeveloped. This study also identified a few emerging themes with great potential for development. It was also determined that studies overall on the applications of RecSys are increasing.

102 citations


Journal ArticleDOI
TL;DR: This issue, known as the overspecialization or serendipity problem, is investigated by proposing a strategy that fosters the suggestion of surprisingly interesting items the user might not have otherwise discovered, while limiting the loss of accuracy.
Highlights: We design a Knowledge Infusion (KI) process for providing systems with background knowledge. We design a KI-based recommendation algorithm for providing serendipitous recommendations. An in vitro evaluation shows the effectiveness of the proposed approach. We collected implicit emotional feedback on serendipitous recommendations. Results show that serendipity is moderately correlated with surprise and happiness.

Abstract: Recommender systems are filters which suggest items or information that might be interesting to users. These systems analyze the past behavior of a user, build her profile that stores information about her interests, and exploit that profile to find potentially interesting items. The main limitation of this approach is that it may provide accurate but likely obvious suggestions, since recommended items are similar to those the user already knows. In this paper we investigate this issue, known as the overspecialization or serendipity problem, by proposing a strategy that fosters the suggestion of surprisingly interesting items the user might not have otherwise discovered. The proposed strategy enriches a graph-based recommendation algorithm with background knowledge that allows the system to deeply understand the items it deals with. The hypothesis is that the infused knowledge could help to discover hidden correlations among items that go beyond simple feature similarity and therefore promote non-obvious suggestions. Two evaluations are performed to validate this hypothesis: an in vitro experiment on a subset of the hetrec2011-movielens-2k dataset, and a preliminary user study. Those evaluations show that the proposed strategy actually promotes non-obvious suggestions, while limiting the loss of accuracy.

96 citations


Journal ArticleDOI
TL;DR: It is hypothesized that explicit markers such as hashtags are the digital extralinguistic equivalent of non-verbal expressions that people employ in live interaction when conveying sarcasm.
Highlights: The use of hashtags such as #sarcasm reduces the further use of linguistic markers of sarcasm in tweets. Hashtags such as #sarcasm appear to be the extralinguistic equivalent of non-verbal expressions in live interaction. Sarcastic hashtags are 90% appropriate. Sarcastic tweets without hashtags are hard to distinguish from non-sarcastic hyperbolic tweets. In French tweets, the hashtag #sarcasme conveys a polarity switch less frequently than in Dutch.

Abstract: To avoid a sarcastic message being understood in its unintended literal meaning, sarcasm in microtexts such as messages on Twitter.com is often explicitly marked with a hashtag such as '#sarcasm'. We collected a training corpus of about 406 thousand Dutch tweets with hashtag synonyms denoting sarcasm. Assuming that the human labeling is correct (annotation of a sample indicates that about 90% of these tweets are indeed sarcastic), we train a machine learning classifier on the harvested examples, and apply it to a sample of a day's stream of 2.25 million Dutch tweets. Of the 353 explicitly marked tweets on this day, we detect 309 (87%) with the hashtag removed. We annotate the top of the ranked list of tweets most likely to be sarcastic that do not have the explicit hashtag. 35% of the top-250 ranked tweets are indeed sarcastic. Analysis indicates that the use of hashtags reduces the further use of linguistic markers for signaling sarcasm, such as exclamations and intensifiers. We hypothesize that explicit markers such as hashtags are the digital extralinguistic equivalent of non-verbal expressions that people employ in live interaction when conveying sarcasm. Checking the consistency of our finding in a language from another language family, we observe that in French the hashtag '#sarcasme' has a similar polarity switching function, albeit to a lesser extent.
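The harvesting-and-training setup described above can be sketched as a small distant-supervision pipeline: tweets carrying a sarcasm hashtag serve as positive examples, and the hashtag is stripped before training so the classifier must pick up the remaining linguistic signal. The feature set, learner, and hashtag pattern below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of distant supervision for sarcasm detection: the
# hashtag provides the label and is then removed from the text.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

MARKERS = re.compile(r"#sarcasme?\b", re.IGNORECASE)  # Dutch/French variants

def strip_marker(tweet):
    return MARKERS.sub("", tweet).strip()

def train(sarcastic_tweets, other_tweets):
    texts = [strip_marker(t) for t in sarcastic_tweets] + list(other_tweets)
    labels = [1] * len(sarcastic_tweets) + [0] * len(other_tweets)
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 3), min_df=2),  # word 1-3 grams
        LogisticRegression(max_iter=1000),
    )
    return model.fit(texts, labels)
```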

87 citations


Journal ArticleDOI
TL;DR: The research shows that text preprocessing algorithms are mandatory for mining opinions on the Web 2.0 and that some of these algorithms are sensitive to errors and mistakes contained in the user generated content.
Highlights: We carry out an empirical analysis to determine characteristics of social media channels. User generated content is "noisy" and contains mistakes, emoticons, etc. We evaluate text preprocessing algorithms regarding user generated content. Discussion of improvements to the opinion mining process.

Abstract: The emerging research area of opinion mining deals with computational methods in order to find, extract and systematically analyze people's opinions, attitudes and emotions towards certain topics. While providing interesting market research information, the user generated content existing on the Web 2.0 presents numerous challenges regarding systematic analysis, the differences and unique characteristics of the various social media channels being one of them. This article reports on the determination of such particularities, and deduces their impact on text preprocessing and opinion mining algorithms. The effectiveness of different algorithms is evaluated in order to determine their applicability to the various social media channels. Our research shows that text preprocessing algorithms are mandatory for mining opinions on the Web 2.0 and that some of these algorithms are sensitive to errors and mistakes contained in the user generated content.

Journal ArticleDOI
TL;DR: ClowdFlows, a cloud-based scientific workflow platform, and its extensions enabling the analysis of data streams and active learning are described; its advanced features are demonstrated using active learning with a linear Support Vector Machine for learning sentiment classification models to be applied to microblogging data streams.
Abstract: Sentiment analysis from data streams is aimed at detecting authors’ attitude, emotions and opinions from texts in real-time. To reduce the labeling effort needed in the data collection phase, active learning is often applied in streaming scenarios, where a learning algorithm is allowed to select new examples to be manually labeled in order to improve the learner’s performance. Even though there are many on-line platforms which perform sentiment analysis, there is no publicly available interactive on-line platform for dynamic adaptive sentiment analysis, which would be able to handle changes in data streams and adapt its behavior over time. This paper describes ClowdFlows, a cloud-based scientific workflow platform, and its extensions enabling the analysis of data streams and active learning. Moreover, by utilizing the data and workflow sharing in ClowdFlows, the labeling of examples can be distributed through crowdsourcing. The advanced features of ClowdFlows are demonstrated on a sentiment analysis use case, using active learning with a linear Support Vector Machine for learning sentiment classification models to be applied to microblogging data streams.
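The active learning loop at the core of such a streaming setup can be sketched as pool-based uncertainty sampling with a linear SVM: in each round, the examples closest to the decision boundary are sent out for manual (possibly crowdsourced) labels. This is a generic sketch assuming dense feature matrices, not ClowdFlows' actual implementation.

```python
# Hedged sketch of pool-based active learning with uncertainty sampling.
# `oracle(i)` is a stub returning the manual label of the i-th example
# currently in the pool; in ClowdFlows this role could be crowdsourced.
import numpy as np
from sklearn.svm import LinearSVC

def active_learning(X_labeled, y_labeled, X_pool, oracle, rounds=10, batch=20):
    for _ in range(rounds):
        clf = LinearSVC().fit(X_labeled, y_labeled)
        margins = np.abs(clf.decision_function(X_pool))
        ask = np.argsort(margins)[:batch]           # most uncertain examples
        y_new = np.array([oracle(i) for i in ask])  # request manual labels
        X_labeled = np.vstack([X_labeled, X_pool[ask]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, ask, axis=0)     # shrink the pool
    return clf
```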

Journal ArticleDOI
TL;DR: A method is proposed for comparing learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets; it consists of two measures, Normalized Winning Number and Ideal Winning Number.
Abstract: Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. In this paper we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: (1) Normalized Winning Number, which gives insight in the ranking accuracy of the learning to rank method, and (2) Ideal Winning Number, which gives insight in the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF are Pareto optimal learning to rank methods in the Normalized Winning Number and Ideal Winning Number dimensions, listed in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number.
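One plausible reading of the two measures can be written down directly; treat the sketch below as an interpretation rather than the paper's exact definitions. A method's winning number counts its pairwise wins over other methods on the datasets where both report results; the ideal winning number is the number of such comparisons available to it, and their ratio normalizes for sparse evaluation coverage.

```python
# Hedged sketch of Winning Number bookkeeping over sparse results.
# `results` maps (method, dataset) -> score (higher is better); not
# every method is evaluated on every dataset.

def winning_numbers(results, methods, datasets):
    nwn, iwn = {}, {}
    for m in methods:
        wins = possible = 0
        for d in datasets:
            if (m, d) not in results:
                continue
            for other in methods:
                if other == m or (other, d) not in results:
                    continue
                possible += 1                       # a comparison was possible
                wins += results[(m, d)] > results[(other, d)]
        iwn[m] = possible                           # Ideal Winning Number
        nwn[m] = wins / possible if possible else 0.0  # Normalized Winning Number
    return nwn, iwn
```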

Journal ArticleDOI
TL;DR: The results showed that participants with prior knowledge in the domain (experts in psychology) performed better (i.e. reached more correct answers after shorter search times) than non-experts, and this difference was stronger as the complexity of the problems increased.
Highlights: Three complexity levels of information problems are defined related to the psychology domain. 40 students in psychology and in other domains performed information problems with an online encyclopedia. Students in psychology performed better than the others, especially for complex problems. Students in psychology used more relevant strategies than the others. These expertise-related differences are stronger for the complex problems.

Abstract: This study addresses the impact of domain expertise (i.e. of prior knowledge of the domain) on the performance and query strategies used by users while searching for information. Twenty-four experts (psychology students) and 24 non-experts (students from other disciplines) had to search for psychology information from the Universalis website in order to perform six information problems of varying complexity: two simple problems (the keywords required to complete the task were provided in the problem statement), two more difficult problems (the keywords required had to be inferred) and two impossible problems (no answer was provided by the website). The results showed that participants with prior knowledge in the domain (experts in psychology) performed better (i.e. reached more correct answers after shorter search times) than non-experts. This difference was stronger as the complexity of the problems increased. This study also showed that experts and non-experts displayed different query strategies. Experts reformulated the impossible problems more often than non-experts, because they produced new queries with psychology-related keywords. The participants rarely used the thematic category tool, and when they did so this did not enhance their performance.

Journal ArticleDOI
TL;DR: This paper proposes some novel feature-based similarity assessment methods that are fully dependent on Wikipedia and can avoid most of the limitations and drawbacks of existing ontology-based measures.
Abstract: Semantic similarity assessment between concepts is an important task in many language related applications. In the past, several approaches to assess similarity by evaluating the knowledge modeled in an (or multiple) ontology (or ontologies) have been proposed. However, the existing measures have some limitations, such as relying on predefined ontologies and fitting only non-dynamic domains. Wikipedia provides a very large domain-independent encyclopedic repository and semantic network for computing semantic similarity of concepts with more coverage than usual ontologies. In this paper, we propose some novel feature-based similarity assessment methods that are fully dependent on Wikipedia and can avoid most of the limitations and drawbacks introduced above. To implement feature-based similarity assessment using Wikipedia, we first present a formal representation of Wikipedia concepts. We then give a framework for feature-based similarity based on the formal representation of Wikipedia concepts. Lastly, we investigate several feature-based approaches to semantic similarity measures resulting from instantiations of the framework. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, confirms the intuitions with respect to human judgements. Overall, several methods proposed in this paper correlate well with human judgements and constitute effective ways of determining similarity between Wikipedia concepts.
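A minimal instantiation of such a feature-based measure, under the assumption that a Wikipedia concept is represented by a set of features such as its categories and outgoing links, is Tversky's ratio model over feature sets; the toy concepts below are made up for illustration and are not the paper's exact instantiation.

```python
# Illustrative Tversky ratio model over assumed Wikipedia feature sets
# (categories plus outgoing links); weights alpha and beta are arbitrary.

def tversky(features_a, features_b, alpha=0.5, beta=0.5):
    common = len(features_a & features_b)
    only_a = len(features_a - features_b)
    only_b = len(features_b - features_a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0

car = {"Category:Vehicles", "Engine", "Wheel", "Road"}
bus = {"Category:Vehicles", "Engine", "Wheel", "Public transport"}
print(tversky(car, bus))  # 0.75
```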

Journal ArticleDOI
TL;DR: A novel sentiment-aware social media recommendation framework, referred to as SA_OCCF, is developed and an ensemble learning-based method is proposed to classify sentiments from affective texts to improve recommendation performance.
Highlights: We propose a sentiment-aware social media recommendation framework. An ensemble learning-based method is proposed to classify sentiments from affective texts. We conduct comprehensive experiments to verify the effectiveness of the proposed methods.

Abstract: Social media websites, such as YouTube and Flickr, are currently gaining in popularity. A large volume of information is generated by online users and how to appropriately provide personalized content is becoming more challenging. Traditional recommendation models are overly dependent on preference ratings and often suffer from the problem of "data sparsity". Recent research has attempted to integrate sentiment analysis results of online affective texts into recommendation models; however, these studies are still limited. The one class collaborative filtering (OCCF) method is more applicable in the social media scenario yet it is insufficient for item recommendation. In this study, we develop a novel sentiment-aware social media recommendation framework, referred to as SA_OCCF, in order to tackle the above challenges. We leverage inferred sentiment feedback information and OCCF models to improve recommendation performance. We conduct comprehensive experiments on a real social media web site to verify the effectiveness of the proposed framework and methods. The results show that the proposed methods are effective in improving the performance of the baseline OCCF methods.

Journal ArticleDOI
TL;DR: The main contribution of this work is the design of a novel gesture recognition system based solely on data from a single 3-dimensional accelerometer.
Abstract: Background: Our methodology describes a human activity recognition framework based on feature extraction and feature selection techniques where a set of time, statistical and frequency domain features taken from 3-dimensional accelerometer sensors are extracted. This framework specifically focuses on activity recognition using on-body accelerometer sensors. We present a novel interactive knowledge discovery tool for accelerometry in human activity recognition and study the sensitivity to the feature extraction parametrization. Results: The implemented framework achieved encouraging results in human activity recognition. We have implemented a new set of features extracted from wearable sensors that are ambitious from a computational point of view and able to ensure high classification results comparable with state-of-the-art wearable systems (Mannini et al. 2013). A feature selection framework is developed in order to improve the clustering accuracy and reduce computational complexity. Several clustering methods such as K-Means, Affinity Propagation, Mean Shift and Spectral Clustering were applied. The K-means methodology presented promising accuracy results for person-dependent and independent cases, with 99.29% and 88.57%, respectively. Conclusions: The presented study performs two different tests in intra- and inter-subject context and a set of 180 features is implemented which are easily selected to classify different activities. The implemented algorithm does not stipulate, a priori, any value for the time window or its overlap percentage of the signal but performs a search to find the best parameters that define the specific data. A clustering metric based on the construction of the data confusion matrix is also proposed. The main contribution of this work is the design of a novel gesture recognition system based solely on data from a single 3-dimensional accelerometer.
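The windowing, feature extraction, and clustering chain can be sketched briefly. The window length, step, and the three features below are placeholders rather than the parameters the paper's search procedure would find, and the random signal merely stands in for a real recording.

```python
# Illustrative sliding-window feature extraction from a 3-axis
# accelerometer signal, followed by K-means clustering of the windows.
import numpy as np
from sklearn.cluster import KMeans

def window_features(signal, win=128, step=64):
    """signal: array of shape (n_samples, 3)."""
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        spectrum = np.abs(np.fft.rfft(w, axis=0))
        feats.append(np.concatenate([
            w.mean(axis=0), w.std(axis=0),   # time-domain features
            spectrum[1:].argmax(axis=0),     # dominant frequency bin per axis
        ]))
    return np.array(feats)

signal = np.random.randn(10_000, 3)          # stand-in for a real recording
labels = KMeans(n_clusters=5, n_init=10).fit_predict(window_features(signal))
```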

Journal ArticleDOI

TL;DR: Experimental results illustrate that POS-RS can be used as a viable method for sentiment classification and has the potential of being successfully applied to other text classification problems.
Highlights: The rise of social media has fueled interest in sentiment classification. POS-RS is proposed for sentiment analysis based on part-of-speech analysis. Ten public datasets were investigated to verify the effectiveness of POS-RS. Experimental results reveal POS-RS can be used as a viable method.

Abstract: With the rise of Web 2.0 platforms, personal opinions, such as reviews, ratings, recommendations, and other forms of user-generated content, have fueled interest in sentiment classification in both academia and industry. In order to enhance the performance of sentiment classification, ensemble methods have been investigated by previous research and proven to be effective theoretically and empirically. We advance this line of research by proposing an enhanced Random Subspace method, POS-RS, for sentiment classification based on part-of-speech analysis. Unlike existing Random Subspace methods using a single subspace rate to control the diversity of base learners, POS-RS employs two important parameters, i.e. content lexicon subspace rate and function lexicon subspace rate, to control the balance between the accuracy and diversity of base learners. Ten publicly available sentiment datasets were investigated to verify the effectiveness of the proposed method. Empirical results reveal that POS-RS achieves the best performance by reducing bias and variance simultaneously compared to the base learner, i.e., Support Vector Machine. These results illustrate that POS-RS can be used as a viable method for sentiment classification and has the potential of being successfully applied to other text classification problems.
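The two-rate subspace sampling that distinguishes POS-RS from a plain Random Subspace ensemble can be sketched as follows; the rates, the base learner, and the split of feature columns into content-word and function-word groups are illustrative assumptions, and X is assumed dense.

```python
# Hedged sketch of a two-rate Random Subspace ensemble in the spirit of
# POS-RS: each base learner sees features sampled separately from
# content-word columns and function-word columns.
import numpy as np
from sklearn.svm import LinearSVC

def pos_rs_train(X, y, content_cols, function_cols,
                 content_rate=0.7, function_rate=0.3, n_learners=10, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_learners):
        cols = np.concatenate([
            rng.choice(content_cols, int(content_rate * len(content_cols)),
                       replace=False),
            rng.choice(function_cols, int(function_rate * len(function_cols)),
                       replace=False),
        ])
        ensemble.append((cols, LinearSVC().fit(X[:, cols], y)))
    return ensemble

def pos_rs_predict(ensemble, X):
    votes = np.array([clf.predict(X[:, cols]) for cols, clf in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote over {0,1}
```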

Journal ArticleDOI
TL;DR: A failure in identifying individual differences that may influence a person's likelihood to experience serendipity is found, in contrast with success in identifying how the environment in which the user is immersed may create a fertile environment for serendipity to occur.
Highlights: We developed a three-factor scale to measure a serendipitous digital environment. Type of digital environment (e.g. social media, database) may influence serendipity. Digital environment characteristics such as trigger-richness may influence serendipity. Individual differences such as openness do not significantly influence serendipity.

Abstract: Under what conditions is serendipity most likely to occur? How much is serendipity influenced by what a person brings to the process, and how much by the environment in which the person is immersed? This study assessed (a) selected human characteristics that may influence the ability to experience serendipity (openness to experience, extraversion, and locus of control) and (b) selected perceptions of the environment in which people are immersed, including the creative environment, and selected characteristics (trigger rich, highlights triggers, enables connections, and leads to the unexpected). Finally, the study examined the relationships among these internal, people-based and external, environmental variables. Professionals, academics, and students engaged in thesis work (N=289) responded to a web-based questionnaire that integrated six scales to measure these variables. Results were analysed using principal components analysis, multivariate analysis of variance, and multiple regression. We found some types of digital environments (e.g., websites, databases, search engines, intranets, social media sites) may be more conducive to serendipity than others, while environments that manifest selected characteristics (trigger-rich, enable connections, and lead to the unexpected) are perceived as more likely to foster serendipity than others. However, the perceived level of creativity expected in work environments was not associated with serendipity. In addition, while extraverted people may be more likely to experience serendipity in general, those who are open to experience or have an external locus of control are no more likely to experience serendipity than their counterparts. Notable from our findings was a failure in identifying individual differences that may influence a person's likelihood to experience serendipity, in contrast with our success in identifying how the environment in which the user is immersed may create a fertile environment for serendipity to occur.

Journal ArticleDOI
TL;DR: A text mining based method, the three-phase prediction (TPP) algorithm, is designed to allow the general public to use everyday vocabulary to describe their problems and find pertinent statutes for their cases.
Abstract: Applying text mining techniques to legal issues has been an emerging research topic in recent years. Although a few previous studies focused on assisting professionals in the retrieval of related legal documents, to our knowledge, no previous studies could provide relevant statutes to the general public using problem statements. In this work, we design a text mining based method, the three-phase prediction (TPP) algorithm, which allows the general public to use everyday vocabulary to describe their problems and find pertinent statutes for their cases. The experimental results indicate that our approach can help the general public, who are not familiar with professional legal terms, to acquire relevant statutes more accurately and effectively.

Journal ArticleDOI
TL;DR: Evidence is found to suggest that the User Engagement Scale can differentiate between systems and experimental conditions, and that a four-factor structure may be more appropriate than the original six-factor structure proposed in earlier work.
Highlights: We examined the robustness of the User Engagement Scale (UES). Three studies were conducted in Canada and the United Kingdom. The UES sub-scales were reliable across three samples of online news browsers. A four-factor structure, rather than six, may be more appropriate. The UES differentiated between online news sources and experimental conditions.

Abstract: Questionnaires are commonly used to measure attitudes toward systems and perceptions of search experiences. Whilst the face validity of such measures has been established through repeated use in information retrieval research, their reliability and wider validity are not typically examined; this threatens internal validity. The evaluation of self-report questionnaires is important not only for the internal validity of studies and, by extension, increased confidence in the results, but also for examining constructs of interest over time and across different domains and systems. In this paper, we look at a specific questionnaire, the User Engagement Scale (UES), for its robustness as a measure. We describe three empirical studies conducted in the online news domain and investigate the reliability and validity of the UES. Our results demonstrate good reliability of the UES sub-scales; however, we argue that a four-factor structure may be more appropriate than the original six-factor structure proposed in earlier work. In addition, we found evidence to suggest that the UES can differentiate between systems (in this case, online news sources) and experimental conditions (i.e., the type of media used to present online content).

Journal ArticleDOI
TL;DR: This research augments a frequency-based extraction method with PMI-IR, which utilizes web search in measuring the semantic similarity between aspect candidates and target entities, and extends RCut, an algorithm originally developed for text classification, to learn the threshold for selecting candidate aspects.
Abstract: Online review mining has been used to help manufacturers and service providers improve their products and services, and to provide valuable support for consumer decision making. Product aspect extraction is fundamental to online review mining. This research is aimed at improving the performance of aspect extraction from online consumer reviews. To this end, we augment a frequency-based extraction method with PMI-IR, which utilizes web search in measuring the semantic similarity between aspect candidates and target entities. In addition, we extend RCut, an algorithm originally developed for text classification, to learn the threshold for selecting candidate aspects. Experiment results with Chinese online reviews show that our proposed method not only outperforms the state-of-the-art frequency-based method for aspect extraction but also generalizes across different product domains and various data sizes.
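The PMI-IR scoring step can be sketched from the standard pointwise mutual information formula, with hit counts obtained from a search backend. `hit_count` below is a hypothetical stub to be wired to a web search API or a local index, and the quoted query templates are assumptions.

```python
# Sketch of PMI-IR association between an aspect candidate and the
# target entity, estimated from search hit counts. `hit_count` is a
# placeholder; plug in a real search backend here.
import math

def hit_count(query):
    raise NotImplementedError  # hypothetical search-backend call

def pmi_ir(aspect, entity, total_docs):
    joint = hit_count(f'"{aspect}" "{entity}"')
    if joint == 0:
        return float("-inf")
    p_joint = joint / total_docs
    p_a = hit_count(f'"{aspect}"') / total_docs
    p_e = hit_count(f'"{entity}"') / total_docs
    return math.log2(p_joint / (p_a * p_e))  # PMI in bits
```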

Journal ArticleDOI
TL;DR: It is shown that the use of hybrid features and multilingual, machine-translated data (even from other languages) can help to better distinguish relevant features for sentiment classification and thus increase the precision of sentiment analysis systems.
Highlights: We study different strategies to classify sentiment from tweets, using supervised learning with hybrid features. We experiment with English and Spanish data and compare against benchmark competitions. We employ machine-translated data from other languages for training. We show that the use of multilingual data improves the sentiment classification accuracy.

Abstract: Nowadays opinion mining systems play a strategic role in different areas such as Marketing, Decision Support Systems or Policy Support. Since the arrival of the Web 2.0, more and more textual documents containing information that express opinions or comments in different languages are available. Given the proven importance of such documents, the use of effective multilingual opinion mining systems has become of high importance to different fields. This paper presents the experiments carried out with the objective of developing a multilingual sentiment analysis system. We present initial evaluations of methods and resources performed in two international evaluation campaigns for English and for Spanish. After our participation in both competitions, additional experiments were carried out with the aim of improving the performance of both Spanish and English systems by using multilingual machine-translated data. Based on our evaluations, we show that the use of hybrid features and multilingual, machine-translated data (even from other languages) can help to better distinguish relevant features for sentiment classification and thus increase the precision of sentiment analysis systems.

Journal ArticleDOI
TL;DR: A novel query expansion method that makes use of a minimal relevance feedback to expand the initial query with a structured representation composed of weighted pairs of words based on the Probabilistic Topic Model is proposed.
Abstract: This paper proposes a novel query expansion method to improve accuracy of text retrieval systems. Our method makes use of a minimal relevance feedback to expand the initial query with a structured representation composed of weighted pairs of words. Such a structure is obtained from the relevance feedback through a method for pairs of words selection based on the Probabilistic Topic Model. We compared our method with other baseline query expansion schemes and methods. Evaluations performed on TREC-8 demonstrated the effectiveness of the proposed method with respect to the baseline.
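A hedged sketch of the expansion idea: estimate topics from the few feedback documents, then expand the query with pairs of high-probability words from the dominant topic, each pair weighted by the product of its words' topic weights. The selection heuristic below is an illustration, not the paper's exact pair-selection method.

```python
# Illustrative topic-model-based selection of weighted word pairs from
# a handful of relevance-feedback documents.
from itertools import combinations
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def expansion_pairs(feedback_docs, n_pairs=5, top_words=8):
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(feedback_docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    vocab = vec.get_feature_names_out()
    # Pick the topic that dominates the feedback documents overall.
    topic = lda.components_[lda.transform(X).sum(axis=0).argmax()]
    top = topic.argsort()[::-1][:top_words]
    pairs = [((vocab[i], vocab[j]), topic[i] * topic[j])
             for i, j in combinations(top, 2)]
    return sorted(pairs, key=lambda p: -p[1])[:n_pairs]
```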

Journal ArticleDOI
TL;DR: The experimental outcomes show the effectiveness of the model-based approach for data quality as they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities.
Abstract: We live in the Information Age, where most of the personal, business, and administrative data are collected and managed electronically. However, poor data quality may affect the effectiveness of knowledge discovery processes, thus making the development of data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. Here, the proposed approach has been realized through model checking techniques and then applied on a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach for data quality as they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community.

Journal ArticleDOI
TL;DR: This paper presents the generation of several resources for domain adaptation in polarity detection; the results show the validity of the new sentiment lexicons, which can be used as part of a polarity classifier.
Highlights: A lexicon-based domain adaptation method is proposed. Several domain polar lexicons were compiled following a corpus-based approach. The new resources are assessed over a Spanish corpus. The promising results encourage us to continue improving this domain adaptation method.

Abstract: One of the problems of opinion mining is the domain adaptation of sentiment classifiers. There are several approaches to tackling this problem. One of these is the integration of a list of opinion bearing words for the specific domain. This paper presents the generation of several resources for domain adaptation in polarity detection. In addition, the lack of resources in languages other than English has oriented our work towards developing sentiment lexicons for polarity classifiers in Spanish. The results show the validity of the new sentiment lexicons, which can be used as part of a polarity classifier.
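At classification time, a lexicon-based polarity classifier of the kind these resources feed can be as simple as summing word weights, as in this minimal sketch; the tiny Spanish lexicon is invented for illustration and is not one of the compiled resources.

```python
# Minimal lexicon-based polarity scoring: sum the weights of
# opinion-bearing words and take the sign. The lexicon is a made-up
# stand-in for a domain-adapted resource.

LEXICON = {"excelente": 1.0, "bueno": 0.5, "malo": -0.5, "horrible": -1.0}

def polarity(text):
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("El hotel es bueno pero el servicio es horrible"))  # negative
```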

Journal ArticleDOI
TL;DR: This work pursues the goal of creating a multi-purpose stemming tool that opens up possibilities for solving non-traditional tasks such as approximating lemmas or improving language modeling, and finds that the approach demands very little text data for training when compared with competing unsupervised algorithms.
Abstract: Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above-mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.

Journal ArticleDOI
TL;DR: A novel multi-document summarization system called FoDoSu, or Folksonomy-based Multi-Document Summarization, that employs the tag clusters used by Flickr, a Folksonomic system, for detecting key sentences from multiple documents is proposed.
Abstract: Multi-document summarization techniques aim to reduce documents into a small set of words or paragraphs that convey the main meaning of the original document. Many approaches to multi-document summarization have used probability-based methods and machine learning techniques to simultaneously summarize multiple documents sharing a common topic. However, these techniques fail to semantically analyze proper nouns and newly-coined words because most depend on an out-of-date dictionary or thesaurus. To overcome these drawbacks, we propose a novel multi-document summarization system called FoDoSu, or Folksonomy-based Multi-Document Summarization, that employs the tag clusters used by Flickr, a Folksonomy system, for detecting key sentences from multiple documents. We first create a word frequency table for analyzing the semantics and contributions of words using the HITS algorithm. Then, by exploiting tag clusters, we analyze the semantic relationships between words in the word frequency table. Finally, we create a summary of multiple documents by analyzing the importance of each word and its semantic relatedness to others. Experimental results from the TAC 2008 and 2009 data sets demonstrate the improvement of our proposed framework over existing summarization systems.
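The HITS-based word scoring that FoDoSu builds on can be sketched with a plain power iteration over a word co-occurrence graph; the two toy sentences and the symmetric co-occurrence links are assumptions made just to keep the example runnable.

```python
# Illustrative HITS power iteration over a word co-occurrence graph:
# words co-occurring in a sentence are linked, and the authority score
# serves as a rough word-importance signal.
import numpy as np

sentences = [["stock", "market", "falls"], ["market", "rally", "ends"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

A = np.zeros((len(vocab), len(vocab)))
for s in sentences:                       # link co-occurring words
    for u in s:
        for v in s:
            if u != v:
                A[idx[u], idx[v]] = 1.0

hub = np.ones(len(vocab))
for _ in range(50):                       # standard HITS updates
    auth = A.T @ hub; auth /= np.linalg.norm(auth)
    hub = A @ auth;  hub /= np.linalg.norm(hub)
print(dict(zip(vocab, auth.round(3))))    # authority ~ word importance
```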

Journal ArticleDOI
TL;DR: This paper utilizes a tensor model and a topic model simultaneously to extract latent semantic relations among asker, question and answerer, and proposes a learning procedure to obtain an optimal ranking of answerers for new questions by optimizing the multi-class AUC (Area Under the ROC Curve).
Abstract: Community question answering (CQA) services that enable users to ask and answer questions have become popular on the internet. However, many new questions cannot be resolved by appropriate answerers effectively. To address this question routing task, in this paper, we treat it as a ranking problem and rank the potential answerers by the probability that they are able to solve the given new question. We utilize a tensor model and a topic model simultaneously to extract latent semantic relations among asker, question and answerer. Then, we propose a learning procedure based on the above models to obtain an optimal ranking of answerers for new questions by optimizing the multi-class AUC (Area Under the ROC Curve). Experimental results on two real-world CQA datasets show that the proposed method is able to predict appropriate answerers for new questions and outperforms other state-of-the-art approaches.

Journal ArticleDOI
TL;DR: In experiments with three benchmark shape, face and handwritten digit image data sets, the proposed method outperforms competitive spectral clustering methods that either follow semi-supervised or scalable strategies.
Highlights: We face the real-world problem of having a limited set of pairwise constraints. Using pairwise constraints, connected components (CC) are generated. The points' local neighborhoods within the same CC are dynamically adapted. Constraints are propagated to CC neighborhoods to increase the clustering accuracy. Scalability is ensured by following a landmark strategy.

Abstract: In this paper, we present an efficient spectral clustering method for large-scale data sets, given a set of pairwise constraints. Our contribution is threefold: (a) clustering accuracy is increased by injecting prior knowledge of the data points' constraints into a small affinity submatrix; (b) connected components are identified automatically based on the data points' pairwise constraints, thus generating isolated "islands" of points; furthermore, local neighborhoods of points of the same connected component are adapted dynamically, and constraints propagation is performed so as to further increase the clustering accuracy; finally (c) the complexity is kept low, by following the sparse coding strategy of a landmark spectral clustering. In our experiments with three benchmark shape, face and handwritten digit image data sets, we show that the proposed method outperforms competitive spectral clustering methods that either follow semi-supervised or scalable strategies.
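The first contribution, injecting pairwise constraints into the affinity matrix, can be sketched in a few lines; the propagation, connected-component, and landmark machinery of the paper are omitted, and the RBF affinity with gamma=1.0 is an arbitrary choice for illustration.

```python
# Hedged sketch of constraint injection before spectral clustering:
# must-link pairs get maximal affinity, cannot-link pairs get none.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def constrained_spectral(X, must_link, cannot_link, k):
    W = rbf_kernel(X, gamma=1.0)          # base affinity matrix
    for i, j in must_link:
        W[i, j] = W[j, i] = 1.0           # force high affinity
    for i, j in cannot_link:
        W[i, j] = W[j, i] = 0.0           # sever the connection
    model = SpectralClustering(n_clusters=k, affinity="precomputed")
    return model.fit_predict(W)
```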