
Showing papers in "Information Processing and Management in 2017"


Journal ArticleDOI
TL;DR: A novel metaheuristic method (CSK), based on K-means and cuckoo search, is proposed to find the optimum cluster-heads from the sentimental contents of Twitter datasets.
Abstract: A hybrid cuckoo search method (CSK) has been presented for Twitter sentiment analysis. CSK modifies the random initialization of the population in cuckoo search (CS) by K-means to resolve the problem of random initialization. The proposed algorithm has outperformed five popular algorithms. Statistical analysis has been done to validate the performance of the proposed algorithm. Sentiment analysis is one of the prominent fields of data mining that deals with the identification and analysis of sentimental content generally available on social media. Twitter is one such social media platform, used by many people to post tweets about various topics. These tweets can be analyzed to find the viewpoints and sentiments of the users by using clustering-based methods. However, due to the subjective nature of Twitter datasets, metaheuristic-based clustering methods outperform the traditional methods for sentiment analysis. Therefore, this paper proposes a novel metaheuristic method (CSK) based on K-means and cuckoo search. The proposed method has been used to find the optimum cluster-heads from the sentimental contents of Twitter datasets. The efficacy of the proposed method has been tested on different Twitter datasets and compared with particle swarm optimization, differential evolution, cuckoo search, improved cuckoo search, Gauss-based cuckoo search, and two n-gram methods. Experimental results and statistical analysis validate that the proposed method outperforms the existing methods. The proposed method has theoretical implications for future research analyzing data generated through social networks/media. It also has very general practical implications for designing a system that can provide conclusive reviews on any social issue.
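The paper's core modification is seeding the cuckoo search population with K-means cluster heads instead of random points. A minimal Python sketch of that initialization step, assuming a numeric feature matrix (e.g., TF-IDF vectors of tweets); the perturbation scheme and function names are illustrative assumptions, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_seeded_population(X, n_clusters, n_nests, rng=None):
    """Seed a cuckoo-search population with K-means cluster heads.

    Each nest is one candidate solution: an (n_clusters, n_features)
    matrix of cluster heads. The first nest is the K-means result; the
    rest are small perturbations of it, replacing the usual fully
    random initialization."""
    rng = np.random.default_rng(rng)
    base = KMeans(n_clusters=n_clusters, n_init=10).fit(X).cluster_centers_
    scale = 0.1 * X.std(axis=0)  # perturbation magnitude (an assumption)
    nests = [base] + [base + rng.normal(size=base.shape) * scale
                      for _ in range(n_nests - 1)]
    return np.stack(nests)

def fitness(nest, X):
    """Clustering fitness: total squared distance of each point to its
    nearest cluster head (lower is better)."""
    d2 = ((X[:, None, :] - nest[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```

The Lévy-flight update loop of standard cuckoo search would then evolve these nests under this fitness.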

253 citations


Journal ArticleDOI
TL;DR: This study proposes a novel multi-text summarization technique for identifying the top-k most informative sentences of hotel reviews, and develops a new sentence importance metric.
Abstract: Text summarization techniques can extract essential information from online reviews. Our method can identify the top-k most informative sentences from online hotel reviews. We jointly considered author, review time, usefulness, and opinion factors. Online hotel reviews were collected from TripAdvisor for the experimental evaluation. The results show that our approach provides more comprehensive hotel information. Online travel forums and social networks have become the most popular platforms for sharing travel information, with enormous numbers of reviews posted daily. Automatically generated hotel summaries could aid travelers in selecting hotels. This study proposes a novel multi-text summarization technique for identifying the top-k most informative sentences of hotel reviews. Previous studies on review summarization have primarily examined content analysis, which disregards critical factors like author credibility and conflicting opinions. We considered such factors and developed a new sentence importance metric. Both content and sentiment similarities were used to determine the similarity of two sentences. To identify the top-k sentences, the k-medoids clustering algorithm was used to partition sentences into k groups. The medoids from these groups were then selected as the final summarization results. To evaluate the performance of the proposed method, we collected two sets of reviews for two hotels posted on TripAdvisor.com. A total of 20 subjects were invited to review the text summarization results from the proposed approach and two conventional approaches for the two hotels. The results indicate that the proposed approach outperforms the other two, and most of the subjects believed that the proposed approach can provide more comprehensive hotel information.
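The selection step the abstract describes, partitioning sentences with k-medoids over a combined content/sentiment similarity and reporting the medoids as the summary, can be sketched as follows; the equal weighting of the two similarities and the naive k-medoids loop are assumptions for illustration:

```python
import numpy as np

def combined_distance(content_sim, sentiment_sim, alpha=0.5):
    """Blend content and sentiment similarity matrices (values in [0, 1])
    into a distance matrix; alpha is an assumed weight."""
    return 1.0 - (alpha * content_sim + (1 - alpha) * sentiment_sim)

def k_medoids(D, k, n_iter=100, rng=None):
    """Naive k-medoids on a precomputed distance matrix D; returns the
    indices of the k medoid sentences, i.e. the summary."""
    rng = np.random.default_rng(rng)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)      # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):                       # medoid = member with the
                within = D[np.ix_(members, members)].sum(axis=1)
                new[j] = members[within.argmin()]  # least total distance
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids
```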

243 citations


Journal ArticleDOI
TL;DR: A set of regional tourist experiences related to a Southern European region and destination is explored, to derive patterns and opportunities of value creation generated by Big Data in tourism.
Abstract: This paper aims to demonstrate how the huge amount of Social Big Data available from tourists can nurture the value creation process for a Smart Tourism Destination. Applying a multiple-case study analysis, the paper explores a set of regional tourist experiences related to a Southern European region and destination, to derive patterns and opportunities of value creation generated by Big Data in tourism. Findings present and discuss evidence in terms of improving decision-making, creating marketing strategies with more personalized offerings, transparency and trust in dialogue with customers and stakeholders, and emergence of new business models. Finally, implications are presented for researchers and practitioners interested in the managerial exploitation of Big Data in the context of information-intensive industries and mainly in Tourism.

236 citations


Journal ArticleDOI
TL;DR: A novel, semi-automated, fully replicable analytical methodology based on a combination of machine learning algorithms and expert judgement is proposed to bring clarity to the heterogeneous nature of the skills required in Big Data professions.
Abstract: The rapid expansion of Big Data Analytics is forcing companies to rethink their Human Resource (HR) needs. However, at the same time, it is unclear which types of job roles and skills constitute this area. To this end, this study seeks to bring clarity to the heterogeneous nature of the skills required in Big Data professions, by analyzing a large amount of real-world job posts published online. More precisely, we: 1) identify four Big Data ‘job families’; 2) recognize nine homogeneous groups of Big Data skills (skill sets) that are being demanded by companies; 3) characterize each job family with the appropriate level of competence required within each Big Data skill set. We propose a novel, semi-automated, fully replicable analytical methodology based on a combination of machine learning algorithms and expert judgement. Our analysis leverages a significant amount of online job posts, obtained through web scraping, to generate an intelligible classification of job roles and skill sets. The results can support business leaders and HR managers in establishing clear strategies for the acquisition and development of the right skills needed to best leverage Big Data. Moreover, the structured classification of job families and skill sets will help establish a common dictionary to be used by HR recruiters and education providers, so that supply and demand can more effectively meet in the job marketplace.

193 citations


Journal ArticleDOI
TL;DR: A hybrid ensemble pruning scheme based on clustering and randomized search is proposed for text sentiment classification, together with a consensus clustering scheme to deal with the instability of clustering results.
Abstract: Sentiment analysis is a critical task of extracting subjective information from online text documents. Ensemble learning can be employed to obtain more robust classification schemes. However, most approaches in the field have incorporated feature engineering to build efficient sentiment classifiers. The purpose of our research is to establish an effective sentiment classification scheme by pursuing the paradigm of ensemble pruning. Ensemble pruning is a crucial method for building classifier ensembles with high predictive accuracy and efficiency. Previous studies have employed exponential search, randomized search, sequential search, ranking-based pruning and clustering-based pruning. However, there are tradeoffs in selecting ensemble pruning methods. In this regard, hybrid ensemble pruning schemes can be more promising. In this study, we propose a hybrid ensemble pruning scheme based on clustering and randomized search for text sentiment classification. Furthermore, a consensus clustering scheme is presented to deal with the instability of clustering results. The classifiers of the ensemble are initially clustered into groups according to their predictive characteristics. Then, two classifiers from each cluster are selected as candidate classifiers based on their pairwise diversity. The search space of candidate classifiers is explored by the elitist Pareto-based multi-objective evolutionary algorithm. For the evaluation task, the proposed scheme is tested on twelve balanced and unbalanced benchmark text classification tasks. In addition, the proposed approach is experimentally compared with three ensemble methods (AdaBoost, Bagging and Random Subspace) and three ensemble pruning algorithms (ensemble selection from libraries of models, Bagging ensemble selection and the LibD3C algorithm). Results demonstrate that consensus clustering and the elitist Pareto-based multi-objective evolutionary algorithm can be effectively used in ensemble pruning. The experimental analysis with conventional ensemble methods and pruning algorithms indicates the validity and effectiveness of the proposed scheme.
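The pruning step picks two classifiers per cluster by pairwise diversity. One standard pairwise diversity measure is the plain disagreement rate, sketched below; the abstract does not specify the exact measure used, so treat this as illustrative:

```python
import numpy as np

def disagreement(preds_a, preds_b):
    """Fraction of instances on which two classifiers disagree
    (a common pairwise diversity measure)."""
    return float(np.mean(np.asarray(preds_a) != np.asarray(preds_b)))

def most_diverse_pair(cluster_preds):
    """Return indices of the most mutually diverse pair among the
    prediction vectors of one cluster's classifiers."""
    best, pair = -1.0, (0, 1)
    for i in range(len(cluster_preds)):
        for j in range(i + 1, len(cluster_preds)):
            d = disagreement(cluster_preds[i], cluster_preds[j])
            if d > best:
                best, pair = d, (i, j)
    return pair
```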

181 citations


Journal ArticleDOI
TL;DR: A detailed analytical mapping of OMSA research work is presented and the progress of the discipline on various useful parameters is charted.
Abstract: The new transformed read-write Web has resulted in rapid growth of user-generated content on the Web, resulting in a huge volume of unstructured data. A substantial part of this data is unstructured text such as reviews and blogs. Opinion mining and sentiment analysis (OMSA) has emerged as a research discipline during the last 15 years and provides a methodology to computationally process unstructured data, mainly to extract opinions and identify their sentiments. The relatively new but fast-growing research discipline has changed a lot during these years. This paper presents a scientometric analysis of research work done on OMSA during 2000–2016. For the scientometric mapping, research publications indexed in the Web of Science (WoS) database are used as input data. The publication data is analyzed computationally to identify year-wise publication patterns, the rate of growth of publications, types of authorship of papers on OMSA, collaboration patterns in publications on OMSA, most productive countries, institutions, journals and authors, citation patterns and a year-wise citation reference network, and theme density plots and keyword bursts in OMSA publications during the period. A somewhat detailed manual analysis of the data is also performed to identify popular approaches (machine learning and lexicon-based) used in these publications, the levels (document, sentence or aspect-level) of sentiment analysis work done and the major application areas of OMSA. The paper presents a detailed analytical mapping of OMSA research work and charts the progress of the discipline on various useful parameters.

157 citations


Journal ArticleDOI
TL;DR: In order to build a complete structure of subject areas in iMetrics, both types of keywords are included in this study, and a two-dimensional map and a strategic diagram are drawn to clarify the structure, maturity, and cohesion of clusters.
Abstract: In order to build a complete structure of subject areas in iMetrics, both types of keywords are included in this study. Application of hierarchical clustering analysis led to the formation of 11 subject clusters in iMetrics. Analysis of the strategic diagram showed the two most comprehensive themes of iMetrics. In spite of their similarity with those of Courtial (1994), the clusters in this study differed in several aspects. As an iMetrics technique, co-word analysis is used to describe the status of various subject areas; however, iMetrics itself has not been examined by a co-word analysis. Using co-word analysis, this study investigates the intellectual structure of iMetrics during the period 1978 to 2014. The research data are retrieved from two core journals on iMetrics research (Scientometrics and Journal of Informetrics) and relevant articles in six journals publishing iMetrics studies. Application of hierarchical clustering led to the formation of 11 clusters representing the intellectual structure of iMetrics: Scientometric Databases and Indicators; Citation Analysis; Sociology of Science; Issues Related to Rankings of Universities, Journals, etc.; Information Visualization and Retrieval; Mapping the Intellectual Structure of Science; Webometrics; Industry-University-Government Relations; Technometrics (Innovation and Patents); Scientific Collaboration in Universities; and Basics of Network Analysis. Furthermore, a two-dimensional map and a strategic diagram are drawn to clarify the structure, maturity, and cohesion of clusters.

111 citations


Journal ArticleDOI
TL;DR: A new feature ranking (FR) metric, called normalized difference measure (NDM), which takes into account the relative document frequencies, is proposed; it outperforms seven well-known metrics in 66% of cases in terms of the macro-F1 measure and in 51% of cases in terms of the micro-F1 measure in the authors' experimental trials.
Abstract: We analyzed the balanced accuracy (ACC2) feature ranking metric and identified its drawbacks. We proposed to normalize balanced accuracy by the minimum of the tpr and fpr values. We compared results of the proposed feature ranking metric with seven well-known feature ranking metrics on seven datasets. The newly proposed metric outperforms in more than 60% of our experimental trials. The goal of feature selection in text classification is to choose highly distinguishing features for improving the performance of a classifier. The well-known text classification feature selection metric named balanced accuracy measure (ACC2) (Forman, 2003) evaluates a term by taking the difference of its document frequency in the positive class (also known as true positives) and its document frequency in the negative class (also known as false positives). This, however, results in assigning equal ranks to terms having equal difference, ignoring their relative document frequencies in the classes. In this paper we propose a new feature ranking (FR) metric, called normalized difference measure (NDM), which takes into account the relative document frequencies. The performance of NDM is investigated against seven well-known feature ranking metrics, including odds ratio (OR), chi squared (CHI), information gain (IG), distinguishing feature selector (DFS), gini index (GINI), balanced accuracy measure (ACC2) and Poisson ratio (POIS), on seven datasets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), a spam email dataset and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers. Our results show that the NDM metric outperforms the seven metrics in 66% of cases in terms of macro-F1 measure and in 51% of cases in terms of micro-F1 measure in our experimental trials on these datasets.
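Per the highlights, NDM normalizes ACC2 = |tpr - fpr| by min(tpr, fpr), so two terms with equal difference but different relative document frequencies no longer tie. A worked sketch (the zero-division guard is our assumption):

```python
def acc2(tpr, fpr):
    """Balanced accuracy measure (Forman, 2003): |tpr - fpr|."""
    return abs(tpr - fpr)

def ndm(tpr, fpr, eps=1e-6):
    """Normalized difference measure: |tpr - fpr| / min(tpr, fpr).
    eps guards against division by zero (an assumption here)."""
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)

# Two terms tie under ACC2 but are separated by NDM:
print(acc2(0.50, 0.30), ndm(0.50, 0.30))    # 0.2  ~0.667
print(acc2(0.25, 0.05), ndm(0.25, 0.05))    # 0.2  4.0
```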

107 citations


Journal ArticleDOI
TL;DR: This research paper investigates current trends and identifies existing challenges in the development of a big scholarly data platform, with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.
Abstract: A survey of big scholarly data with respect to the different phases of the big data lifecycle. Identifies the different big data tools and technologies that can be used for development of scholarly applications. Investigates research challenges and limitations specific to big scholarly data and its applications. Provides research directions and paves the way towards the development of a generic and comprehensive big scholarly data platform. Recently, there has been a shifting focus of organizations and governments towards digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of this generated data satisfy the big data definition, as a result of which this scholarly reserve is popularly referred to as big scholarly data. In order to facilitate data analytics for big scholarly data, architectures and services for the same need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications like collaborator discovery, expert finding and research recommendation systems, in addition to several others. This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.

104 citations


Journal ArticleDOI
TL;DR: The impact of the Big Five personality traits on human online information seeking is explored; individuals high in conscientiousness performed fastest in most information-seeking tasks, followed by those high in agreeableness and extraversion.
Abstract: We studied eye-movement behavior in different information-seeking tasks. We found three patterns of information seeking based on users' personality facets. Personality traits drive information seeking differently depending on the task. Eye-movement parameters can predict these patterns in different information-seeking behaviors. Although personality traits may influence information-seeking behavior, little is known about this topic. This study explored the impact of the Big Five personality traits on human online information seeking. For this purpose, it examined changes in eye-movement behavior in a sample of 75 participants (36 male and 39 female; age: 22-39 years; experience conducting online searches: 5-12 years) across three types of information-seeking tasks: factual, exploratory, and interpretive. The International Personality Item Pool Representation of the NEO PI-R (IPIP-NEO) was used to assess the participants' personality profiles. Hierarchical cluster analysis was used to categorize participants based on their personality traits. A three-cluster solution was found (cluster one consists of participants who scored high in conscientiousness; cluster two consists of participants who scored high in agreeableness; and cluster three consists of participants who scored high in extraversion). Results revealed that individuals high in conscientiousness performed fastest in most information-seeking tasks, followed by those high in agreeableness and extraversion. This study has important practical implications for intelligent human-computer interfaces, personalization, and related applications.

94 citations


Journal ArticleDOI
TL;DR: A learning framework is proposed to predict the best future ranking of experts on StackOverflow, currently one of the most successful CQAs; among all feature groups, user behaviors have the most influence in the estimation of future expertise probability.
Abstract: Community Question Answering (CQA) is one of the valuable information resources which provide users with a platform to share their knowledge. Finding potential experts in CQA helps with several problems like the low participation rate of users, long waiting times to receive answers and the low quality of answers. Many research papers have focused on retrieving the expert users of CQAs. Most of them take expertise into consideration only at query time and ignore the temporal aspects of the expert finding problem. However, considering the evolution of personal expertise over time can improve the quality of expert finding. In many applications, it is beneficial to find the potential experts of the future. The proper identification of potential experts in CQA can improve their skills and the overall user participation and engagement. Considering the dynamic aspects of the expert finding problem, we introduce the new problem of Future Expert Finding in this paper. Here, given the expertise evidence at the current time, we aim to predict the best ranking of experts in the future. We propose a learning framework to predict such a ranking on StackOverflow, which is currently one of the most successful CQAs. We examine the impact of various features to predict the probability of becoming an expert user in the future. Specifically, we consider four feature groups, namely topic similarity, emerging topics, user behavior and topic transition. The experimental results indicate the efficiency of the proposed models in comparison with several baseline models. Our experiments show that the performance of our proposed models can improve the MAP measure by up to 39.7% in comparison with our best baseline method. Moreover, we found that among all of these feature groups, user behaviors have the most influence in the estimation of future expertise probability.

Journal ArticleDOI
TL;DR: A novel method for detecting event-specific and informative tweets that are likely to be beneficial for emergency response is introduced and results indicate that the proposed method is able to detect event-related tweets with about 87% accuracy in a timely manner.
Abstract: The ubiquity of smartphones and social media such as Twitter is clearly blurring traditional boundaries between producers and consumers of information. This is especially the case in emergency situations where people in the scene create and share on-the-spot information about the incident in real time. However, despite the proven importance of such platforms, finding event-related information from the real-time feeds of thousands of tweets is a significant challenge. This paper introduces a novel method for detecting event-specific and informative tweets that are likely to be beneficial for emergency response. The method investigates a sample dataset of tweets which was collected during a storm event passing over a specific area. The sample is manually labelled by three emergency management experts who annotated the sample dataset to obtain the ground truth through identification of the event-related tweets. A selected number of representative event-related tweets are used to extract the common patterns and to define event related term-classes based on term frequency analysis. The term-classes are used to evaluate the event relatedness of a sample dataset through a relationship scoring process. Consequently, each sample tweet is given an event-relatedness score which indicates how related a tweet is to the storm event. The results are compared with the ground truth to determine the cut-off relatedness score and to evaluate the performance of the method. The results of the evaluation indicate that the proposed method is able to detect event-related tweets with about 87% accuracy in a timely manner.
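The scoring step lends itself to a simple sketch: each tweet gets an event-relatedness score from the term classes it matches, and tweets above a cut-off are flagged. The term classes and uniform weights below are invented for illustration, not the paper's actual classes:

```python
def relatedness_score(tweet_tokens, term_classes, class_weights=None):
    """Score a tweet by the event-related term classes it hits."""
    class_weights = class_weights or {c: 1.0 for c in term_classes}
    tokens = {t.lower() for t in tweet_tokens}
    return sum(w for c, w in class_weights.items() if tokens & term_classes[c])

# Hypothetical term classes for a storm event:
term_classes = {
    "weather": {"storm", "wind", "rain", "hail"},
    "damage":  {"flood", "outage", "fallen", "damage"},
}
tweet = "huge storm here power outage on our street".split()
print(relatedness_score(tweet, term_classes))  # 2.0 -> above cut-off: event-related
```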

Journal ArticleDOI
TL;DR: This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets that adopts several phases of text processing, features extraction and text classification.
Abstract: The rapid growth in digital information has raised considerable challenges, in particular when it comes to automated content analysis. On social media such as Twitter, users share a lot of information about their events, opinions, personalities, etc. Paraphrase Identification (PI) is concerned with recognizing whether two texts have the same/similar meaning, whereas Semantic Text Similarity (STS) is concerned with the degree of that similarity. This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets. The approach adopts several phases of text processing, feature extraction and text classification. Lexical, syntactic, and semantic features are extracted to overcome the weaknesses and limitations of current technologies in solving these tasks for the Arabic language. Maximum Entropy (MaxEnt) and Support Vector Regression (SVR) classifiers are trained using these features and are evaluated using a dataset prepared for this research. The experimental results show that the approach achieves good results in comparison to the baseline results.

Journal ArticleDOI
TL;DR: The experimental results show the robustness of the multilingual approach (1) and also that it outperforms the monolingual models on some monolingual datasets.
Abstract: This article tackles the problem of performing multilingual polarity classification on Twitter, comparing three techniques: (1) a multilingual model trained on a multilingual dataset, obtained by fusing existing monolingual resources, that does not need any language recognition step, (2) a dual monolingual model with perfect language detection on monolingual texts and (3) a monolingual model that acts based on the decision provided by a language identification tool. The techniques were evaluated on monolingual, synthetic multilingual and code-switching corpora of English and Spanish tweets. In the latter case we introduce the first code-switching Twitter corpus with sentiment labels. The samples are labelled according to two well-known criteria used for this purpose: the SentiStrength scale and a trinary scale (positive, neutral and negative categories). The experimental results show the robustness of the multilingual approach (1) and also that it outperforms the monolingual models on some monolingual datasets.

Journal ArticleDOI
TL;DR: Some novel methods for Information Content (IC) computation correlate well with human judgment and constitute effective ways of determining IC values for concepts and the semantic similarity between concepts.
Abstract: Some novel methods for Information Content (IC) computation are proposed. The presented IC computation methods focus on concepts drawn from Wikipedia. Several approaches to semantic similarity measurement for concepts are provided. The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It enables a better understanding of a concept's semantics. In the past, several approaches to compute the IC of a concept have been proposed. However, existing methods have limitations, such as relying on corpus availability, manual tagging, or predefined ontologies, and fitting only non-dynamic domains. Wikipedia provides a very large domain-independent encyclopedic repository and semantic network for computing the IC of concepts with more coverage than usual ontologies. In this paper, we propose some novel methods for IC computation of a concept to solve the shortcomings of existing approaches. The presented methods focus on the IC computation of a concept (i.e., Wikipedia category) drawn from the Wikipedia category structure. We propose several new IC-based measures to compute the semantic similarity between concepts. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, sustains the intuitions with respect to human judgments. Overall, some methods proposed in this paper correlate well with human judgment and constitute effective ways of determining IC values for concepts and semantic similarity between concepts.
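For context, two standard IC definitions that such Wikipedia-based methods aim to generalize (the paper's own category-structure formulas are not reproduced here): the corpus-based IC of Resnik and the intrinsic, taxonomy-based IC of Seco et al., where hypo(c) counts the concepts below c in the taxonomy and N is the total number of concepts:

```latex
\mathrm{IC}(c) = -\log p(c)
\qquad \text{(corpus-based, Resnik)}

\mathrm{IC}(c) = 1 - \frac{\log\bigl(\mathrm{hypo}(c) + 1\bigr)}{\log N}
\qquad \text{(intrinsic, Seco et al.)}
```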

Journal ArticleDOI
TL;DR: TensiStrength is a system to detect the strength of stress and relaxation expressed in social media text messages using a lexical approach and a set of rules to detect direct and indirect expressions of stress or relaxation, particularly in the context of transportation.
Abstract: Computer systems need to be able to react to stress in order to perform optimally on some tasks. This article describes TensiStrength, a system to detect the strength of stress and relaxation expressed in social media text messages. TensiStrength uses a lexical approach and a set of rules to detect direct and indirect expressions of stress or relaxation, particularly in the context of transportation. It is slightly more effective than a comparable sentiment analysis program, although their similar performances occur despite differences on almost half of the tweets gathered. The effectiveness of TensiStrength depends on the nature of the tweets classified, with tweets that are rich in stress-related terms being particularly problematic. Although generic machine learning methods can give better performance than TensiStrength overall, they exploit topic-related terms in a way that may be undesirable in practical applications and that may not work as well in more focused contexts. In conclusion, TensiStrength and generic machine learning approaches work well enough to be practical choices for intelligent applications that need to take advantage of stress information, and the decision about which to use depends on the nature of the texts analysed and the purpose of the task.
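A minimal sketch of the lexicon-plus-rules idea TensiStrength embodies; the lexicons, booster rule, and 1-5 scale below are illustrative assumptions, not TensiStrength's actual resources:

```python
STRESS_LEXICON = {"stuck": 2, "delayed": 2, "furious": 4, "panic": 4}  # assumed
RELAX_LEXICON = {"calm": 2, "relaxed": 3, "smooth": 2, "relief": 3}    # assumed
BOOSTERS = {"very": 1, "really": 1, "extremely": 2}                    # assumed

def tensi_scores(tokens):
    """Return (stress, relaxation) strengths as the maximum boosted term
    score found; 1 means no stress/relaxation expressed."""
    stress = relax = 1
    boost = 0
    for tok in (t.lower() for t in tokens):
        if tok in BOOSTERS:
            boost = BOOSTERS[tok]
            continue
        if tok in STRESS_LEXICON:
            stress = max(stress, min(5, STRESS_LEXICON[tok] + boost))
        if tok in RELAX_LEXICON:
            relax = max(relax, min(5, RELAX_LEXICON[tok] + boost))
        boost = 0
    return stress, relax

print(tensi_scores("train extremely delayed again".split()))  # (4, 1)
```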

Journal ArticleDOI
TL;DR: A research model is developed to investigate the factors (affective cues in particular) that drive users to instantly share information on microblogs, exploring the moderating role of gender; the results confirm the positive effects of informational, ambient, and social interactivity cues on individuals' positive emotion, which subsequently promotes their urge to share information on microblogs.
Abstract: An impulse-based model is proposed to explain instant information sharing. Information uniqueness and social interactivity increase positive emotion and urge. Males are stimulated by information uniqueness and information crowding. Social interactivity plays a dominant role in sparking the urge for female users. Males experience more positive emotion and engage in impulsive information sharing. Instant information sharing on microblogs is important for promoting social awareness, influencing customer attitudes, and providing political and economic benefits. However, research on the antecedents and mechanisms of such instant information sharing is limited. To address that issue, this study develops a research model to investigate the factors (affective cues in particular) that drive users to instantly share information on microblogs and explores the moderating role of gender. An online survey was conducted on a microblogging platform to collect data for testing the proposed research model and hypotheses. The results confirm the positive effects of informational (i.e., information uniqueness), ambient (i.e., information crowding), and social (i.e., social interactivity) cues on individuals' positive emotion, which subsequently promotes their urge to share information on microblogs. Moreover, the moderating effects of gender are identified. This study contributes to the understanding of instant information sharing from an impulsive behavior perspective. The results also provide important insights for service providers and practitioners who wish to promote instant information sharing on microblogs.

Journal ArticleDOI
TL;DR: This paper focuses on machine translation in the context of Arabic dialects, providing a survey of recent research in this area and providing a detailed description of the adopted approach.
Abstract: Arabic dialects, also called colloquial Arabic or vernaculars, are spoken varieties of Standard Arabic. These dialects have mixed forms with many variations due to the influence of ancient local tongues and other languages, like European ones. Many of these dialects are mutually incomprehensible. Arabic dialects were not written until recently and were used only in speech. Nowadays, with the advent of the internet and mobile telephony technologies, these dialects are increasingly used in written form. Indeed, this kind of communication has brought everyday conversations to a written format. This allows Arab people to use their dialects, which are their actual native languages, for expressing their opinions on social media, for chatting, texting, etc. This growing use opens new research directions for Arabic natural language processing (NLP). We focus, in this paper, on machine translation in the context of Arabic dialects. We provide a survey of recent research in this area. For each study, we report a detailed description of the adopted approach and give its most relevant contribution.

Journal ArticleDOI
TL;DR: A predictive model can be used to determine the significance and impact of discovered factors on credibility evaluations, and can guide future research on the design of automatic or semi-automatic systems for Web content credibility evaluation support.
Abstract: The goal of our research is to create a predictive model of Web content credibility evaluations, based on human evaluations. The model has to be based on a comprehensive set of independent factors that can be used to guide user’s credibility evaluations in crowdsourced systems like WOT, but also to design machine classifiers of Web content credibility. The factors described in this article are based on empirical data. We have created a dataset obtained from an extensive crowdsourced Web credibility assessment study (over 15 thousand evaluations of over 5000 Web pages from over 2000 participants). First, online participants evaluated a multi-domain corpus of selected Web pages. Using the acquired data and text mining techniques we have prepared a code book and conducted another crowdsourcing round to label textual justifications of the former responses. We have extended the list of significant credibility assessment factors described in previous research and analyzed their relationships to credibility evaluation scores. Discovered factors that affect Web content credibility evaluations are also weakly correlated, which makes them more useful for modeling and predicting credibility evaluations. Based on the newly identified factors, we propose a predictive model for Web content credibility. The model can be used to determine the significance and impact of discovered factors on credibility evaluations. These findings can guide future research on the design of automatic or semi-automatic systems for Web content credibility evaluation support. This study also contributes the largest credibility dataset currently publicly available for research: the Content Credibility Corpus (C3).

Journal ArticleDOI
TL;DR: An individual user's item relations can be utilized to remedy the problems occurring when the external relations are biased or insufficient, and the suggested method performed better than the basic item-based and user-based collaborative filtering methods in terms of Accuracy, Recall, and F1 scores for top-k recommendations.
Abstract: Recommendation systems are becoming important with the increased availability of online services. A typical approach used in recommendations is collaborative filtering. However, because it largely relies on external relations, such as items-to-items or users-to-users, problems occur when the relations are biased or insufficient. Focusing on that limitation, we here suggest a new method, item-network-based collaborative filtering, which recommends items through four steps. First, the system constructs item networks based on users' item usage history and calculates three types of centrality: betweenness, closeness, and degree. Next, the system secures significant items based on the betweenness centrality of the items in each user's item network. Then, by using the closeness and degree centrality of the secured items, the algorithm predicts preference scores for items and their rank orders from each user's perspective. In the last step, the system organizes a recommendation list based on the predicted scores. To evaluate the performance of our system, we applied it to a sample dataset of 196 Last.fm users' listening history and compared the results with those from existing collaborative filtering methods. The results showed that the suggested method performed better than the basic item-based and user-based collaborative filtering methods in terms of Accuracy, Recall, and F1 scores for top-k recommendations. This indicates that an individual user's item relations can be utilized to remedy the problems occurring when the external relations are biased or insufficient.
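The four steps map naturally onto standard graph tooling. A sketch with networkx (the network construction and score combination here are assumptions; the paper's exact formulas may differ):

```python
import networkx as nx

# Step 1: build an item network from a user's listening history, linking
# items that co-occur, and compute the three centralities.
G = nx.Graph()
G.add_edges_from([("songA", "songB"), ("songB", "songC"),
                  ("songA", "songC"), ("songC", "songD")])
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
degree = nx.degree_centrality(G)

# Step 2: secure significant items by betweenness.
significant = [n for n, b in betweenness.items() if b > 0]

# Steps 3-4: score items by closeness and degree (combination assumed),
# then rank to build the recommendation list.
scores = {n: closeness[n] + degree[n] for n in G}
recommendation_list = sorted(scores, key=scores.get, reverse=True)
```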

Journal ArticleDOI
TL;DR: A methodology to automatically feed a graph-based RS with features gathered from the LOD cloud is proposed and the impact of several widespread feature selection techniques in such recommendation settings is analyzed.
Abstract: We investigate the impact of the integration of knowledge coming from the LOD cloud in a graph-based recommendation framework. We propose a methodology to automatically feed a graph-based recommendation algorithm with features coming from the LOD cloud. We give guidelines to drive the choice of the feature selection technique, according to the needs of a specific recommendation scenario (i.e., maximize accuracy, maximize diversity). We validate our methodology by evaluating its effectiveness with respect to several state-of-the-art datasets. Thanks to the recent spread of the Linked Open Data (LOD) initiative, a huge amount of machine-readable knowledge encoded as RDF statements is today available in the so-called LOD cloud. Accordingly, a big effort is now spent to investigate to what extent such information can be exploited to develop new knowledge-based services or to improve the effectiveness of knowledge-intensive platforms such as Recommender Systems (RS). To this end, in this article we study the impact of the exogenous knowledge coming from the LOD cloud on the overall performance of a graph-based recommendation framework. Specifically, we propose a methodology to automatically feed a graph-based RS with features gathered from the LOD cloud and we analyze the impact of several widespread feature selection techniques in such recommendation settings. The experimental evaluation, performed on three state-of-the-art datasets, provided several outcomes: first, information extracted from the LOD cloud can significantly improve the performance of a graph-based RS. Next, experiments showed a clear correlation between the choice of the feature selection technique and the ability of the algorithm to maximize specific evaluation metrics, such as accuracy or diversity of the recommendations. Moreover, our graph-based algorithm fed with LOD-based features was able to overcome several baselines, such as collaborative filtering and matrix factorization.

Journal ArticleDOI
TL;DR: Examination of existing author profiling techniques for multilingual text consisting of English and Roman Urdu shows that content-based methods outperform stylistic-based methods for both the gender and age identification tasks, and that translation of the multilingual corpus to monolingual text does not improve results.
Abstract: We propose a multilingual (Roman Urdu and English) author profiling corpus of Facebook profiles. We manually developed a bilingual dictionary (Roman Urdu to English) of 7749 entries and translated the multilingual corpus using it. We applied 64 stylometric and 11 content-based features to the multilingual and translated corpora. Best results were obtained using word bigrams for age, and word unigrams and character 3- and 8-grams for gender identification. Author profiling is the identification of demographic features of an author by examining his written text. Recently, it has attracted the attention of the research community due to its potential applications in forensics, security, marketing, fake profile identification on online social networking sites, identifying the senders of harassing messages, etc. We need benchmark corpora to develop and evaluate techniques for author profiling. The majority of existing corpora are for English and other European languages, but not for under-resourced South Asian languages like Roman Urdu (Urdu written using English alphabets). Roman Urdu is used in daily communication by a large number of native speakers of Urdu around the world, particularly in Facebook posts/comments, Twitter tweets, blogs, chat logs and SMS messaging. The construction of sentences of Urdu while using the alphabet of English transforms the language properties of the text. We aim to investigate the behavior of existing author profiling techniques for multilingual text consisting of English and Roman Urdu, concretely for gender and age identification. We focus here on author profiling on Facebook by (i) developing a multilingual (Roman Urdu and English) corpus, (ii) manually building a bilingual dictionary for translating Roman Urdu words into English, (iii) modeling existing state-of-the-art author profiling techniques using content-based features (word and character n-grams) and 64 different stylistic features (11 lexical word-based features, 47 lexical character-based features and 6 vocabulary richness measures) for age and gender identification on the multilingual and translated corpora, and (iv) evaluating and comparing the behavior of the above-mentioned techniques on the multilingual and translated corpora. Our extensive empirical evaluation shows that (i) existing author profiling techniques can be used for multilingual text (Roman Urdu + English) as well as monolingual text (the corpus obtained after translating the multilingual corpus using the bilingual dictionary), (ii) content-based methods outperform stylistic-based methods for both the gender and age identification tasks and (iii) translation of the multilingual corpus to monolingual text does not improve results.
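The content-based feature modeling can be sketched with scikit-learn; the n-gram ranges echo the highlights (word n-grams, character 3-grams), while the classifier choice and toy data are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Word and character n-gram features, combined into one sparse matrix.
features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
])
profiler = Pipeline([("features", features), ("clf", LinearSVC())])

# Toy mixed Roman Urdu / English posts with invented labels:
posts = ["kya haal hai bro, long time!", "Meeting at 5, don't be late."]
genders = ["male", "female"]
profiler.fit(posts, genders)
print(profiler.predict(["chalo phir milte hain tomorrow"]))
```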

Journal ArticleDOI
TL;DR: A novel social recommendation method is proposed which is based on an adaptive neighbor selection mechanism and significantly outperforms several state-of-the-art recommendation methods.
Abstract: Recommender systems are techniques to make personalized recommendations of items to users. In e-commerce sites and online sharing communities, providing high quality recommendations is an important issue which can help users make effective decisions when selecting a set of items. Collaborative filtering is an important type of recommender system that produces user-specific recommendations of items based on patterns of ratings or usage (e.g. purchases). However, the quality of predicted ratings and neighbor selection for the users are important problems in recommender systems. Selecting a suitable neighbor set for each user improves the accuracy of rating prediction in the recommendation process. In this paper, a novel social recommendation method is proposed which is based on an adaptive neighbor selection mechanism. In the proposed method, first of all, the initial neighbor sets of the users are calculated using a clustering algorithm. In this step, the combination of historical ratings and social information between the users is used to form the initial neighbor sets. Then, these neighbor sets are used to predict initial ratings of the unseen items. Moreover, the quality of the initial predicted ratings is evaluated using a reliability measure which is based on the historical ratings and social information between the users. Then, a confidence model is proposed to remove useless users from the initial neighbors of the users and form a new adapted neighbor set for each user. Finally, new ratings of the unseen items are predicted using the new adapted neighbor sets of the users and the top-N items of interest are recommended to the active user. Experimental results on three real-world datasets show that the proposed method significantly outperforms several state-of-the-art recommendation methods.

Journal ArticleDOI
TL;DR: Prior domain knowledge improves older adults' query and navigation strategies and helps cope with the age-related decline of cognitive flexibility, although older adults were outperformed by young adults in open-ended information problems.
Abstract: Prior domain knowledge improves older adults' query and navigation strategies and helps cope with the age-related decline of cognitive flexibility. Unlike prior results, older adults were outperformed by young adults in open-ended information problems. In open-ended information problems, older adults did not benefit from their prior knowledge and produced semantically less relevant queries compared to fact-finding problems. This study focuses on the impact of age, prior domain knowledge and cognitive abilities on performance, query production and navigation strategies during information searching. Twenty older adults and nineteen young adults had to answer 12 information search problems of varying nature within two knowledge domains: health and manga. In each domain, participants had to perform two simple fact-finding problems (keywords provided and answer directly accessible on the search engine results page), two difficult fact-finding problems (keywords had to be inferred) and two open-ended information search problems (multiple answers possible and navigation necessary). Results showed that prior domain knowledge helped older adults improve navigation (i.e. reduced the number of webpages visited and thus decreased the feeling of disorientation), query production and reformulation (i.e. they formulated semantically more specific queries, and they inferred a greater number of new keywords).

Journal ArticleDOI
TL;DR: The empirical evaluations indicate that the Canberra and Clark distance measures tend to produce better effectiveness than the rest, at least in the context of an author profiling task.
Abstract: Determining some demographics about the author of a document (e.g., gender, age) has attracted many studies during the last decade. To solve this author profiling task, various classification models have been proposed based on stylistic features (e.g., function word frequencies, n-grams of letters or words, POS distributions), as well as various vocabulary richness or overall stylistic measures. To determine the targeted category, different distance measures have been suggested, without one approach clearly dominating all others. In this paper, 24 distance measures are studied, extracted from five general families of functions. Moreover, six theoretical properties are presented and we show that the Tanimoto and Matusita distance measures respect all proposed properties. To complement this analysis, 13 test collections extracted from the last CLEF evaluation campaigns are employed to evaluate empirically the effectiveness of these distance measures. This test set covers four languages (English, Spanish, Dutch, and Italian) and four text genres (blogs, tweets, reviews, and social media), with respect to two genders and four to five age groups. The empirical evaluations indicate that the Canberra and Clark distance measures tend to produce better effectiveness than the rest, at least in the context of an author profiling task. Moreover, our experiments indicate that having a training set closely related to the test set (e.g., the same collection) has a clear impact on the overall performance. The gender accuracy rate decreases by 7% (19% for age) when using the same text genre during training compared to using the same collection (leave-one-out methodology). Employing a different text genre in the training and test phases tends to hurt the overall performance, showing a decrease in the final accuracy rate of around 11% for gender classification and 26% for age.
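The two best-performing measures have simple closed forms worth writing out; minimal implementations over two authors' feature-frequency vectors (the eps guards against zero denominators and are our addition):

```python
import numpy as np

def canberra(x, y, eps=1e-12):
    """Canberra distance: sum_i |x_i - y_i| / (|x_i| + |y_i|)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y) / np.maximum(np.abs(x) + np.abs(y), eps)))

def clark(x, y, eps=1e-12):
    """Clark distance: sqrt(sum_i ((x_i - y_i) / (x_i + y_i))^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum(((x - y) / np.maximum(x + y, eps)) ** 2)))

# e.g., relative frequencies of three function words in two text profiles:
doc_a = [0.012, 0.003, 0.020]
doc_b = [0.010, 0.005, 0.018]
print(canberra(doc_a, doc_b), clark(doc_a, doc_b))
```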

Journal ArticleDOI
TL;DR: In this article, a hierarchical Dirichlet process (HDP) is proposed to learn sub-topics associated with sub-stories, which enables it to handle subtle variations in sub-stories.
Abstract: Social media has now become the de facto information source on real-world events. The challenge, however, due to the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time, a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories, and the corresponding task of their automatic detection as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories, which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality-sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real-world Twitter datasets using various evaluation measures. The ability of HDP to learn sub-topics helps it recall sub-stories with high precision. This has resulted in an improvement of up to 60% in the F-score performance of the HDP-based sub-story detection approach compared to standard story detection approaches. A similar performance improvement is also seen using an information-theoretic evaluation measure proposed for the sub-story detection task. Another contribution of this paper is in demonstrating that considering the conversational structures within the Twitter stream can bring up to 200% improvement in sub-story detection performance.
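HDP's appeal here is that the number of (sub-)topics is inferred from the data rather than fixed in advance. A minimal sketch with gensim's HdpModel on toy tweets (document grouping and any conversation-aware preprocessing are left out):

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

tweets = [["flight", "delayed", "storm"],
          ["storm", "damage", "roof"],
          ["airport", "closed", "storm", "delayed"]]
dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

hdp = HdpModel(corpus, id2word=dictionary)  # topic count inferred, not fixed
for doc in corpus:
    print(hdp[doc])  # per-tweet sub-topic mixture -> sub-story assignment
```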

Journal ArticleDOI
TL;DR: Analysis of the knowledge sharing behavior of users in the social Q&A process, in terms of their participation, interests, and connectedness, finds that individuals are more willing to share their knowledge in the question routing context while being less connected.
Abstract: We evaluate the effectiveness of question routing systems in the social Q&A process. We find that individuals are more willing to share their knowledge in the question routing context while being less connected. We build an effective model to automatically identify active knowledge sharers from non-sharers using non-Q&A features from four dimensions: profile, posting behavior, language style, and social activities. The increasing volume of questions posted on social question and answering sites has triggered the development of question routing services. Most of these routing algorithms are able to effectively recognize individuals with the required knowledge to answer a specific question. However, just because people have the capability to answer a question does not mean that they have the desire to help. In this research, we evaluate the practical performance of question routing services in a social context by analyzing the knowledge sharing behavior of users in the social Q&A process in terms of their participation, interests, and connectedness. We collected questions and answers over a ten-month period from Wenwo, a major Chinese question routing service. Using 340,658 questions and 1,754,280 replies, the findings reveal separate roles for knowledge sharers and consumers. Based on this finding, we identify knowledge sharers from non-sharers a priori in order to increase response probabilities. We evaluate our model based on an analysis of 3006 Wenwo knowledge sharers and non-sharers. Our experimental results demonstrate that knowledge sharer prediction based solely on non-Q&A features achieves a 70% success rate in accurately identifying willing respondents.

Journal ArticleDOI
TL;DR: The Hyperlink-Induced Topic Search (HITS) enhanced variant of the AKR technique performs better than the other techniques, satisfying most requirements for a reading list, and provides scope for extension in future information retrieval (IR) and content-based recommender system (RS) studies.
Abstract: The requirements for the task of building an initial reading list for a literature review are re-conceptualized, and a novel retrieval technique centered on author-specified keywords of papers is proposed for this task. The HITS variant of the proposed technique best satisfies the requirements of the task in an offline evaluation experiment. The proposed technique is evaluated by 132 researchers using 14 evaluation measures in a user evaluation study. Relevance, recency and usefulness were identified as the measures with high agreement percentages from participants. The student group was more satisfied with the results than the staff group. Three predictors of user satisfaction were identified through the evaluation study. An initial reading list is prepared by researchers at the start of a literature review to get an overview of the research performed in a particular area. Prior studies have taken the approach of merely recommending seminal or popular papers to aid researchers in such a task. In this paper, we present an alternative technique called the AKR (Author-specified Keywords based Retrieval) technique for providing popular, recent, survey and a diverse set of papers as part of the initial reading list. The AKR technique is based on a novel coverage value whose calculation is centered on author-specified keywords. We performed an offline evaluation experiment with four variants of the AKR technique along with three state-of-the-art approaches involving collaborative filtering and graph ranking algorithms. Findings show that the Hyperlink-Induced Topic Search (HITS) enhanced variant of the AKR technique performs better than the other techniques, satisfying most requirements for a reading list. A user evaluation study was conducted with 132 researchers to gauge user interest in the proposed technique using 14 evaluation measures. Results show that (i) the student group is more satisfied with the recommended papers than the staff group, (ii) the popularity measure is strongly correlated with the output quality measures and (iii) the measures familiarity, usefulness and agreeability on a good list were found to be strong predictors of user satisfaction. The AKR technique provides scope for extension in future information retrieval (IR) and content-based recommender system (RS) studies.
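For the HITS-enhanced variant, the ranking step can be sketched with networkx; the paper-keyword graph construction below is our assumption of one plausible setup, not the paper's exact graph:

```python
import networkx as nx

papers_keywords = {"paper1": ["retrieval", "ranking"],
                   "paper2": ["ranking", "recommendation"],
                   "paper3": ["recommendation"]}
G = nx.DiGraph()
for paper, kws in papers_keywords.items():
    for kw in kws:
        G.add_edge(paper, kw)  # paper endorses its author-specified keyword
        G.add_edge(kw, paper)  # keyword points back to the papers covering it

hubs, authorities = nx.hits(G, max_iter=1000)
reading_list = sorted(papers_keywords, key=lambda p: authorities[p], reverse=True)
print(reading_list)
```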

Journal ArticleDOI
TL;DR: The novel aspects of this paper are its focus on WOM quality classification instead of traditional sentiment polarity classification and its construction of domain-adaptable sentiment lexicons from contextual information.
Abstract: Word of mouth (WOM), also known as the passing of information from person to person or opinionated text, has become the main information resource for consumers when making purchase decisions. Whether WOM is a valuable reference source for consumers making a purchase is determined by the quality of the WOM. WOM quality classification is useful in filtering significant WOM documents from insignificant ones, and helps consumers to make their purchase decisions more efficiently. When a consumer has a negative experience, a lower rating score and negative text are generally provided, and vice versa. Regardless of the sentimental polarity, high-quality WOM (i.e. with a very high or very low rating score) has a stronger influence on consumer behavior than low-quality WOM (i.e. with a medium rating score). We build three contextual lexicons to maintain the relationship between words and their associated sentimental categories. We then apply the technique of preference vector modeling and evaluate our proposed approach with four classifiers. According to the experiments on the Internet Movie Database (IMDb) polarity dataset and the hotels.com dataset, the proposed contextual lexicon-concept-quality (CLCQ) and contextual lexicon-quality (CLQ) models outperform the benchmarks, i.e. the static first-sense SentiWordNet and average-sense SentiWordNet models. These results demonstrate that the proposed models can be used as a viable approach for WOM quality classification. The novel aspects of this paper are three-fold. Firstly, we focus on WOM quality classification instead of traditional sentimental polarity classification. Secondly, we build sentiment lexicons from the contextual information, which are adaptable to domains. Thirdly, we integrate these contextual sentiment lexicons with preference vector modeling for WOM quality classification and achieve an outstanding improvement.

Journal ArticleDOI
TL;DR: A modification is proposed in a support vector machine based ensemble algorithm which incorporates both oversampling and undersampling to improve prediction performance, and the combined effect of machine learning classifiers and sampling methods in sentiment classification under imbalanced data distributions is investigated.
Abstract: We propose three vector models by varying the feature size. We propose an integrative approach for imbalanced datasets. We analyze the effect of the imbalance ratio in sentiment learning. The proposed method performs more accurately than baseline models. Emerging technologies in online commerce, mobile and customer experience have transformed the retail industry so as to enable marketers to boost sales and customers to enjoy the most efficient online shopping. Online reviews significantly influence the purchase decisions of buyers and the marketing strategies employed by vendors in e-commerce. However, the vast number of reviews makes it difficult for customers to mine sentiments from online reviews. To address this problem, a sentiment mining system is needed to organize online reviews automatically into different sentiment orientation categories (e.g. positive/negative). Due to the imbalanced nature of positive and negative sentiments, real-time sentiment mining is a challenging machine learning task. The main objective of this research work is to investigate the combined effect of machine learning classifiers and sampling methods in sentiment classification under imbalanced data distributions. A modification is proposed in a support vector machine based ensemble algorithm which incorporates both oversampling and undersampling to improve prediction performance. Extensive experimental comparisons are carried out to show the effectiveness of the proposed method against several other classifiers in terms of the receiver operating characteristic (ROC) curve, the area under the ROC curve and the geometric mean.
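A sketch of combining both sampling directions inside an SVM ensemble, using imbalanced-learn's samplers; the half-over/half-under split and majority vote are our assumptions, not the paper's exact algorithm:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.svm import LinearSVC

def sampling_svm_ensemble(X, y, n_members=10, seed=0):
    """Train SVM members on alternately oversampled and undersampled
    views of an imbalanced training set."""
    members = []
    for i in range(n_members):
        sampler = (RandomOverSampler(random_state=seed + i) if i % 2 == 0
                   else RandomUnderSampler(random_state=seed + i))
        Xr, yr = sampler.fit_resample(X, y)
        members.append(LinearSVC().fit(Xr, yr))
    return members

def vote(members, X):
    """Majority vote over members; labels assumed to be 0/1."""
    preds = np.stack([m.predict(X) for m in members])
    return (preds.mean(axis=0) >= 0.5).astype(int)
```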