
Showing papers in "ACM Transactions on Intelligent Systems and Technology in 2012"


Journal ArticleDOI
TL;DR: libFM is a software implementation of factorization machines that features stochastic gradient descent (SGD) and alternating least-squares (ALS) optimization, as well as Bayesian inference using Markov Chain Monte Carlo (MCMC).
Abstract: Factorization approaches provide high accuracy in several important prediction problems, for example, recommender systems. However, applying factorization approaches to a new prediction problem is a nontrivial task and requires a lot of expert knowledge. Typically, a new model is developed, a learning algorithm is derived, and the approach has to be implemented. Factorization machines (FM) are a generic approach since they can mimic most factorization models just by feature engineering. This way, factorization machines combine the generality of feature engineering with the superiority of factorization models in estimating interactions between categorical variables of large domain. libFM is a software implementation for factorization machines that features stochastic gradient descent (SGD) and alternating least-squares (ALS) optimization, as well as Bayesian inference using Markov Chain Monte Carlo (MCMC). This article summarizes the recent research on factorization machines both in terms of modeling and learning, provides extensions for the ALS and MCMC algorithms, and describes the software tool libFM.
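The factorization machine model underlying libFM is a global bias plus linear terms plus factorized pairwise interactions. The sketch below is not libFM itself, just a minimal NumPy illustration of the prediction equation with made-up parameter values, using the standard reformulation that evaluates the interaction term in O(nk) time:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction for a single feature vector x.

    w0 : global bias (scalar)
    w  : per-feature weights, shape (n,)
    V  : factor matrix, shape (n, k); row i holds the latent factors of feature i
    Uses the O(n*k) reformulation of the pairwise interaction term.
    """
    linear = w0 + w @ x
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

# Toy usage: 5 features, 2 latent factors, random parameters.
rng = np.random.default_rng(0)
x = rng.random(5)
print(fm_predict(x, 0.1, rng.random(5), rng.random((5, 2))))
```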

1,271 citations


Journal ArticleDOI
TL;DR: This article provides a comprehensive review of the methods that have been proposed for music emotion recognition and concludes with suggestions for further research.
Abstract: The proliferation of MP3 players and the exploding amount of digital music content call for novel ways of music organization and retrieval to meet the ever-increasing demand for easy and effective information access. As almost every music piece is created to convey emotion, music organization and retrieval by emotion is a reasonable way of accessing music information. A good deal of effort has been made in the music information retrieval community to train a machine to automatically recognize the emotion of a music signal. A central issue of machine recognition of music emotion is the conceptualization of emotion and the associated emotion taxonomy. Different viewpoints on this issue have led to the proposal of different ways of emotion annotation, model training, and result visualization. This article provides a comprehensive review of the methods that have been proposed for music emotion recognition. Moreover, as music emotion recognition is still in its infancy, there are many open issues. We review the solutions that have been proposed to address these issues and conclude with suggestions for further research.

340 citations


Journal ArticleDOI
TL;DR: This study aims to leverage the wealth of these enriched online photos to analyze people’s travel patterns at the local level of a tour destination by building a statistically reliable database of travel paths from a noisy pool of community-contributed geotagged photos on the Internet.
Abstract: Recently, the phenomenal advent of photo-sharing services, such as Flickr and Panoramio, has led to voluminous community-contributed photos with text tags, timestamps, and geographic references on the Internet. The photos, together with their time- and geo-references, become the digital footprints of photo takers and implicitly document their spatiotemporal movements. This study aims to leverage the wealth of these enriched online photos to analyze people’s travel patterns at the local level of a tour destination. Specifically, we focus our analysis on two aspects: (1) tourist movement patterns in relation to the regions of attractions (RoA), and (2) topological characteristics of travel routes by different tourists. To do so, we first build a statistically reliable database of travel paths from a noisy pool of community-contributed geotagged photos on the Internet. We then investigate the tourist traffic flow among different RoAs by exploiting the Markov chain model. Finally, the topological characteristics of travel routes are analyzed by performing a sequence clustering on tour routes. Tests on four major cities demonstrate promising results of the proposed system.
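As a rough illustration of the traffic-flow step, first-order Markov transition probabilities between regions of attraction can be estimated by counting consecutive RoA visits in the reconstructed travel paths. The snippet below is a minimal sketch with hypothetical RoA labels, not the paper's pipeline:

```python
from collections import defaultdict

def estimate_transitions(paths):
    """Estimate a first-order Markov transition matrix over regions of attraction (RoAs)
    from tourists' visit sequences (each path is a list of RoA labels)."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: c / total for dst, c in dsts.items()}
    return probs

# Hypothetical visit sequences for illustration.
paths = [["Old Town", "Harbor", "Museum"],
         ["Harbor", "Museum", "Old Town"],
         ["Old Town", "Harbor", "Old Town"]]
print(estimate_transitions(paths)["Old Town"])
```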

223 citations


Journal ArticleDOI
TL;DR: A general methodology for inferring the occurrence and magnitude of an event or phenomenon by exploring the rich amount of unstructured textual information on the social part of the Web by investigating two case studies of geo-tagged user posts on the microblogging service of Twitter.
Abstract: We present a general methodology for inferring the occurrence and magnitude of an event or phenomenon by exploring the rich amount of unstructured textual information on the social part of the Web. Having geo-tagged user posts on the microblogging service of Twitter as our input data, we investigate two case studies. The first consists of a benchmark problem, where actual levels of rainfall in a given location and time are inferred from the content of tweets. The second one is a real-life task, where we infer regional Influenza-like Illness rates in an effort to detect an emerging epidemic disease in a timely manner. Our analysis builds on a statistical learning framework, which performs sparse learning via the bootstrapped version of LASSO to select a consistent subset of textual features from a large amount of candidates. In both case studies, selected features indicate close semantic correlation with the target topics, and inference, conducted by regression, achieves significant performance, especially given the short length (approximately one year) of Twitter’s data time series.
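The sparse-learning step can be approximated with a bootstrapped LASSO: fit the LASSO on many bootstrap resamples and keep only the features selected in most runs. The sketch below uses scikit-learn on synthetic data; the regularization strength and selection threshold are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_select(X, y, n_boot=100, alpha=0.1, freq=0.9, seed=0):
    """Bootstrap the LASSO and keep features selected in at least `freq` of the runs."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    hits = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        hits += (model.coef_ != 0)
    return np.flatnonzero(hits / n_boot >= freq)  # consistently selected feature indices

# Toy data: 200 samples, 50 candidate textual features (e.g., word frequencies).
rng = np.random.default_rng(1)
X = rng.random((200, 50))
y = 3 * X[:, 0] - 2 * X[:, 7] + 0.1 * rng.standard_normal(200)
print(bolasso_select(X, y))
```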

209 citations


Journal ArticleDOI
TL;DR: The results demonstrate that the proposed algorithm, even though unsupervised, outperforms machine learning solutions in the majority of cases, overall presenting a very robust and reliable solution for sentiment analysis of informal communication on the Web.
Abstract: Sentiment analysis is a growing area of research with significant applications in both industry and academia. Most of the proposed solutions are centered around supervised, machine learning approaches and review-oriented datasets. In this article, we focus on the more common informal textual communication on the Web, such as online discussions, tweets, and social network comments, and propose an intuitive, less domain-specific, unsupervised, lexicon-based approach that estimates the level of emotional intensity contained in text in order to make a prediction. Our approach can be applied to, and is tested in, two different but complementary contexts: subjectivity detection and polarity classification. Extensive experiments were carried out on three real-world datasets, extracted from online social Web sites and annotated by human evaluators, against state-of-the-art supervised approaches. The results demonstrate that the proposed algorithm, even though unsupervised, outperforms machine learning solutions in the majority of cases, overall presenting a very robust and reliable solution for sentiment analysis of informal communication on the Web.
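A minimal lexicon-based scorer in the same spirit (not the authors' algorithm) might look like the following: each term carries a signed emotional strength, a preceding negation flips it, and the maximum positive and negative intensities drive subjectivity and polarity decisions. The tiny lexicon and negation list are made up for illustration:

```python
def score_text(text, lexicon, negations=("not", "never", "no")):
    """Return (positive, negative) emotional intensity for a short informal message.

    `lexicon` maps a term to a signed strength in [-5, 5]; a preceding negation
    flips the sign. Thresholds on the two scores can then drive subjectivity
    detection (is any emotion present?) and polarity classification (which dominates?).
    """
    pos, neg = 0, 0
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        strength = lexicon.get(tok, 0)
        if i > 0 and tokens[i - 1] in negations:
            strength = -strength
        if strength > 0:
            pos = max(pos, strength)
        elif strength < 0:
            neg = max(neg, -strength)
    return pos, neg

# Hypothetical miniature lexicon.
lexicon = {"love": 4, "great": 3, "awful": -4, "boring": -2}
print(score_text("not boring , actually great", lexicon))   # -> (3, 0)
```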

203 citations


Journal ArticleDOI
TL;DR: A novel concept of a review graph is proposed to capture the relationships among all reviewers, reviews, and stores that the reviewers have reviewed as a heterogeneous graph; interactions between nodes in this graph are explored to reveal the cause of spam, and an iterative computation model is proposed to identify suspicious reviewers.
Abstract: Online shopping reviews provide valuable information for customers to compare the quality of products, store services, and many other aspects of future purchases. However, spammers are joining this community, trying to mislead consumers by writing fake or unfair reviews. Previous attempts have used reviewers’ behaviors, such as text similarity and rating patterns, to detect spammers. These studies are able to identify certain types of spammers, for instance, those who post many similar reviews about one target. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like normal reviewers, and thus cannot be detected by the available techniques. In this article, we propose a novel concept of a review graph to capture the relationships among all reviewers, reviews, and stores that the reviewers have reviewed as a heterogeneous graph. We explore how interactions between nodes in this graph could reveal the cause of spam and propose an iterative computation model to identify suspicious reviewers. In the review graph, we have three kinds of nodes, namely, reviewer, review, and store. We capture their relationships by introducing three fundamental concepts, the trustiness of reviewers, the honesty of reviews, and the reliability of stores, and identifying their interrelationships: a reviewer is more trustworthy if the person has written more honest reviews; a store is more reliable if it has more positive reviews from trustworthy reviewers; and a review is more honest if many other honest reviews support it. This is the first time such intricate relationships have been identified for spam detection and captured in a graph model. We further develop an effective computation method based on the proposed graph model. Unlike existing approaches, we do not use any review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.
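A toy version of such mutually reinforcing scores can be computed iteratively, as sketched below. The update rules and the `agreement` field (a stand-in for how strongly other reviews of the same store support a review) are illustrative assumptions, not the paper's exact formulas:

```python
import math

def review_graph_scores(reviews, n_iter=20):
    """Iteratively score reviewer trustiness, review honesty, and store reliability.

    `reviews` is a list of dicts: {"reviewer": r, "store": s, "agreement": a},
    where `agreement` in [-1, 1] is a hypothetical proxy for how strongly other
    reviews of the same store support this one.
    """
    trust = {r["reviewer"]: 0.5 for r in reviews}
    honesty = {i: 0.0 for i in range(len(reviews))}
    reliability = {r["store"]: 0.0 for r in reviews}
    for _ in range(n_iter):
        for i, r in enumerate(reviews):   # a review is more honest if supported and from a trusted reviewer
            honesty[i] = math.tanh(r["agreement"] * trust[r["reviewer"]])
        for s in reliability:             # a store is more reliable if its reviews are honest
            reliability[s] = math.tanh(sum(honesty[i] for i, r in enumerate(reviews) if r["store"] == s))
        for u in trust:                   # a reviewer is more trustworthy if their reviews are honest
            trust[u] = 1 / (1 + math.exp(-sum(honesty[i] for i, r in enumerate(reviews) if r["reviewer"] == u)))
    return trust, honesty, reliability

# Hypothetical reviews: u2's review contradicts the others.
reviews = [
    {"reviewer": "u1", "store": "s1", "agreement": 0.8},
    {"reviewer": "u1", "store": "s2", "agreement": 0.6},
    {"reviewer": "u2", "store": "s1", "agreement": -0.7},
]
trust, honesty, reliability = review_graph_scores(reviews)
print(trust)
```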

178 citations


Journal ArticleDOI
Shixia Liu, Michelle X. Zhou, Shimei Pan, Yangqiu Song, Weihong Qian, Weijia Cai, Xiaoxiao Lian
TL;DR: An enhanced, LDA-based topic analysis technique is introduced that automatically derives a set of topics to summarize a collection of documents and their content evolution over time and an effective visual metaphor is developed to transform abstract and often complex text summarization results into a comprehensible visual representation.
Abstract: We are building an interactive visual text analysis tool that aids users in analyzing large collections of text. Unlike existing work in visual text analytics, which focuses either on developing sophisticated text analytic techniques or inventing novel text visualization metaphors, ours tightly integrates state-of-the-art text analytics with interactive visualization to maximize the value of both. In this article, we present our work from two aspects. We first introduce an enhanced, LDA-based topic analysis technique that automatically derives a set of topics to summarize a collection of documents and their content evolution over time. To help users understand the complex summarization results produced by our topic analysis technique, we then present the design and development of a time-based visualization of the results. Furthermore, we provide users with a set of rich interaction tools that help them further interpret the visualized results in context and examine the text collection from multiple perspectives. As a result, our work offers three unique contributions. First, we present an enhanced topic modeling technique to provide users with a time-sensitive and more meaningful text summary. Second, we develop an effective visual metaphor to transform abstract and often complex text summarization results into a comprehensible visual representation. Third, we offer users flexible visual interaction tools as alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows promise, especially in support of complex text analyses.
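For orientation, a plain LDA baseline (not the enhanced, time-sensitive technique described above) can be fit with scikit-learn as below; a time-based analysis would repeat this per time slice and track how the topics evolve. The toy documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for a document collection.
docs = ["stock market rally lifts bank earnings",
        "storm brings heavy rainfall and flood warnings",
        "bank stock earnings beat market expectations",
        "flood damage after days of heavy rainfall"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]                 # top terms per topic
    print(f"topic {k}:", [terms[i] for i in top])
```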

168 citations


Journal ArticleDOI
TL;DR: A discussion of the design and implementation choices for each visual analysis technique is presented, followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find.
Abstract: We present TopicNets, a Web-based system for visual and interactive analysis of large sets of documents using statistical topic models. A range of visualization types and control mechanisms to support knowledge discovery are presented. These include corpus- and document-specific views, iterative topic modeling, search, and visual filtering. Drill-down functionality is provided to allow analysts to visualize individual document sections and their relations within the global topic space. Analysts can search across a dataset through a set of expansion techniques on selected document and topic nodes. Furthermore, analysts can select relevant subsets of documents and perform real-time topic modeling on these subsets to interactively visualize topics at various levels of granularity, allowing for a better understanding of the documents. A discussion of the design and implementation choices for each visual analysis technique is presented. This is followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find. These include a corpus of 50,000 successful NSF grant proposals, 10,000 publications from a large research center, and single documents including a grant proposal and a PhD thesis.

163 citations


Journal ArticleDOI
TL;DR: An overview of the AEGIS automated targeting capability is provided and how it is currently being used onboard the MER mission Opportunity rover is described.
Abstract: The Autonomous Exploration for Gathering Increased Science (AEGIS) system enables automated data collection by planetary rovers. AEGIS software was uploaded to the Mars Exploration Rover (MER) mission’s Opportunity rover in December 2009 and has successfully demonstrated automated onboard targeting based on scientist-specified objectives. Prior to AEGIS, images were transmitted from the rover to the operations team on Earth; scientists manually analyzed the images, selected geological targets for the rover’s remote-sensing instruments, and then generated a command sequence to execute the new measurements. AEGIS represents a significant paradigm shift---by using onboard data analysis techniques, the AEGIS software uses scientist input to select high-quality science targets with no human in the loop. This approach allows the rover to autonomously select and sequence targeted observations in an opportunistic fashion, which is particularly applicable for narrow field-of-view instruments (such as the MER Mini-TES spectrometer, the MER Panoramic camera, and the 2011 Mars Science Laboratory (MSL) ChemCam spectrometer). This article provides an overview of the AEGIS automated targeting capability and describes how it is currently being used onboard the MER mission Opportunity rover.

102 citations


Journal ArticleDOI
TL;DR: The model separates the concepts of community and topic, so one community can correspond to multiple topics and multiple communities can share the same topic; the results confirm the hypothesis that topics could help understand community structure, while community structure could help model topics.
Abstract: This article studies the problem of latent community topic analysis in text-associated graphs. With the development of social media, a lot of user-generated content is available with user networks. Along with rich information in networks, user graphs can be extended with text information associated with nodes. Topic modeling is a classic problem in text mining and it is interesting to discover the latent topics in text-associated graphs. Different from traditional topic modeling methods considering links, we incorporate community discovery into topic analysis in text-associated graphs to guarantee the topical coherence in the communities so that users in the same community are closely linked to each other and share common latent topics. We handle topic modeling and community discovery in the same framework. In our model we separate the concepts of community and topic, so one community can correspond to multiple topics and multiple communities can share the same topic. We compare different methods and perform extensive experiments on two real datasets. The results confirm our hypothesis that topics could help understand community structure, while community structure could help model topics.

92 citations


Journal ArticleDOI
TL;DR: In this article, a semi-supervised classification algorithm for data streams with recurring concept drifts and limited labeled data, called REDLLA, is proposed, in which a decision tree is adopted as the classification model.
Abstract: Tracking recurring concept drifts is a significant issue for machine learning and data mining that frequently appears in real-world stream classification problems. It is a challenge for many streaming classification algorithms to learn recurring concepts in a data stream environment with unlabeled data, and this challenge has received little attention from the research community. Motivated by this challenge, this article focuses on the problem of recurring contexts in streaming environments with limited labeled data. We propose a semi-supervised classification algorithm for data streams with REcurring concept Drifts and Limited LAbeled data, called REDLLA, in which a decision tree is adopted as the classification model. When growing the tree, a k-means-based clustering algorithm is applied at the leaves to produce concept clusters, and unlabeled data are labeled with the majority class of their cluster. In view of deviations between historical and new concept clusters, potential concept drifts are distinguished and recurring concepts are maintained. Extensive studies on both synthetic and real-world data confirm the advantages of our REDLLA algorithm over three state-of-the-art online classification algorithms (CVFDT, DWCDS, and CDRDT) and several known online semi-supervised algorithms, even when more than 90% of the data is unlabeled.
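The leaf-level labeling step can be illustrated as follows: cluster the instances reaching a leaf with k-means and assign each unlabeled instance the majority class of the labeled instances in its cluster. This is a hedged sketch of that one step, not the REDLLA algorithm; the data and the number of clusters are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

def label_leaf_data(X, y, k=3):
    """At a tree leaf, cluster the arriving instances and label the unlabeled ones
    (y == -1) with the majority class of labeled instances in the same cluster."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    y = y.copy()
    for c in range(k):
        in_cluster = clusters == c
        labeled = y[in_cluster & (y != -1)]
        if labeled.size:                       # majority class among labeled members
            majority = np.bincount(labeled).argmax()
            y[in_cluster & (y == -1)] = majority
    return y

# Toy leaf data: -1 marks unlabeled instances.
X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.9], [0.5, 0.5], [0.52, 0.48]])
y = np.array([0, -1, 1, -1, -1, -1])
print(label_leaf_data(X, y, k=3))
```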

Journal ArticleDOI
TL;DR: A system for robust and fast people counting under occlusion through multiple cameras using a novel two-stage cascade-of-rejectors method and a fusion method with error tolerance to combine human detection from multiple cameras is built.
Abstract: Reliable and real-time people counting is crucial in many applications. Most previous works can only count moving people from a single camera, which cannot count still people or can fail badly when there is a crowd (i.e., heavy occlusion occurs). In this article, we build a system for robust and fast people counting under occlusion through multiple cameras. To improve the reliability of human detection from a single camera, we use a dimensionality reduction method on the multilevel edge and texture features to handle the large variations in human appearance and poses. To accelerate the detection speed, we propose a novel two-stage cascade-of-rejectors method. To handle the heavy occlusion in crowded scenes, we present a fusion method with error tolerance to combine human detection from multiple cameras. To improve the speed and accuracy of moving people counting, we combine our multiview fusion detection method with particle tracking to count the number of people moving in/out the camera view (“border control”). Extensive experiments and analyses show that our method outperforms state-of-the-art techniques in single- and multicamera datasets for both speed and reliability. We also design a deployed system for fast and reliable people (still or moving) counting by using multiple cameras.

Journal ArticleDOI
TL;DR: A new method called Multiview Metric Learning with Global consistency and Local smoothness (MVML-GL) under a semisupervised learning setting, which jointly considers global consistency and local smoothness is proposed.
Abstract: In many real-world applications, the same object may have different observations (or descriptions) from multiview observation spaces, which are highly related but sometimes look different from each other. Conventional metric-learning methods achieve satisfactory performance on distance metric computation of data in a single-view observation space, but fail to handle well data sampled from multiview observation spaces, especially those with highly nonlinear structure. To tackle this problem, we propose a new method called Multiview Metric Learning with Global consistency and Local smoothness (MVML-GL) under a semisupervised learning setting, which jointly considers global consistency and local smoothness. The basic idea is to reveal the shared latent feature space of the multiview observations by embodying global consistency constraints and preserving local geometric structures. Specifically, this framework is composed of two main steps. In the first step, we seek a global consistent shared latent feature space, which not only preserves the local geometric structure in each space but also makes those labeled corresponding instances as close as possible. In the second step, the explicit mapping functions between the input spaces and the shared latent space are learned via regularized locally linear regression. Furthermore, these two steps both can be solved by convex optimizations in closed form. Experimental results with application to manifold alignment on real-world datasets of pose and facial expression demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: A new approach is proposed that incorporates users’ reply relationships, conversation content, and response immediacy, which capture both explicit and implicit interaction between users, to identify influential users of an online healthcare community; a weighted social network is developed to represent the influence between users.
Abstract: Due to the revolutionary development of Web 2.0 technology, individual users have become major contributors of Web content in online social media. In light of the growing activities, how to measure a user’s influence on other users in online social media becomes increasingly important. This research need is especially urgent in the online healthcare community, since positive influence can be beneficial while negative influence may have a negative impact on other users of the same community. In this article, a research framework is proposed to study user influence within the online healthcare community. We propose a new approach that incorporates users’ reply relationships, conversation content, and response immediacy, which capture both explicit and implicit interaction between users, to identify influential users of the online healthcare community. A weighted social network is developed to represent the influence between users. We tested our proposed techniques thoroughly on two medical support forums. Two algorithms, UserRank and weighted in-degree, are benchmarked against PageRank and in-degree. Experimental results demonstrate the validity and effectiveness of our proposed approaches.
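One way to realize ranking over such a weighted reply network is to run PageRank with edge weights and compare it against weighted in-degree, roughly in the spirit of the benchmarks mentioned above. The sketch below uses networkx with a hypothetical toy forum; the edge weights stand in for the combined reply/content/immediacy signal and are not the paper's weighting scheme:

```python
import networkx as nx

# Weighted, directed reply network: an edge u -> v means user u replied to user v,
# so influence accumulates at the replied-to user; weights are illustrative.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("alice", "bob", 0.9),
    ("carol", "bob", 0.6),
    ("bob", "dave", 0.3),
    ("carol", "dave", 0.2),
])

# PageRank over the weighted graph (a stand-in for the UserRank idea) vs. weighted in-degree.
influence = nx.pagerank(G, weight="weight")
in_degree = dict(G.in_degree(weight="weight"))
print(sorted(influence, key=influence.get, reverse=True))
print(in_degree)
```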

Journal ArticleDOI
TL;DR: In this article, the authors describe automatic methods and interactive visualizations that are tightly coupled with the goal to enable users to detect interesting portions of text document streams, which are derived from the sentiment, temporal density, and context coherence that comments about features for different targets (e.g., persons, institutions, product attributes, topics, etc.) have.
Abstract: This article describes automatic methods and interactive visualizations that are tightly coupled with the goal to enable users to detect interesting portions of text document streams. In this scenario the interestingness is derived from the sentiment, temporal density, and context coherence that comments about features for different targets (e.g., persons, institutions, product attributes, topics, etc.) have. Contributions are made at different stages of the visual analytics pipeline, including novel ways to visualize salient temporal accumulations for further exploration. Moreover, based on the visualization, an automatic algorithm aims to detect and preselect interesting time interval patterns for different features in order to guide analysts. The main target group for the suggested methods are business analysts who want to explore time-stamped customer feedback to detect critical issues. Finally, application case studies on two different datasets and scenarios are conducted and an extensive evaluation is provided for the presented intelligent visual interface for feature-based sentiment exploration over time.

Journal ArticleDOI
TL;DR: In this paper, a stochastic model is used to predict the popularity of news stories in Digg by distinguishing stories primarily of interest to users in the network from those of more general interest to the user community.
Abstract: The popularity of content in social media is unequally distributed, with some items receiving a disproportionate share of attention from users. Predicting which newly-submitted items will become popular is critically important for both the hosts of social media content and its consumers. Accurate and timely prediction would enable hosts to maximize revenue through differential pricing for access to content or ad placement. Prediction would also give consumers an important tool for filtering the content. Predicting the popularity of content in social media is challenging due to the complex interactions between content quality and how the social media site highlights its content. Moreover, most social media sites selectively present content that has been highly rated by similar users, whose similarity is indicated implicitly by their behavior or explicitly by links in a social network. While these factors make it difficult to predict popularity a priori, stochastic models of user behavior on these sites can allow predicting popularity based on early user reactions to new content. By incorporating the various mechanisms through which web sites display content, such models improve on predictions that are based on simply extrapolating from the early votes. Specifically, for one such site, the news aggregator Digg, we show how a stochastic model distinguishes the effect of the increased visibility due to the network from how interested users are in the content. We find a wide range of interest, distinguishing stories primarily of interest to users in the network (“niche interests”) from those of more general interest to the user community. This distinction is useful for predicting a story’s eventual popularity from users’ early reactions to the story.

Journal ArticleDOI
TL;DR: An effective appearance model based on sparse coding is proposed and applied in visual tracking to make the tracker more robust to partial occlusion, camouflage environments, pose changes, and illumination changes.
Abstract: Intelligent video surveillance is currently one of the most active research topics in computer vision, especially when facing the explosion of video data captured by a large number of surveillance cameras. As a key step of an intelligent surveillance system, robust visual tracking is very challenging for computer vision. However, it is a basic functionality of the human visual system (HVS). Psychophysical findings have shown that the receptive fields of simple cells in the visual cortex can be characterized as being spatially localized, oriented, and bandpass, and it forms a sparse, distributed representation of natural images. In this article, motivated by these findings, we propose an effective appearance model based on sparse coding and apply it in visual tracking. Specifically, we consider the responses of general basis functions extracted by independent component analysis on a large set of natural image patches as features and model the appearance of the tracked target as the probability distribution of these features. In order to make the tracker more robust to partial occlusion, camouflage environments, pose changes, and illumination changes, we further select features that are related to the target based on an entropy-gain criterion and ignore those that are not. The target is finally represented by the probability distribution of those related features. The target search is performed by minimizing the Matusita distance between the distributions of the target model and a candidate using Newton-style iterations. The experimental results validate that the proposed method is more robust and effective than three state-of-the-art methods.
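The target search relies on the Matusita distance between the target model's feature distribution and a candidate's. As a quick reference, the distance between two discrete distributions p and q is sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2); the snippet below computes it on toy histograms, not on the paper's sparse-coding features:

```python
import numpy as np

def matusita_distance(p, q):
    """Matusita distance between two discrete feature distributions p and q
    (both nonnegative and summing to 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Target-model vs. candidate-window feature histograms (toy, 4 bins each).
target = np.array([0.4, 0.3, 0.2, 0.1])
candidate = np.array([0.35, 0.30, 0.25, 0.10])
print(matusita_distance(target, candidate))
```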

Journal ArticleDOI
TL;DR: The present adaptive search engine allows for the efficient community creation and updating of social media indexes, which is able to instill and propagate deep knowledge into social media concerning the advanced search and usage of media resources.
Abstract: Effective sharing of diverse social media is often inhibited by limitations in their search and discovery mechanisms, which are particularly restrictive for media that do not lend themselves to automatic processing or indexing. Here, we present the structure and mechanism of an adaptive search engine which is designed to overcome such limitations. The basic framework of the adaptive search engine is to capture human judgment in the course of normal usage from user queries in order to develop semantic indexes which link search terms to media objects semantics. This approach is particularly effective for the retrieval of multimedia objects, such as images, sounds, and videos, where a direct analysis of the object features does not allow them to be linked to search terms, for example, nontextual/icon-based search, deep semantic search, or when search terms are unknown at the time the media repository is built. An adaptive search architecture is presented to enable the index to evolve with respect to user feedback, while a randomized query-processing technique guarantees avoiding local minima and allows the meaningful indexing of new media objects and new terms. The present adaptive search engine allows for the efficient community creation and updating of social media indexes, which is able to instill and propagate deep knowledge into social media concerning the advanced search and usage of media resources. Experiments with various relevance distribution settings have shown efficient convergence of such indexes, which enable intelligent search and sharing of social media resources that are otherwise hard to discover.

Journal ArticleDOI
TL;DR: This work presents and adopts a new technique that can be used to evaluate the usefulness of folksonomies for navigation and sheds new light on the properties and characteristics of state-of-the-art folksonomy induction algorithms.
Abstract: Algorithms for constructing hierarchical structures from user-generated metadata have caught the interest of the academic community in recent years. In social tagging systems, the output of these algorithms is usually referred to as folksonomies (from folk-generated taxonomies). Evaluation of folksonomies and folksonomy induction algorithms is a challenging issue complicated by the lack of golden standards, lack of comprehensive methods and tools as well as a lack of research and empirical/simulation studies applying these methods. In this article, we report results from a broad comparative study of state-of-the-art folksonomy induction algorithms that we have applied and evaluated in the context of five social tagging systems. In addition to adopting semantic evaluation techniques, we present and adopt a new technique that can be used to evaluate the usefulness of folksonomies for navigation. Our work sheds new light on the properties and characteristics of state-of-the-art folksonomy induction algorithms and introduces a new pragmatic approach to folksonomy evaluation, while at the same time identifying some important limitations and challenges of folksonomy evaluation. Our results show that folksonomy induction algorithms specifically developed to capture intuitions of social tagging systems outperform traditional hierarchical clustering techniques. To the best of our knowledge, this work represents the largest and most comprehensive evaluation study of state-of-the-art folksonomy induction algorithms to date.

Journal ArticleDOI
TL;DR: An efficient algorithm to optimize the objective function with a bounded approximation rate is presented and to scale to real large networks, a parallel implementation of the algorithm is developed.
Abstract: We study a novel problem of batch mode active learning for networked data. In this problem, data instances are connected with links and their labels are correlated with each other, and the goal of batch mode active learning is to exploit the link-based dependencies and node-specific content information to actively select a batch of instances to query the user for learning an accurate model to label unknown instances in the network. We present three criteria (i.e., minimum redundancy, maximum uncertainty, and maximum impact) to quantify the informativeness of a set of instances, and formalize the batch mode active learning problem as selecting a set of instances by maximizing an objective function which combines both link and content information. As solving the objective function is NP-hard, we present an efficient algorithm to optimize the objective function with a bounded approximation rate. To scale to real large networks, we develop a parallel implementation of the algorithm. Experimental results on both synthetic datasets and real-world datasets demonstrate the effectiveness and efficiency of our approach.
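A common way to approximately optimize such a combined objective is greedy selection: repeatedly add the node whose uncertainty, minus a redundancy penalty toward already-selected neighbors, is largest. The sketch below is one such greedy stand-in under those assumptions, not the paper's bounded-approximation or parallel algorithm:

```python
import numpy as np

def greedy_batch(candidates, uncertainty, adjacency, batch_size, lam=0.5):
    """Greedily pick a batch trading off node uncertainty against link-based redundancy."""
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < batch_size:
        def gain(i):
            redundancy = sum(adjacency[i, j] for j in selected)
            return uncertainty[i] - lam * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy network of 5 nodes with random uncertainty scores.
rng = np.random.default_rng(2)
A = (rng.random((5, 5)) > 0.6).astype(float)
print(greedy_batch(range(5), rng.random(5), A, batch_size=2))
```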

Journal ArticleDOI
TL;DR: This article proposes a novel algorithm for advertising keywords recommendation for short-text Web pages by leveraging the contents of Wikipedia, a user-contributed online encyclopedia, and proposes to use a content-biased PageRank on the Wikipedia graph to rank the related entities.
Abstract: Advertising keywords recommendation is an indispensable component for online advertising with the keywords selected from the target Web pages used for contextual advertising or sponsored search. Several ranking-based algorithms have been proposed for recommending advertising keywords. However, for most of them performance is still lacking, especially when dealing with short-text target Web pages, that is, those containing insufficient textual information for ranking. In some cases, short-text Web pages may not even contain enough keywords for selection. A natural alternative is then to recommend relevant keywords not present in the target Web pages. In this article, we propose a novel algorithm for advertising keywords recommendation for short-text Web pages by leveraging the contents of Wikipedia, a user-contributed online encyclopedia. Wikipedia contains numerous entities with related entities on a topic linked to each other. Given a target Web page, we propose to use a content-biased PageRank on the Wikipedia graph to rank the related entities. Furthermore, in order to recommend high-quality advertising keywords, we also add an advertisement-biased factor into our model. With these two biases, advertising keywords that are both relevant to a target Web page and valuable for advertising are recommended. In our experiments, several state-of-the-art approaches for keyword recommendation are compared. The experimental results demonstrate that our proposed approach produces substantial improvement in the precision of the top 20 recommended keywords on short-text Web pages over existing approaches.
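Content-biased PageRank can be thought of as personalized PageRank in which the restart distribution is weighted by each entity's textual similarity to the target page; an advertisement bias could be folded in the same way by up-weighting commercially valuable entities. A minimal networkx sketch on a made-up slice of the Wikipedia link graph, with similarity scores invented for illustration:

```python
import networkx as nx

# Tiny illustrative slice of the Wikipedia link graph around a hypothetical target page.
G = nx.DiGraph([
    ("Running shoe", "Sneaker"), ("Sneaker", "Running shoe"),
    ("Running shoe", "Marathon"), ("Marathon", "Athletics"),
    ("Sneaker", "Footwear"), ("Footwear", "Leather"),
])

# Restart probabilities proportional to each entity's similarity to the target page's text.
content_similarity = {"Running shoe": 0.9, "Sneaker": 0.7, "Marathon": 0.3,
                      "Athletics": 0.1, "Footwear": 0.4, "Leather": 0.05}
ranks = nx.pagerank(G, personalization=content_similarity)
print(sorted(ranks, key=ranks.get, reverse=True)[:3])
```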

Journal ArticleDOI
TL;DR: The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems, so an ensemble of Hoeffding trees that are each limited to a small subset of attributes is proposed.
Abstract: The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this article, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons’ learning rate using a change detection method for data streams, which is also used to reset ensemble members (i.e., Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.
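The combination step can be sketched as follows: convert each member's class-probability estimate into log-odds, feed them through a per-class sigmoid perceptron, and update the perceptron weights online. This is a simplified stand-in (squared-error gradient step, fixed learning rate) rather than the exact scheme described above:

```python
import numpy as np

def combine_log_odds(member_probs, weights, bias):
    """Combine ensemble members' class-probability estimates via a sigmoid perceptron.

    member_probs : shape (n_members,), each member's estimated P(class | x)
    weights, bias: perceptron parameters for this class (one perceptron per class)
    """
    eps = 1e-6
    p = np.clip(member_probs, eps, 1 - eps)
    log_odds = np.log(p / (1 - p))
    return 1.0 / (1.0 + np.exp(-(weights @ log_odds + bias)))

def perceptron_update(member_probs, weights, bias, target, lr=0.05):
    """One stochastic gradient step on the squared error (illustrative update rule)."""
    eps = 1e-6
    p_in = np.clip(member_probs, eps, 1 - eps)
    log_odds = np.log(p_in / (1 - p_in))
    p = combine_log_odds(member_probs, weights, bias)
    grad = (p - target) * p * (1 - p)
    return weights - lr * grad * log_odds, bias - lr * grad

# Toy usage: three members' estimates for one class, trained toward a positive label.
w, b = np.zeros(3), 0.0
probs = np.array([0.9, 0.7, 0.4])
for _ in range(50):
    w, b = perceptron_update(probs, w, b, target=1.0)
print(combine_log_odds(probs, w, b))
```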

Journal ArticleDOI
TL;DR: This article considers a transfer-learning setting in which some related source tasks with labeled data are available to help the learning of the target task, and proposes a semi-supervised extension of TML called STML to further improve the generalization performance by exploiting the unlabeled data based on the manifold assumption.
Abstract: Distance metric learning plays a very crucial role in many data mining algorithms because the performance of an algorithm relies heavily on choosing a good metric. However, the labeled data available in many applications is scarce, and hence the metrics learned are often unsatisfactory. In this article, we consider a transfer-learning setting in which some related source tasks with labeled data are available to help the learning of the target task. We first propose a convex formulation for multitask metric learning by modeling the task relationships in the form of a task covariance matrix. Then we regard transfer learning as a special case of multitask learning and adapt the formulation of multitask metric learning to the transfer-learning setting for our method, called transfer metric learning (TML). In TML, we learn the metric and the task covariances between the source tasks and the target task under a unified convex formulation. To solve the convex optimization problem, we use an alternating method in which each subproblem has an efficient solution. Moreover, in many applications, some unlabeled data is also available in the target task, and so we propose a semi-supervised extension of TML called STML to further improve the generalization performance by exploiting the unlabeled data based on the manifold assumption. Experimental results on some commonly used transfer-learning applications demonstrate the effectiveness of our method.

Journal ArticleDOI
TL;DR: The contribution of the new novelty measures to estimating blog-post popularity is demonstrated by predicting the number of comments expected for a fresh post, and it is further shown how novelty-based measures can be utilized for predicting the citation volume of academic papers.
Abstract: This work deals with the task of predicting the popularity of user-generated content. We demonstrate how the novelty of newly published content plays an important role in affecting its popularity. More specifically, we study three dimensions of novelty. The first one, termed contemporaneous novelty, models the relative novelty embedded in a new post with respect to contemporary content that was generated by others. The second type of novelty, termed self novelty, models the relative novelty with respect to the user’s own contribution history. The third type of novelty, termed discussion novelty, relates to the novelty of the comments associated by readers with respect to the post content. We demonstrate the contribution of the new novelty measures to estimating blog-post popularity by predicting the number of comments expected for a fresh post. We further demonstrate how novelty based measures can be utilized for predicting the citation volume of academic papers.
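One simple instantiation of contemporaneous novelty (an illustrative proxy, not the paper's measure) is one minus the maximum TF-IDF cosine similarity between a fresh post and content published by others around the same time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def contemporaneous_novelty(new_post, contemporary_posts):
    """Novelty proxy: 1 minus the maximum cosine similarity between a fresh post
    and contemporary content (self or discussion novelty would swap the reference set)."""
    vec = TfidfVectorizer().fit(contemporary_posts + [new_post])
    new_v = vec.transform([new_post])
    others = vec.transform(contemporary_posts)
    return 1.0 - float(cosine_similarity(new_v, others).max())

# Hypothetical contemporary posts.
recent = ["election results announced tonight", "team wins championship final"]
print(contemporaneous_novelty("new phone released with foldable screen", recent))
```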

Journal ArticleDOI
TL;DR: The most important retrieval tasks related to comments, namely filtering, ranking, and summarization are identified, and two paradigms according to which comments are utilized and which are designated as comment-targeting and comment-exploiting are distinguished.
Abstract: This article studies information retrieval tasks related to Web comments. Prerequisite of such a study and a main contribution of the article is a unifying survey of the research field. We identify the most important retrieval tasks related to comments, namely filtering, ranking, and summarization. Within these tasks, we distinguish two paradigms according to which comments are utilized and which we designate as comment-targeting and comment-exploiting. Within the first paradigm, the comments themselves form the retrieval targets. Within the second paradigm, the commented items form the retrieval targets (i.e., comments are used as an additional information source to improve the retrieval performance for the commented items). We report on four case studies to demonstrate the exploration of the commentsphere under information retrieval aspects: comment filtering, comment ranking, comment summarization and cross-media retrieval. The first three studies deal primarily with comment-targeting retrieval, while the last one deals with comment-exploiting retrieval. Throughout the article, connections to information retrieval research are pointed out.

Journal ArticleDOI
TL;DR: A visual analytics system for news streams which can bring multiple attributes of the news articles and the macro/micro relations between news streams and keywords into one coherent analytical context, all the while conveying the dynamic nature of news streams.
Abstract: Keyword-based searching and clustering of news articles have been widely used for news analysis. However, news articles usually have other attributes such as source, author, date and time, length, and sentiment which should be taken into account. In addition, news articles and keywords have complicated macro/micro relations, which include relations between news articles (i.e., macro relation), relations between keywords (i.e., micro relation), and relations between news articles and keywords (i.e., macro-micro relation). These macro/micro relations are time varying and pose special challenges for news analysis. In this article, we present a visual analytics system for news streams which can bring multiple attributes of the news articles and the macro/micro relations between news streams and keywords into one coherent analytical context, all the while conveying the dynamic nature of news streams. We introduce a new visualization primitive called TextWheel which consists of one or multiple keyword wheels, a document transportation belt, and a dynamic system which connects the wheels and belt. By observing the TextWheel and its content changes, some interesting patterns can be detected. We use our system to analyze several news corpora related to some major companies and the results demonstrate the high potential of our method.

Journal ArticleDOI
TL;DR: DClusterE integrates cluster validation with user interactions and offers rich visualization tools for users to examine document clustering results from multiple perspectives, and provides not only different aspects of document inter/intra-clustering structures, but also the corresponding relationship between clustering results and the ground truth.
Abstract: Over the last decade, document clustering, as one of the key tasks in information organization and navigation, has been widely studied. Many algorithms have been developed for addressing various challenges in document clustering and for improving clustering performance. However, relatively few research efforts have been reported on evaluating and understanding document clustering results. In this article, we present DClusterE, a comprehensive and effective framework for document clustering evaluation and understanding using information visualization. DClusterE integrates cluster validation with user interactions and offers rich visualization tools for users to examine document clustering results from multiple perspectives. In particular, through informative views including force-directed layout view, matrix view, and cluster view, DClusterE provides not only different aspects of document inter/intra-clustering structures, but also the corresponding relationship between clustering results and the ground truth. Additionally, DClusterE supports general user interactions such as zoom in/out, browsing, and interactive access of the documents at different levels. Two new techniques are proposed to implement DClusterE: (1) A novel multiplicative update algorithm (MUA) for matrix reordering to generate narrow-banded (or clustered) nonzero patterns from documents. Combined with coarse seriation, MUA is able to provide better visualization of the cluster structures. (2) A Mallows-distance-based algorithm for establishing the relationship between the clustering results and the ground truth, which serves as the basis for coloring schemes. Experiments and user studies are conducted to demonstrate the effectiveness and efficiency of DClusterE.

Journal ArticleDOI
David Carmel, Erel Uziel, Ido Guy, Yosi Mass, Haggai Roitman
TL;DR: This work presents a folksonomy-based term extraction method, called tag-boost, which boosts terms that are frequently used by the public to tag content, which can be effectively applied even in nontagged domains, by using an external rich folksonomy borrowed from a well-tagged domain.
Abstract: In this work we study the task of term extraction for word cloud generation in sparsely tagged domains, in which manual tags are scarce. We present a folksonomy-based term extraction method, called tag-boost, which boosts terms that are frequently used by the public to tag content. Our experiments with tag-boost based term extraction over different domains demonstrate tremendous improvement in word cloud quality, as reflected by the agreement between manual tags of the testing items and the cloud’s terms extracted from the items’ content. Moreover, our results demonstrate the high robustness of this approach, as compared to alternative cloud generation methods that exhibit a high sensitivity to data sparseness. Additionally, we show that tag-boost can be effectively applied even in nontagged domains, by using an external rich folksonomy borrowed from a well-tagged domain.
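The core idea can be sketched as boosting a term's content-based score by how often the public uses it as a tag in a folksonomy, possibly borrowed from a better-tagged domain. The multiplicative boost and the miniature tag statistics below are illustrative assumptions, not the paper's formula:

```python
from collections import Counter

def tag_boost_scores(doc_tokens, tag_frequency, beta=1.0):
    """Score candidate word-cloud terms from an item's content, boosting terms that the
    public commonly uses as tags. `tag_frequency` maps a term to its normalized tag
    frequency in [0, 1], possibly taken from an external, well-tagged domain."""
    tf = Counter(doc_tokens)
    return {t: count * (1.0 + beta * tag_frequency.get(t, 0.0)) for t, count in tf.items()}

# Hypothetical folksonomy statistics borrowed from a well-tagged domain.
tag_frequency = {"jazz": 0.8, "guitar": 0.6, "review": 0.1}
tokens = "great jazz guitar solo , a jazz classic".split()
scores = tag_boost_scores(tokens, tag_frequency)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```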

Journal ArticleDOI
TL;DR: To study fresh impact craters, dust devil tracks, and dark slope streaks on Mars, a new approach to orbital image analysis called dynamic landmarking is introduced, which focuses on the identification and comparison of visually salient features in images.
Abstract: Given the large volume of images being sent back from remote spacecraft, there is a need for automated analysis techniques that can quickly identify interesting features in those images. Feature identification in individual images and automated change detection in multiple images of the same target are valuable for scientific studies and can inform subsequent target selection. We introduce a new approach to orbital image analysis called dynamic landmarking. It focuses on the identification and comparison of visually salient features in images. We have evaluated this approach on images collected by five Mars orbiters. These evaluations were motivated by three scientific goals: to study fresh impact craters, dust devil tracks, and dark slope streaks on Mars. In the process we also detected a different kind of surface change that may indicate seasonally exposed bedforms. These experiences also point the way to how this approach could be used in an onboard setting to analyze and prioritize data as it is collected.

Journal ArticleDOI
TL;DR: A ranking framework for general entity-relationship queries and a position-based Bounded Cumulative Model (BCM) for accurate ranking of query answers are presented and various weighting schemes for further improving the accuracy of BCM are explored.
Abstract: Wikipedia is the largest user-generated knowledge base. We propose a structured query mechanism, entity-relationship query, for searching entities in the Wikipedia corpus by their properties and interrelationships. An entity-relationship query consists of multiple predicates on desired entities. The semantics of each predicate is specified with keywords. Entity-relationship query searches entities directly over text instead of preextracted structured data stores. This characteristic brings two benefits: (1) Query semantics can be intuitively expressed by keywords; (2) It only requires rudimentary entity annotation, which is simpler than explicitly extracting and reasoning about complex semantic information before query-time. We present a ranking framework for general entity-relationship queries and a position-based Bounded Cumulative Model (BCM) for accurate ranking of query answers. We also explore various weighting schemes for further improving the accuracy of BCM. We test our ideas on a 2008 version of Wikipedia using a collection of 45 queries pooled from INEX entity ranking track and our own crafted queries. Experiments show that the ranking and weighting schemes are both effective, particularly on multipredicate queries.