Showing papers in "ACM Transactions on Intelligent Systems and Technology in 2011"
TL;DR: Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
TL;DR: This article performs two types of travel recommendations by mining multiple users' GPS traces, including a generic one that recommends a user with top interesting locations and travel sequences in a given geospatial region and a personalized recommendation that provides an individual with locations matching her travel preferences.
Abstract: The advance of GPS-enabled devices allows people to record their location histories with GPS traces, which imply human behaviors and preferences related to travel. In this article, we perform two types of travel recommendations by mining multiple users' GPS traces. The first is a generic one that recommends a user with top interesting locations and travel sequences in a given geospatial region. The second is a personalized recommendation that provides an individual with locations matching her travel preferences. To achieve the first recommendation, we model multiple users' location histories with a tree-based hierarchical graph (TBHG). Based on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based model to infer the interest level of a location and a user's travel experience (knowledge). In the personalized recommendation, we first understand the correlation between locations, and then incorporate this correlation into a collaborative filtering (CF)-based model, which predicts a user's interests in an unvisited location based on her locations histories and that of others. We evaluated our system based on a real-world GPS trace dataset collected by 107 users over a period of one year. As a result, our HITS-based inference model outperformed baseline approaches like rank-by-count and rank-by-frequency. Meanwhile, we achieved a better performance in recommending travel sequences beyond baselines like rank-by-count. Regarding the personalized recommendation, our approach is more effective than the weighted Slope One algorithm with a slightly additional computation, and is more efficient than the Pearson correlation-based CF model with the similar effectiveness.
TL;DR: The main contribution of this article is to provide a state-of-the-art overview of current techniques while providing a critical perspective on business applications of social network analysis and mining.
Abstract: Social network analysis has gained significant attention in recent years, largely due to the success of online social networking and media-sharing sites, and the consequent availability of a wealth of social network data. In spite of the growing interest, however, there is little understanding of the potential business applications of mining social networks. While there is a large body of research on different problems and methods for social network mining, there is a gap between the techniques developed by the research community and their deployment in real-world applications. Therefore the potential business impact of these techniques is still largely unexplored.In this article we use a business process classification framework to put the research topics in a business context and provide an overview of what we consider key problems and techniques in social network analysis and mining from the perspective of business applications. In particular, we discuss data acquisition and preparation, trust, expertise, community structure, network dynamics, and information propagation. In each case we present a brief overview of the problem, describe state-of-the art approaches, discuss business application examples, and map each of the topics to a business process classification framework. In addition, we provide insights on prospective business applications, challenges, and future research directions. The main contribution of this article is to provide a state-of-the-art overview of current techniques while providing a critical perspective on business applications of social network analysis and mining.
TL;DR: An unsupervised methodology based on two differing probabilistic topic models is developed and applied to the daily life of 97 mobile phone users over a 16-month period to achieve the discovery and analysis of human routines that characterize both individual and group behaviors in terms of location patterns.
Abstract: In this work, we discover the daily location-driven routines that are contained in a massive real-life human dataset collected by mobile phones. Our goal is the discovery and analysis of human routines that characterize both individual and group behaviors in terms of location patterns. We develop an unsupervised methodology based on two differing probabilistic topic models and apply them to the daily life of 97 mobile phone users over a 16-month period to achieve these goals. Topic models are probabilistic generative models for documents that identify the latent structure that underlies a set of words. Routines dominating the entire group's activities, identified with a methodology based on the Latent Dirichlet Allocation topic model, include “going to work late”, “going home early”, “working nonstop” and “having no reception (phone off)” at different times over varying time-intervals. We also detect routines which are characteristic of users, with a methodology based on the Author-Topic model. With the routines discovered, and the two methods of characterizing days and users, we can then perform various tasks. We use the routines discovered to determine behavioral patterns of users and groups of users. For example, we can find individuals that display specific daily routines, such as “going to work early” or “turning off the mobile (or having no reception) in the evenings”. We are also able to characterize daily patterns by determining the topic structure of days in addition to determining whether certain routines occur dominantly on weekends or weekdays. Furthermore, the routines discovered can be used to rank users or find subgroups of users who display certain routines. We can also characterize users based on their entropy. We compare our method to one based on clustering using K-means. Finally, we analyze an individual's routines over time to determine regions with high variations, which may correspond to specific events.
TL;DR: This article exploits the problem of annotating a large-scale image corpus by label propagation over noisily tagged web images by proposing a novel kNN-sparse graph-based semi-supervised learning approach for harnessing the labeled and unlabeled data simultaneously.
Abstract: In this article, we exploit the problem of annotating a large-scale image corpus by label propagation over noisily tagged web images. To annotate the images more accurately, we propose a novel kNN-sparse graph-based semi-supervised learning approach for harnessing the labeled and unlabeled data simultaneously. The sparse graph constructed by datum-wise one-vs-kNN sparse reconstructions of all samples can remove most of the semantically unrelated links among the data, and thus it is more robust and discriminative than the conventional graphs. Meanwhile, we apply the approximate k nearest neighbors to accelerate the sparse graph construction without loosing its effectiveness. More importantly, we propose an effective training label refinement strategy within this graph-based learning framework to handle the noise in the training labels, by bringing in a dual regularization for both the quantity and sparsity of the noise. We conduct extensive experiments on a real-world image database consisting of 55,615 Flickr images and noisily tagged training labels. The results demonstrate both the effectiveness and efficiency of the proposed approach and its capability to deal with the noise in the training labels.
TL;DR: The models and technical advances required to satisfy the competing needs of preference modeling and elicitation, constraint reasoning, and machine learning are described and a multifaceted evaluation of the perceived usefulness of the PTIME system is reported.
Abstract: In a world of electronic calendars, the prospect of intelligent, personalized time management assistance seems a plausible and desirable application of AI. PTIME (Personalized Time Management) is a learning cognitive assistant agent that helps users handle email meeting requests, reserve venues, and schedule events. PTIME is designed to unobtrusively learn scheduling preferences, adapting to its user over time. The agent allows its user to flexibly express requirements for new meetings, as they would to an assistant. It interfaces with commercial enterprise calendaring platforms, and it operates seamlessly with users who do not have PTIME. This article overviews the system design and describes the models and technical advances required to satisfy the competing needs of preference modeling and elicitation, constraint reasoning, and machine learning. We further report on a multifaceted evaluation of the perceived usefulness of the system.
TL;DR: A survey on the efforts of leveraging active learning in multimedia annotation and retrieval, including semi-supervised learning, multilabel learning and multiple instance learning, focuses on two application domains: image/video annotation and content-based image retrieval.
Abstract: Active learning is a machine learning technique that selects the most informative samples for labeling and uses them as training data. It has been widely explored in multimedia research community for its capability of reducing human annotation effort. In this article, we provide a survey on the efforts of leveraging active learning in multimedia annotation and retrieval. We mainly focus on two application domains: image/video annotation and content-based image retrieval. We first briefly introduce the principle of active learning and then we analyze the sample selection criteria. We categorize the existing sample selection strategies used in multimedia annotation and retrieval into five criteria: risk reduction, uncertainty, diversity, density and relevance. We then introduce several classification models used in active learning-based multimedia annotation and retrieval, including semi-supervised learning, multilabel learning and multiple instance learning. We also provide a discussion on several future trends in this research direction. In particular, we discuss cost analysis of human annotation and large-scale interactive multimedia annotation.
TL;DR: This article develops a real-time system for gathering URL features and is able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99p accuracy over a balanced dataset.
TL;DR: Data placement, pipeline processing, word bundling, and priority-based scheduling are proposed to improve scalability of LDA and significantly reduce the unparallelizable communication bottleneck and achieve good load balancing.
Abstract: Previous methods of distributed Gibbs sampling for LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies significantly reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve scalability of LDA.
TL;DR: A comprehensive set of performance metrics and visualisations for continuous activity recognition (AR) and shows that where event- and frame-based precision and recall lead to an ambiguous interpretation of results in some cases, the proposed metrics provide a consistently unambiguous explanation.
Abstract: In this article, we introduce and evaluate a comprehensive set of performance metrics and visualisations for continuous activity recognition (AR). We demonstrate how standard evaluation methods, often borrowed from related pattern recognition problems, fail to capture common artefacts found in continuous AR—specifically event fragmentation, event merging and timing offsets. We support our assertion with an analysis on a set of recently published AR papers. Building on an earlier initial work on the topic, we develop a frame-based visualisation and corresponding set of class-skew invariant metrics for the one class versus all evaluation. These are complemented by a new complete set of event-based metrics that allow a quick graphical representation of system performance—showing events that are correct, inserted, deleted, fragmented, merged and those which are both fragmented and merged. We evaluate the utility of our approach through comparison with standard metrics on data from three different published experiments. This shows that where event- and frame-based precision and recall lead to an ambiguous interpretation of results in some cases, the proposed metrics provide a consistently unambiguous explanation.
TL;DR: Algorithm which efficiently find provably near-optimal solutions to large, complex information gathering problems, including environmental monitoring using robotic sensors, activity recognition using a built sensing chair, a sensor placement challenge, and deciding which blogs to read on the Web are described.
Abstract: Where should we place sensors to efficiently monitor natural drinking water resources for contamination? Which blogs should we read to learn about the biggest stories on the Web? These problems share a fundamental challenge: How can we obtain the most useful information about the state of the world, at minimum cost?Such information gathering, or active learning, problems are typically NP-hard, and were commonly addressed using heuristics without theoretical guarantees about the solution quality. In this article, we describe algorithms which efficiently find provably near-optimal solutions to large, complex information gathering problems. Our algorithms exploit submodularity, an intuitive notion of diminishing returns common to many sensing problems: the more sensors we have already deployed, the less we learn by placing another sensor. In addition to identifying the most informative sensing locations, our algorithms can handle more challenging settings, where sensors need to be able to reliably communicate over lossy links, where mobile robots are used for collecting data, or where solutions need to be robust against adversaries and sensor failures.We also present results applying our algorithms to several real-world sensing tasks, including environmental monitoring using robotic sensors, activity recognition using a built sensing chair, a sensor placement challenge, and deciding which blogs to read on the Web.
TL;DR: A novel probabilistic factor analysis framework which naturally fuses the users' tastes and their trusted friends' favors together and outperforms state-of-the-art approaches at modeling recommender systems more accurately and realistically.
Abstract: Recommender systems have been well studied and developed, both in academia and in industry recently. However, traditional recommender systems assume that all the users are independent and identically distributed; this assumption ignores the connections among users, which is not consistent with the real-world observations where we always turn to our trusted friends for recommendations. Aiming at modeling recommender systems more accurately and realistically, we propose a novel probabilistic factor analysis framework which naturally fuses the users' tastes and their trusted friends' favors together. The proposed framework is quite general, and it can also be applied to pure user-item rating matrix even if we do not have explicit social trust information among users. In this framework, we coin the term social trust ensemble to represent the formulation of the social trust restrictions on the recommender systems. The complexity analysis indicates that our approach can be applied to very large datasets since it scales linearly with the number of observations, while the experimental results show that our method outperforms state-of-the-art approaches.
TL;DR: A moving object data mining system, MoveMine, which integrates multiple data mining functions, including sophisticated pattern mining and trajectory analysis is introduced, which will benefit scientists and other users to carry out versatile analysis tasks to analyze object movement regularities and anomalies.
Abstract: With the maturity and wide availability of GPS, wireless, telecommunication, and Web technologies, massive amounts of object movement data have been collected from various moving object targets, such as animals, mobile devices, vehicles, and climate radars. Analyzing such data has deep implications in many applications, such as, ecological study, traffic control, mobile communication management, and climatological forecast. In this article, we focus our study on animal movement data analysis and examine advanced data mining methods for discovery of various animal movement patterns. In particular, we introduce a moving object data mining system, MoveMine, which integrates multiple data mining functions, including sophisticated pattern mining and trajectory analysis. In this system, two interesting moving object pattern mining functions are newly developed: (1) periodic behavior mining and (2) swarm pattern mining. For mining periodic behaviors, a reference location-based method is developed, which first detects the reference locations, discovers the periods in complex movements, and then finds periodic patterns by hierarchical clustering. For mining swarm patterns, an efficient method is developed to uncover flexible moving object clusters by relaxing the popularly-enforced collective movement constraints.In the MoveMine system, a set of commonly used moving object mining functions are built and a user-friendly interface is provided to facilitate interactive exploration of moving object data mining and flexible tuning of the mining constraints and parameters. MoveMine has been tested on multiple kinds of real datasets, especially for MoveBank applications and other moving object data analysis. The system will benefit scientists and other users to carry out versatile analysis tasks to analyze object movement regularities and anomalies. Moreover, it will benefit researchers to realize the importance and limitations of current techniques and promote future studies on moving object data mining. As expected, a mastery of animal movement patterns and trends will improve our understanding of the interactions between and the changes of the animal world and the ecosystem and therefore help ensure the sustainability of our ecosystem.
TL;DR: A novel control mechanism that models and controls a system comprising of a green energy supplier operating within the grid and a number of individual homes, and a new carbon-based pricing mechanism that takes advantage of carbon-intensity signals available on the Internet in order to provide real-time pricing.
Abstract: With dwindling nonrenewable energy reserves and the adverse effects of climate change, the development of the smart electricity grid is seen as key to solving global energy security issues and to reducing carbon emissions. In this respect, there is a growing need to integrate renewable (or green) energy sources in the grid. However, the intermittency of these energy sources requires that demand must also be made more responsive to changes in supply, and a number of smart grid technologies are being developed, such as high-capacity batteries and smart meters for the home, to enable consumers to be more responsive to conditions on the grid in real time. Traditional solutions based on these technologies, however, tend to ignore the fact that individual consumers will behave in such a way that best satisfies their own preferences to use or store energy (as opposed to that of the supplier or the grid operator). Hence, in practice, it is unclear how these solutions will cope with large numbers of consumers using their devices in this way. Against this background, in this article, we develop novel control mechanisms based on the use of autonomous agents to better incorporate consumer preferences in managing demand. These agents, residing on consumers' smart meters, can both communicate with the grid and optimize their owner's energy consumption to satisfy their preferences. More specifically, we provide a novel control mechanism that models and controls a system comprising of a green energy supplier operating within the grid and a number of individual homes (each possibly owning a storage device). This control mechanism is based on the concept of homeostasis whereby control signals are sent to individual components of a system, based on their continuous feedback, in order to change their state so that the system may reach a stable equilibrium. Thus, we define a new carbon-based pricing mechanism for this green energy supplier that takes advantage of carbon-intensity signals available on the Internet in order to provide real-time pricing. The pricing scheme is designed in such a way that it can be readily implemented using existing communication technologies and is easily understandable by consumers. Building upon this, we develop new control signals that the supplier can use to incentivize agents to shift demand (using their storage device) to times when green energy is available. Moreover, we show how these signals can be adapted according to changes in supply and to various degrees of penetration of storage in the system. We empirically evaluate our system and show that, when all homes are equipped with storage devices, the supplier can significantly reduce its reliance on other carbon-emitting power sources to cater for its own shortfalls. By so doing, the supplier reduces the carbon emission of the system by up to 25p while the consumer reduces its costs by up to 14.5p. Finally, we demonstrate that our homeostatic control mechanism is not sensitive to small prediction errors and the supplier is incentivized to accurately predict its green production to minimize costs.
TL;DR: In this article, a learning-to-trade algorithm termed CORrelation-driven nonparametric learning strategy (CORN) was proposed for actively trading stocks. But, the performance of CORN was evaluated on several large historical and latest real stock markets, and showed that it can easily beat both the market index and the best stock in the market substantially.
Abstract: Machine learning techniques have been adopted to select portfolios from financial markets in some emerging intelligent business applications. In this article, we propose a novel learning-to-trade algorithm termed CORrelation-driven Nonparametric learning strategy (CORN) for actively trading stocks. CORN effectively exploits statistical relations between stock market windows via a nonparametric learning approach. We evaluate the empirical performance of our algorithm extensively on several large historical and latest real stock markets, and show that it can easily beat both the market index and the best stock in the market substantially (without or with small transaction costs), and also surpass a variety of state-of-the-art techniques significantly.
TL;DR: New methods for inferring colocation and conversation networks from privacy-sensitive audio are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year, and show that networks derived from colocations and conversation inferences are quite different.
Abstract: New technologies have made it possible to collect information about social networks as they are acted and observed in the wild, instead of as they are reported in retrospective surveys. These technologies offer opportunities to address many new research questions: How can meaningful information about social interaction be extracted from automatically recorded raw data on human behaviorq What can we learn about social networks from such fine-grained behavioral dataq And how can all of this be done while protecting privacyq With the goal of addressing these questions, this article presents new methods for inferring colocation and conversation networks from privacy-sensitive audio. These methods are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year. The resulting analysis shows that networks derived from colocation and conversation inferences are quite different. This distinction can inform future research in computational social science, especially work that only measures colocation or employs colocation data as a proxy for conversation networks.
TL;DR: A new approach based on feature mining on the discrete cosine transform (DCT) domain and machine learning for steganalysis of JPEG images is proposed and prominently outperforms the well-known Markov-process based approach.
Abstract: The threat posed by hackers, spies, terrorists, and criminals, etc. using steganography for stealthy communications and other illegal purposes is a serious concern of cyber security. Several steganographic systems that have been developed and made readily available utilize JPEG images as carriers. Due to the popularity of JPEG images on the Internet, effective steganalysis techniques are called for to counter the threat of JPEG steganography. In this article, we propose a new approach based on feature mining on the discrete cosine transform (DCT) domain and machine learning for steganalysis of JPEG images. First, neighboring joint density features on both intra-block and inter-block are extracted from the DCT coefficient array and the absolute array, respectively; then a support vector machine (SVM) is applied to the features for detection. An evolving neural-fuzzy inference system is employed to predict the hiding amount in JPEG steganograms. We also adopt a feature selection method of support vector machine recursive feature elimination to reduce the number of features. Experimental results show that, in detecting several JPEG-based steganographic systems, our method prominently outperforms the well-known Markov-process based approach.
TL;DR: This article presents an interactive image search system, image search by color map, which can be applied to, but not limited to, enhance text-based image search, and proposes a simple but effective scheme to mine the latent search intention from the user’s input, and exploit the dominant color filter strategy to make the system more efficient.
Abstract: The availability of large-scale images from the Internet has made the research on image search attract a lot of attention. Text-based image search engines, for example, Google/Microsoft Bing/Yahoo! image search engines using the surrounding text, have been developed and widely used. However, they suffer from an inability to search image content. In this article, we present an interactive image search system, image search by color map, which can be applied to, but not limited to, enhance text-based image search. This system enables users to indicate how the colors are spatially distributed in the desired images, by scribbling a few color strokes, or dragging an image and highlighting a few regions of interest in an intuitive way. In contrast to the conventional sketch-based image retrieval techniques, our system searches images based on colors rather than shapes, and we, technically, propose a simple but effective scheme to mine the latent search intention from the user’s input, and exploit the dominant color filter strategy to make our system more efficient. We integrate our system to existing Web image search engines to demonstrate its superior performance over text-based image search. The user study shows that our system can indeed help users conveniently find desired images.
TL;DR: This paper proposed an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation in the context of text classification, which uses unlabeled documents from both languages, along with a word translation oracle, to induce a crosslingual representation that enables the transfer of classification knowledge from the source to the target language.
Abstract: Cross-lingual adaptation is a special case of domain adaptation and refers to the transfer of classification knowledge between two languages. In this article we describe an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation in the context of text classification. The proposed method uses unlabeled documents from both languages, along with a word translation oracle, to induce a cross-lingual representation that enables the transfer of classification knowledge from the source to the target language. The main advantages of this method over existing methods are resource efficiency and task specificity.We conduct experiments in the area of cross-language topic and sentiment classification involving English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed method over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 30p (topic classification) and 59p (sentiment classification). We further report on empirical analyses that reveal insights into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and the nature of the induced cross-lingual word correspondences.
TL;DR: An integrated framework on autodetection of subkilometer craters with boosting and transfer learning that can achieve an F1 score above 0.85, a significant improvement over the other crater detection algorithms.
Abstract: Counting craters in remotely sensed images is the only tool that provides relative dating of remote planetary surfaces. Surveying craters requires counting a large amount of small subkilometer craters, which calls for highly efficient automatic crater detection. In this article, we present an integrated framework on autodetection of subkilometer craters with boosting and transfer learning. The framework contains three key components. First, we utilize mathematical morphology to efficiently identify crater candidates, the regions of an image that can potentially contain craters. Only those regions occupying relatively small portions of the original image are the subjects of further processing. Second, we extract and select image texture features, in combination with supervised boosting ensemble learning algorithms, to accurately classify crater candidates into craters and noncraters. Third, we integrate transfer learning into boosting, to enhance detection performance in the regions where surface morphology differs from what is characterized by the training set. Our framework is evaluated on a large test image of 37,500 × 56,250 m2 on Mars, which exhibits a heavily cratered Martian terrain characterized by nonuniform surface morphology. Empirical studies demonstrate that the proposed crater detection framework can achieve an F1 score above 0.85, a significant improvement over the other crater detection algorithms.
TL;DR: This article shows that information from the friendship network can indeed be fruitfully exploited in making affiliation recommendations, and suggests two models of user-community affinity for the purpose of making affiliation recommendation: one based on graph proximity, and another using latent factors to model users and communities.
Abstract: Social network analysis has attracted increasing attention in recent years In many social networks, besides friendship links among users, the phenomenon of users associating themselves with groups or communities is common Thus, two networks exist simultaneously: the friendship network among users, and the affiliation network between users and groups In this article, we tackle the affiliation recommendation problem, where the task is to predict or suggest new affiliations between users and communities, given the current state of the friendship and affiliation networks More generally, affiliations need not be community affiliations---they can be a user’s taste, so affiliation recommendation algorithms have applications beyond community recommendation In this article, we show that information from the friendship network can indeed be fruitfully exploited in making affiliation recommendations Using a simple way of combining these networks, we suggest two models of user-community affinity for the purpose of making affiliation recommendations: one based on graph proximity, and another using latent factors to model users and communities We explore the affiliation recommendation algorithms suggested by these models and evaluate these algorithms on two real-world networks, Orkut and Youtube In doing so, we motivate and propose a way of evaluating recommenders, by measuring how good the top 50 recommendations are for the average user, and demonstrate the importance of choosing the right evaluation strategy The algorithms suggested by the graph proximity model turn out to be the most effective We also introduce scalable versions of these algorithms, and demonstrate their effectiveness This use of link prediction techniques for the purpose of affiliation recommendation is, to our knowledge, novel
TL;DR: This work employs cross-correlation measures, eigenvalue spectrum analysis, and results from random matrix theory to identify spatiotemporal patterns on multiple scales and shows that most significant correlation exists on the time scale of weeks.
Abstract: With the increased availability of rich behavioral datasets, we present a novel application of tools to analyze this information. Using criminal offense records as an example, we employ cross-correlation measures, eigenvalue spectrum analysis, and results from random matrix theory to identify spatiotemporal patterns on multiple scales. With these techniques, we show that most significant correlation exists on the time scale of weeks and identify clusters of neighborhoods whose crime rates are affected simultaneously by external forces.
TL;DR: The experimental results clearly show that the novel context-aware query suggestion approach outperforms three baseline methods in both coverage and quality of suggestions.
Abstract: Query suggestion plays an important role in improving usability of search engines. Although some recently proposed methods provide query suggestions by mining query patterns from search logs, none of them models the immediately preceding queries as context systematically, and uses context information effectively in query suggestions. Context-aware query suggestion is challenging in both modeling context and scaling up query suggestion using context. In this article, we propose a novel context-aware query suggestion approach. To tackle the challenges, our approach consists of two stages. In the first, offline model-learning stage, to address data sparseness, queries are summarized into concepts by clustering a click-through bipartite. A concept sequence suffix tree is then constructed from session data as a context-aware query suggestion model. In the second, online query suggestion stage, a user’s search context is captured by mapping the query sequence submitted by the user to a sequence of concepts. By looking up the context in the concept sequence suffix tree, we suggest to the user context-aware queries. We test our approach on large-scale search logs of a commercial search engine containing 4.0 billion Web queries, 5.9 billion clicks, and 1.87 billion search sessions. The experimental results clearly show that our approach outperforms three baseline methods in both coverage and quality of suggestions.
TL;DR: A new agent for repeated bilateral negotiation that was designed to model and adapt its behavior to the individual traits exhibited by its negotiation partner, showing that adaptation is a viable approach towards the design of computer agents to negotiate with people when there is no prior data of their behavior.
Abstract: The rapid dissemination of technology such as the Internet across geographical and ethnic lines is opening up opportunities for computer agents to negotiate with people of diverse cultural and organizational affiliations. To negotiate proficiently with people in different cultures, agents need to be able to adapt to the way behavioral traits of other participants change over time. This article describes a new agent for repeated bilateral negotiation that was designed to model and adapt its behavior to the individual traits exhibited by its negotiation partner. The agent’s decision-making model combined a social utility function that represented the behavioral traits of the other participant, as well as a rule-based mechanism that used the utility function to make decisions in the negotiation process. The agent was deployed in a strategic setting in which both participants needed to complete their individual tasks by reaching agreements and exchanging resources, the number of negotiation rounds was not fixed in advance and agreements were not binding. The agent negotiated with human subjects in the United States and Lebanon in situations that varied the dependency relationships between participants at the onset of negotiation. There was no prior data available about the way people would respond to different negotiation strategies in these two countries. Results showed that the agent was able to adopt a different negotiation strategy to each country. Its average performance across both countries was equal to that of people. However, the agent outperformed people in the United States, because it learned to make offers that were likely to be accepted by people, while being more beneficial to the agent than to people. In contrast, the agent was outperformed by people in Lebanon, because it adopted a high reliability measure which allowed people to take advantage of it. These results provide insight for human-computer agent designers in the types of multicultural settings that we considered, showing that adaptation is a viable approach towards the design of computer agents to negotiate with people when there is no prior data of their behavior.
TL;DR: The goal is to provide an overview of the exciting opportunities and challenges in developing and applying data mining approaches to provide critical information for forest and land use management.
Abstract: Forests are a critical component of the planet's ecosystem. Unfortunately, there has been significant degradation in forest cover over recent decades as a result of logging, conversion to crop, plantation, and pasture land, or disasters (natural or man made) such as forest fires, floods, and hurricanes. As a result, significant attention is being given to the sustainable use of forests. A key to effective forest management is quantifiable knowledge about changes in forest cover. This requires identification and characterization of changes and the discovery of the relationship between these changes and natural and anthropogenic variables. In this article, we present our preliminary efforts and achievements in addressing some of these tasks along with the challenges and opportunities that need to be addressed in the future. At a higher level, our goal is to provide an overview of the exciting opportunities and challenges in developing and applying data mining approaches to provide critical information for forest and land use management.
TL;DR: This article proposes a reputation model for HeyStaks users that utilise the implicit collaboration events that take place between users as recommendations are made and selected and indicates that incorporating reputation into the recommendation process further improves the relevance of HeyStak recommendations by up to 40%.
Abstract: Although collaborative searching is not supported by mainstream search engines, recent research has highlighted the inherently collaborative nature of many Web search tasks. In this article, we describe HeyStaks, a collaborative Web search framework that is designed to complement mainstream search engines. At search time, HeyStaks learns from the search activities of other users and leverages this information to generate recommendations based on results that others have found relevant for similar searches. The key contribution of this article is to extend the HeyStaks social search model by considering the search expertise, or reputation, of HeyStaks users and using this information to enhance the result recommendation process. In particular, we propose a reputation model for HeyStaks users that utilise the implicit collaboration events that take place between users as recommendations are made and selected. We describe a live-user trial of HeyStaks that demonstrates the relevance of its core recommendations and the ability of the reputation model to further improve recommendation quality. Our findings indicate that incorporating reputation into the recommendation process further improves the relevance of HeyStaks recommendations by up to 40p.
TL;DR: A hybrid tag recommendation system together with a scalable, highly efficient system architecture that is able to utilize user feedback to tune its parameters to specific characteristics of the underlying tagging system and adapt the recommendation models to newly added content.
Abstract: Despite all of the advantages of tags as an easy and flexible information management approach, tagging is a cumbersome task. A set of descriptive tags has to be manually entered by users whenever they post a resource. This process can be simplified by the use of tag recommendation systems. Their objective is to suggest potentially useful tags to the user. We present a hybrid tag recommendation system together with a scalable, highly efficient system architecture. The system is able to utilize user feedback to tune its parameters to specific characteristics of the underlying tagging system and adapt the recommendation models to newly added content. The evaluation of the system on six real-life datasets demonstrated the system’s ability to combine tags from various sources (e.g., resource content or tags previously used by the user) to achieve the best quality of recommended tags. It also confirmed the importance of parameter tuning and content adaptation. A series of additional experiments allowed us to better understand the characteristics of the system and tagging datasets and to determine the potential areas for further system development.
TL;DR: Two algorithms for Neyman-Pearson (NP) classification problem which has been recently shown to be of a particular importance for bipartite ranking problems are described and evaluated.
Abstract: We describe and evaluate two algorithms for Neyman-Pearson (NP) classification problem which has been recently shown to be of a particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on false negatives rate. We investigated batch algorithm based on DC programming and stochastic gradient method well suited for large-scale datasets. Empirical evidences illustrate the potential of the proposed methods.
TL;DR: This work explores different group-profiling strategies to construct descriptions of a group, and shows some potential applications based on group profiling that can assist network navigation, visualization, and analysis, as well as monitoring and tracking the ebbs and tides of different groups in evolving networks.
Abstract: The prolific use of participatory Web and social networking sites is reshaping the ways in which people interact with one another. It has become a vital part of human social life in both the developed and developing world. People sharing certain similarities or affiliates tend to form communities within social media. At the same time, they participate in various online activities: content sharing, tagging, posting status updates, etc. These diverse activities leave behind traces of their social life, providing clues to understand changing social structures. A large body of existing work focuses on extracting cohesive groups based on network topology. But little attention is paid to understanding the changing social structures. In order to help explain the formation of a group, we explore different group-profiling strategies to construct descriptions of a group. This research can assist network navigation, visualization, and analysis, as well as monitoring and tracking the ebbs and tides of different groups in evolving networks. By exploiting information collected from real-world social media sites, extensive experiments are conducted to evaluate group-profiling results. The pros and cons of different group-profiling strategies are analyzed with concrete examples. We also show some potential applications based on group profiling. Interesting findings with discussions are reported.
TL;DR: Three key ingredients of CAMAS---motif mining, association analysis, and dynamic Bayesian network inference---that help bridge the gap between low-level, raw, sensor streams, and the high-level operating regions and features needed for an operator to efficiently manage the data center are demonstrated.
Abstract: Practically every large IT organization hosts data centers---a mix of computing elements, storage systems, networking, power, and cooling infrastructure---operated either in-house or outsourced to major vendors. A significant element of modern data centers is their cooling infrastructure, whose efficient and sustainable operation is a key ingredient to the “always-on” capability of data centers. We describe the design and implementation of CAMAS (Chiller Advisory and MAnagement System), a temporal data mining solution to mine and manage chiller installations. CAMAS embodies a set of algorithms for processing multivariate time-series data and characterizes sustainability measures of the patterns mined. We demonstrate three key ingredients of CAMAS---motif mining, association analysis, and dynamic Bayesian network inference---that help bridge the gap between low-level, raw, sensor streams, and the high-level operating regions and features needed for an operator to efficiently manage the data center. The effectiveness of CAMAS is demonstrated by its application to a real-life production data center managed by HP.