
Showing papers in "ACM Transactions on Intelligent Systems and Technology in 2017"


Journal ArticleDOI
TL;DR: Experimental results showed that the proposed Correlation Matrix kNN (CM-kNN) classification was more accurate and efficient than existing kNN methods in data-mining applications, such as classification, regression, and missing data imputation.
Abstract: The K Nearest Neighbor (kNN) method has been widely used in data mining and machine learning applications due to its simple implementation and distinguished performance. However, setting the same k value for all test data, as previous kNN methods do, has been shown to make these methods impractical in real applications. This article proposes to learn a correlation matrix that reconstructs test data points from the training data so that different k values are assigned to different test data points, referred to as the Correlation Matrix kNN (CM-kNN for short) classification. Specifically, a least-squares loss function is employed to minimize the error of reconstructing each test data point from all training data points. A graph Laplacian regularizer is then used to preserve the local structure of the data in the reconstruction process. Moreover, an ℓ1-norm regularizer and an ℓ2,1-norm regularizer are applied, respectively, to learn different k values for different test data points and to induce sparsity that removes redundant/noisy features from the reconstruction process. Beyond classification tasks, the kNN methods (including our proposed CM-kNN method) are further applied to regression and missing data imputation. We conducted extensive experiments, and the results showed that the proposed method was more accurate and efficient than existing kNN methods in data-mining applications such as classification, regression, and missing data imputation.
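
A minimal sketch (not the authors' implementation) of the reconstruction idea: each test point is reconstructed from the training set with an ℓ1-penalized least-squares fit, and the nonzero coefficients act as a per-test-point neighborhood, i.e., an adaptive k. It assumes scikit-learn and NumPy; the graph-Laplacian and ℓ2,1-norm terms of the full CM-kNN objective are omitted, and the function name cmknn_predict is ours.

```python
# Sketch: adaptive-k neighbor selection by sparse reconstruction (CM-kNN idea, simplified).
# The full method additionally uses a graph-Laplacian regularizer and an l2,1-norm
# term for feature selection, which are omitted here.
import numpy as np
from sklearn.linear_model import Lasso

def cmknn_predict(X_train, y_train, X_test, alpha=0.05):
    """Classify each test point from the training points that reconstruct it."""
    preds = []
    for x in X_test:
        # l1-regularized least squares: x ~= X_train.T @ w, with sparse weights w
        lasso = Lasso(alpha=alpha, positive=True, max_iter=5000)
        lasso.fit(X_train.T, x)
        w = lasso.coef_                      # one weight per training point
        idx = np.flatnonzero(w)              # adaptive neighborhood (k varies per test point)
        if idx.size == 0:                    # fall back to the single nearest neighbor
            idx = [np.argmin(np.linalg.norm(X_train - x, axis=1))]
            w = np.ones(len(X_train))
        labels = y_train[idx]
        weights = w[idx]
        # weighted vote over the selected neighbors
        preds.append(max(set(labels), key=lambda c: weights[labels == c].sum()))
    return np.array(preds)
```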

377 citations


Journal ArticleDOI
TL;DR: TensorBeat is a system that employs channel state information (CSI) phase difference data to intelligently estimate breathing rates for multiple persons with commodity WiFi devices, and it can achieve high accuracy under different environments for multiperson breathing rate monitoring.
Abstract: Breathing signal monitoring can provide important clues for health problems. Compared to existing techniques that require wearable devices and special equipment, a more desirable approach is to provide contact-free and long-term breathing rate monitoring by exploiting wireless signals. In this article, we propose TensorBeat, a system to employ channel state information (CSI) phase difference data to intelligently estimate breathing rates for multiple persons with commodity WiFi devices. The main idea is to leverage the tensor decomposition technique to handle the CSI phase difference data. The proposed TensorBeat scheme first obtains CSI phase difference data between pairs of antennas at the WiFi receiver to create CSI tensors. Then canonical polyadic (CP) decomposition is applied to obtain the desired breathing signals. A stable signal matching algorithm is developed to identify the decomposed signal pairs, and a peak detection method is applied to estimate the breathing rates for multiple persons. Our experimental study shows that TensorBeat can achieve high accuracy under different environments for multiperson breathing rate monitoring.
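
A minimal sketch of the two core steps named above, canonical polyadic (CP) decomposition of a CSI phase-difference tensor followed by peak detection, assuming the tensorly and SciPy libraries; the tensor here is synthetic, and the stable signal matching step of TensorBeat is not implemented.

```python
# Sketch: CP decomposition of a (antenna pairs x subcarriers x time) tensor and
# breathing-rate estimation by peak counting. Synthetic data; real CSI collection
# and multi-person signal matching are omitted.
import numpy as np
from scipy.signal import find_peaks
from tensorly.decomposition import parafac

fs = 10.0                                   # CSI sampling rate (Hz), assumed
t = np.arange(0, 60, 1 / fs)                # one minute of samples
breathing = np.sin(2 * np.pi * 0.25 * t)    # ~15 breaths/min component
# Toy tensor: each (antenna pair, subcarrier) entry is a noisy scaled copy of the signal
tensor = np.random.rand(3, 30, 1) * breathing + 0.1 * np.random.randn(3, 30, t.size)

# CP decomposition; the temporal factors approximate the breathing signals
weights, factors = parafac(tensor, rank=2)
temporal = factors[2]                       # shape (time, rank)

for r in range(temporal.shape[1]):
    sig = temporal[:, r] - temporal[:, r].mean()
    peaks, _ = find_peaks(sig, distance=fs * 2)   # assume breaths are > 2 s apart
    rate_bpm = len(peaks) / (t[-1] / 60.0)
    print(f"component {r}: ~{rate_bpm:.1f} breaths/min")
```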

159 citations


Journal ArticleDOI
TL;DR: A systematic taxonomy for P2P lending is provided by summarizing different types of mainstream platforms and comparing their working mechanisms in detail, and analyses of real-world data collected from Prosper and Kiva are conducted.
Abstract: P2P lending is an emerging Internet-based application where individuals can directly borrow money from each other. The past decade has witnessed the rapid development and prevalence of online P2P lending platforms, examples of which include Prosper, LendingClub, and Kiva. Meanwhile, extensive research has been done that mainly focuses on the studies of platform mechanisms and transaction data. In this article, we provide a comprehensive survey on the research about P2P lending, which, to the best of our knowledge, is the first focused effort in this field. Specifically, we first provide a systematic taxonomy for P2P lending by summarizing different types of mainstream platforms and comparing their working mechanisms in detail. Then, we review and organize the recent advances on P2P lending from various perspectives (e.g., the economics and sociology perspective, and the data-driven perspective). Finally, we propose our opinions on the prospects of P2P lending and suggest some future research directions in this field. Meanwhile, throughout this article, analyses of real-world data collected from Prosper and Kiva are also presented.

100 citations


Journal ArticleDOI
TL;DR: This article proposes a novel crowdsensing task allocation framework called SPACE-TA (SPArse Cost-Effective Task Allocation), combining compressive sensing, statistical analysis, active learning, and transfer learning, to dynamically select a small set of subareas for sensing in each timeslot (cycle), while inferring the data of unsensed subareas under a probabilistic data quality guarantee.
Abstract: Data quality and budget are two primary concerns in urban-scale mobile crowdsensing. Traditional research on mobile crowdsensing mainly takes sensing coverage ratio as the data quality metric rather than the overall sensed data error in the target-sensing area. In this article, we propose to leverage spatiotemporal correlations among the sensed data in the target-sensing area to significantly reduce the number of sensing task assignments. In particular, we exploit both intradata correlations within the same type of sensed data and interdata correlations among different types of sensed data in the sensing task. We propose a novel crowdsensing task allocation framework called SPACE-TA (SPArse Cost-Effective Task Allocation), combining compressive sensing, statistical analysis, active learning, and transfer learning, to dynamically select a small set of subareas for sensing in each timeslot (cycle), while inferring the data of unsensed subareas under a probabilistic data quality guarantee. Evaluations on real-life temperature, humidity, air quality, and traffic monitoring datasets verify the effectiveness of SPACE-TA. In the temperature-monitoring task leveraging intradata correlations, SPACE-TA requires data from only 15.5% of the subareas while keeping the inference error below 0.25°C in 95% of the cycles, reducing the number of sensed subareas by 18.0% to 26.5% compared to baselines. When multiple tasks run simultaneously, for example, for temperature and humidity monitoring, SPACE-TA can further reduce ∼10% of the sensed subareas by exploiting interdata correlations.
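
A minimal sketch of the inference step only: reconstructing unsensed subareas from a sparse set of readings via low-rank matrix completion, which stands in for the compressive sensing component described above. The active-learning cell selection, transfer learning, and probabilistic quality-guarantee loop are omitted, and the data are synthetic.

```python
# Sketch: inferring unsensed (subarea x cycle) values from a sparse set of sensed
# cells by iterating a truncated-SVD fit that keeps observed entries fixed.
import numpy as np

def complete_low_rank(M, mask, rank=2, n_iter=200):
    """Fill entries of M where mask is False with a rank-`rank` approximation."""
    X = np.where(mask, M, M[mask].mean())            # initialize missing cells with the mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # truncated SVD
        X = np.where(mask, M, X_low)                  # keep observed entries fixed
    return X

# Toy example: temperatures over (subareas x cycles), ~15% of cells sensed
rng = np.random.default_rng(0)
truth = 20 + 3 * np.outer(rng.random(30), rng.random(48))   # smooth, low-rank field
mask = rng.random(truth.shape) < 0.15
estimate = complete_low_rank(truth, mask)
print("mean abs error on unsensed cells:", np.abs(estimate - truth)[~mask].mean())
```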

62 citations


Journal ArticleDOI
TL;DR: This article presents “AirHopper,” a bifurcated malware that bridges the air gap between an isolated network and nearby infected mobile phones using FM signals, and demonstrates how valuable data can be exfiltrated from physically isolated computers to mobile phones at a distance of 1–7 meters, with an effective bandwidth of 13–60 bytes per second.
Abstract: Information is the most critical asset of modern organizations, and accordingly it is one of the resources most coveted by adversaries. When highly sensitive data is involved, an organization may resort to air gap isolation in which there is no networking connection between the inner network and the external world. While infiltrating an air-gapped network has been proven feasible in recent years, data exfiltration from an air-gapped network is still considered one of the most challenging phases of an advanced cyber-attack. In this article, we present “AirHopper,” a bifurcated malware that bridges the air gap between an isolated network and nearby infected mobile phones using FM signals. While it is known that software can intentionally create radio emissions from a video card, this is the first time that mobile phones serve as the intended receivers of the maliciously crafted electromagnetic signals. We examine the attack model and its limitations and discuss implementation considerations such as modulation methods, signal collision, and signal reconstruction. We test AirHopper in an existing workplace at a typical office building and demonstrate how valuable data such as keylogging and files can be exfiltrated from physically isolated computers to mobile phones at a distance of 1–7 meters, with an effective bandwidth of 13–60 bytes per second.

53 citations


Journal ArticleDOI
TL;DR: This article addresses the fundamental problem of learning to optimize image similarity with sparse and high-dimensional representations from large-scale training data, and proposes a novel scheme of Sparse Online Learning of Image Similarity (SOLIS), which enjoys significant advantages in computational efficiency, model sparsity, and retrieval scalability.
Abstract: Learning image similarity plays a critical role in real-world multimedia information retrieval applications, especially in Content-Based Image Retrieval (CBIR) tasks, in which an accurate retrieval of visually similar objects largely relies on an effective image similarity function. Crafting a good similarity function is very challenging because visual contents of images are often represented as feature vectors in high-dimensional spaces, for example, via bag-of-words (BoW) representations, and traditional rigid similarity functions, for example, cosine similarity, are often suboptimal for CBIR tasks. In this article, we address this fundamental problem, that is, learning to optimize image similarity with sparse and high-dimensional representations from large-scale training data, and propose a novel scheme of Sparse Online Learning of Image Similarity (SOLIS). In contrast to many existing image-similarity learning algorithms that are designed to work with low-dimensional data, SOLIS is able to learn image similarity from large-scale image data in sparse and high-dimensional spaces. Our encouraging results showed that the proposed new technique achieves highly competitive accuracy as compared to the state-of-the-art approaches but enjoys significant advantages in computational efficiency, model sparsity, and retrieval scalability, making it more practical for real-world multimedia retrieval applications.
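
A minimal sketch of online similarity learning with a sparsity-inducing truncation, illustrating the general idea rather than the exact SOLIS formulation; it assumes training triplets (query, similar, dissimilar) over feature vectors and a bilinear similarity model.

```python
# Sketch: online learning of a sparse bilinear image similarity S(p, q) = p^T W q
# from triplets, with an l1 soft-threshold after each update to keep W sparse.
# This is an illustration of sparse online similarity learning, not the SOLIS update rule.
import numpy as np

def soft_threshold(W, lam):
    return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)

def train_similarity(triplets, dim, eta=0.1, lam=1e-3, margin=1.0):
    W = np.eye(dim)                          # start from a cosine-like similarity
    for q, pos, neg in triplets:             # each is a feature vector (e.g., BoW)
        loss = margin - q @ W @ pos + q @ W @ neg
        if loss > 0:                         # hinge violated: pull pos closer, push neg away
            W += eta * np.outer(q, pos - neg)
            W = soft_threshold(W, lam)       # promote sparsity in the model
    return W

# Tiny demo with random "image" features
rng = np.random.default_rng(0)
triplets = [(rng.random(50), rng.random(50), rng.random(50)) for _ in range(100)]
W = train_similarity(triplets, dim=50)
print("nonzero entries in W:", np.count_nonzero(W), "of", W.size)
```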

52 citations


Journal ArticleDOI
TL;DR: In this paper, the search space of the graph-constrained coalition formation (GCCF) problem is modeled as a rooted tree via edge contraction, and an anytime solution algorithm, Coalition Formation for Sparse Synergies (CFSS), is proposed.
Abstract: Coalition formation typically involves the coming together of multiple, heterogeneous, agents to achieve both their individual and collective goals. In this article, we focus on a special case of coalition formation known as Graph-Constrained Coalition Formation (GCCF) whereby a network connecting the agents constrains the formation of coalitions. We focus on this type of problem given that in many real-world applications, agents may be connected by a communication network or only trust certain peers in their social network. We propose a novel representation of this problem based on the concept of edge contraction, which allows us to model the search space induced by the GCCF problem as a rooted tree. Then, we propose an anytime solution algorithm (Coalition Formation for Sparse Synergies (CFSS)), which is particularly efficient when applied to a general class of characteristic functions called m + a functions. Moreover, we show how CFSS can be efficiently parallelised to solve GCCF using a nonredundant partition of the search space. We benchmark CFSS on both synthetic and realistic scenarios, using a real-world dataset consisting of the energy consumption of a large number of households in the UK. Our results show that, in the best case, the serial version of CFSS is four orders of magnitude faster than the state of the art, while the parallel version is 9.44 times faster than the serial version on a 12-core machine. Moreover, CFSS is the first approach to provide anytime approximate solutions with quality guarantees for very large systems of agents (i.e., with more than 2,700 agents).
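
A minimal sketch of the edge-contraction representation: starting from singleton coalitions, merging two coalitions joined by an edge corresponds to contracting that edge, and recursing enumerates every graph-connected coalition structure. The CFSS bounding, anytime behavior, and m + a function evaluation are omitted; the graph and names below are illustrative.

```python
# Sketch: enumerating graph-constrained coalition structures by edge contraction.
# Each merge of two adjacent coalitions is one contraction; recursing from the
# all-singletons partition spans the rooted search tree described above.
from itertools import combinations

def coalition_structures(edges, agents):
    seen = set()

    def expand(partition):
        key = frozenset(partition)
        if key in seen:
            return
        seen.add(key)
        yield partition
        # try contracting every edge that joins two distinct coalitions
        for a, b in combinations(partition, 2):
            if any((u, v) in edges or (v, u) in edges for u in a for v in b):
                merged = (partition - {a, b}) | {a | b}
                yield from expand(merged)

    singletons = frozenset(frozenset({x}) for x in agents)
    yield from expand(singletons)

edges = {(1, 2), (2, 3), (3, 4)}            # a path graph over 4 agents
for cs in coalition_structures(edges, [1, 2, 3, 4]):
    print(sorted(tuple(sorted(c)) for c in cs))
```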

46 citations


Journal ArticleDOI
TL;DR: This article develops a general probabilistic framework for spotting trip purposes from massive taxi GPS trajectories and introduces a latent factor, POI Topic, to represent the mixed functionality of the regions, such that each origin or destination point in the city can be modeled as a mixture over POI Topics.
Abstract: What is the purpose of a trip? What are the unique human mobility patterns and spatial contexts in or near the pickup points and delivery points of trajectories for a specific trip purpose? Many prior studies have modeled human mobility patterns in urban regions; however, these analytics mainly focus on interpreting the semantic meanings of geographic topics at an aggregate level. Given the lack of information about human activities at pickup and dropoff points, it is challenging to convert the prior studies into effective tools for inferring trip purposes. To address this challenge, in this article, we study large-scale taxi trajectories from an unsupervised perspective in light of the following observations. First, the POI configurations of origin and destination regions closely relate to the urban functionality of these regions and further indicate various human activities. Second, with respect to the functionality of neighborhood environments, trip purposes can be discerned from the transitions between regions with different functionality at particular time periods. Along these lines, we develop a general probabilistic framework for spotting trip purposes from massive taxi GPS trajectories. Specifically, we first augment the origin and destination regions of trajectories by attaching neighborhood POIs. Then, we introduce a latent factor, POI Topic, to represent the mixed functionality of the regions, such that each origin or destination point in the city can be modeled as a mixture over POI Topics. In addition, considering the transitions from origins to destinations at specific time periods, the trip time is generated collaboratively from the pairwise POI Topics at both ends of the O-D pairs, constituting POI Links, and hence the trip purpose can be explained semantically by the POI Links. Finally, we present extensive experiments with the real-world data of New York City to demonstrate the effectiveness of our proposed method for spotting trip purposes; moreover, the model is validated to perform well in predicting destinations and trip time compared with all the baseline methods.

44 citations


Journal ArticleDOI
TL;DR: DisCor-T, a novel graph-based approach for discovering underlying connections of things via mining the rich content embodied in the human-thing interactions in terms of user, temporal, and spatial information, is presented.
Abstract: With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, physical things are becoming an integral part of the emerging ubiquitous Web. Finding correlations among ubiquitous things is a crucial prerequisite for many important applications such as things search, discovery, classification, recommendation, and composition. This article presents DisCor-T, a novel graph-based approach for discovering underlying connections of things via mining the rich content embodied in the human-thing interactions in terms of user, temporal, and spatial information. We model this various information using two graphs, namely a spatio-temporal graph and a social graph. Then, random walk with restart (RWR) is applied to find proximities among things, and a relational graph of things (RGT) indicating implicit correlations of things is learned. The correlation analysis lays a solid foundation contributing to improved effectiveness in things management and analytics. To demonstrate the utility of the proposed approach, we develop a flexible feature-based classification framework on top of RGT and perform a systematic case study. Our evaluation exhibits the strength and feasibility of the proposed approach.
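
A minimal sketch of random walk with restart (RWR), the proximity step named above, assuming the adjacency matrix has already been built from the user, temporal, and spatial interaction graphs; the matrix and seed here are toy values.

```python
# Sketch: random walk with restart to score how strongly each "thing" is connected
# to a query thing, given a precomputed adjacency matrix. Pure NumPy power iteration.
import numpy as np

def rwr(adjacency, seed, restart=0.15, tol=1e-8):
    A = np.asarray(adjacency, dtype=float)
    P = A / A.sum(axis=0, keepdims=True)     # column-stochastic transition matrix
    e = np.zeros(A.shape[0]); e[seed] = 1.0  # restart distribution concentrated on the query
    r = e.copy()
    while True:
        r_next = (1 - restart) * P @ r + restart * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Toy graph of 4 things; higher score = stronger implicit correlation with thing 0
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(rwr(A, seed=0))
```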

43 citations


Journal ArticleDOI
TL;DR: The idea is to train classifiers to capture the relation between individual mobility patterns and the level of privacy risk of individuals; the effectiveness of the approach is shown by an extensive experiment on real-world GPS data in two urban areas.
Abstract: Human mobility data are an important proxy to understand human mobility dynamics, develop analytical services, and design mathematical models for simulation and what-if analysis. Unfortunately, mobility data are very sensitive since they may enable the re-identification of individuals in a database. Existing frameworks for privacy risk assessment provide data providers with tools to control and mitigate privacy risks, but they suffer from two main shortcomings: (i) they have a high computational complexity; (ii) the privacy risk must be recomputed every time new data records become available and for every selection of individuals, geographic areas, or time windows. In this article, we propose a fast and flexible approach to estimate privacy risk in human mobility data. The idea is to train classifiers to capture the relation between individual mobility patterns and the level of privacy risk of individuals. We show the effectiveness of our approach by an extensive experiment on real-world GPS data in two urban areas and investigate the relations between human mobility patterns and the privacy risk of individuals.
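
A minimal sketch of the classification idea, assuming illustrative mobility features and precomputed risk labels; the paper's actual feature set and risk definition differ, and the data below are synthetic.

```python
# Sketch: train a classifier mapping per-individual mobility features to a privacy-risk
# label, so that risk can be estimated quickly for new individuals instead of re-running
# an expensive re-identification simulation. Features and labels are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
features = np.column_stack([
    rng.poisson(40, n),          # number of distinct visited locations
    rng.random(n),               # entropy of the location distribution
    rng.poisson(300, n),         # number of recorded trips
])
risk_label = (features[:, 0] > 45).astype(int)   # toy "high risk" ground truth

X_tr, X_te, y_tr, y_te = train_test_split(features, risk_label,
                                          test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```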

43 citations


Journal ArticleDOI
TL;DR: A detailed study of 1.6 million machines over an 8-month period shows that software developers are more at risk of engaging in risky cyber-behavior than other categories.
Abstract: Despite growing speculation about the role of human behavior in cyber-security of machines, concrete data-driven analysis and evidence have been lacking. Using Symantec’s WINE platform, we conduct a detailed study of 1.6 million machines over an 8-month period in order to learn the relationship between user behavior and cyber attacks against their personal computers. We classify users into 4 categories (gamers, professionals, software developers, and others, plus a fifth category comprising everyone) and identify a total of 7 features that act as proxies for human behavior. For each of the 35 possible combinations (5 categories times 7 features), we studied the relationship between each of these seven features and one dependent variable, namely the number of attempted malware attacks detected by Symantec on the machine. Our results show that there is a strong relationship between several features and the number of attempted malware attacks. Had these hosts not been protected by Symantec’s anti-virus product or a similar product, they would likely have been infected. Surprisingly, our results show that software developers are more at risk of engaging in risky cyber-behavior than other categories.

Journal ArticleDOI
TL;DR: A novel approach to automatically detecting drug abuse and dealing by utilizing multimodal data on social media is proposed; multi-task learning and decision-level fusion are employed in this framework.
Abstract: Illicit drug trade via social media sites, especially photo-oriented Instagram, has become a severe problem in recent years. As a result, tracking drug dealing and abuse on Instagram is of interest to law enforcement agencies and public health agencies. However, traditional approaches are based on manual search and browsing by trained domain experts, which suffers from the problem of poor scalability and reproducibility. In this article, we propose a novel approach to detecting drug abuse and dealing automatically by utilizing multimodal data on social media. This approach also enables us to identify drug-related posts and analyze the behavior patterns of drug-related user accounts. To better utilize multimodal data on social media, multimodal analysis methods including multi-task learning and decision-level fusion are employed in our framework. We collect three datasets using Instagram and a web search engine for training and testing our models. Experimental results on expertly labeled data have demonstrated the effectiveness of our approach, as well as its scalability and reproducibility over labor-intensive conventional approaches.

Journal ArticleDOI
TL;DR: Results indicate that the optimization model is scalable and is capable of identifying both the right mix of analyst expertise in an organization and the sensor-to-analyst allocation in order to maintain risk below a given upper bound.
Abstract: Cybersecurity threats are on the rise with evermore digitization of the information that many day-to-day systems depend upon. The demand for cybersecurity analysts outpaces supply, which calls for optimal management of the analyst resource. Therefore, a key component of the cybersecurity defense system is the optimal scheduling of its analysts. Sensor data is analyzed by automatic processing systems, and alerts are generated. A portion of these alerts is considered to be significant, which requires thorough examination by a cybersecurity analyst. Risk, in this article, is defined as the percentage of unanalyzed or not thoroughly analyzed alerts among the significant alerts by analysts. The article presents a generalized optimization model for scheduling cybersecurity analysts to minimize risk (a.k.a., maximize significant alert coverage by analysts) and maintain risk under a pre-determined upper bound. The article tests the optimization model and its scalability on a set of given sensors with varying analyst experiences, alert generation rates, system constraints, and system requirements. Results indicate that the optimization model is scalable and is capable of identifying both the right mix of analyst expertise in an organization and the sensor-to-analyst allocation in order to maintain risk below a given upper bound. Several meta-principles are presented, which are derived from the optimization model, and they further serve as guiding principles for hiring and scheduling cybersecurity analysts. The simulation studies (validation) of the optimization model outputs indicate that risk varies non-linearly with an analyst/sensor ratio, and for a given analyst/sensor ratio, the risk is independent of the number of sensors in the system.
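
A minimal sketch of the allocation step formulated as a small linear program: hours are assigned from analysts to sensors so that covered significant alerts are maximized, which is equivalent to minimizing risk as defined above. Shift structure, expertise mixes, and the simulation layer are omitted, and all rates, hours, and alert volumes are illustrative.

```python
# Sketch: analyst-to-sensor allocation as a linear program with scipy.optimize.linprog.
# Variables: x[a, s] = hours analyst a spends on sensor s, plus c[s] = significant
# alerts covered at sensor s. Maximize total coverage subject to analyst hours.
import numpy as np
from scipy.optimize import linprog

rate = np.array([12.0, 8.0])          # alerts/hour each analyst can examine thoroughly
hours = np.array([8.0, 8.0])          # shift length per analyst
alerts = np.array([60.0, 90.0, 40.0]) # significant alerts per sensor
A, S = len(rate), len(alerts)

n_x = A * S                           # x[a, s] flattened, followed by c[s]
c_obj = np.concatenate([np.zeros(n_x), -np.ones(S)])   # maximize covered alerts

A_ub, b_ub = [], []
for s in range(S):                    # coverage cannot exceed processed alerts
    row = np.zeros(n_x + S)
    for a in range(A):
        row[a * S + s] = -rate[a]
    row[n_x + s] = 1.0
    A_ub.append(row)
    b_ub.append(0.0)
for a in range(A):                    # each analyst's hours are limited
    row = np.zeros(n_x + S)
    row[a * S: a * S + S] = 1.0
    A_ub.append(row)
    b_ub.append(hours[a])

bounds = [(0, None)] * n_x + [(0, al) for al in alerts]
res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
covered = res.x[n_x:]
print("risk =", 1 - covered.sum() / alerts.sum())   # fraction of significant alerts unanalyzed
```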

Journal ArticleDOI
TL;DR: This article adapts the autoencoder technique to transfer learning and proposes a supervised representation learning method based on a double encoding-layer autoencoder, together with a new measure that can better reflect the degree of transfer difficulty and has a stronger correlation with the performance of supervised learning algorithms.
Abstract: Transfer learning has gained a lot of attention and interest in the past decade. One crucial research issue in transfer learning is how to find a good representation for instances of different domains such that the divergence between domains can be reduced with the new representation. Recently, deep learning has been proposed to learn more robust or higher-level features for transfer learning. In this article, we adapt the autoencoder technique to transfer learning and propose a supervised representation learning method based on a double encoding-layer autoencoder. The proposed framework consists of two encoding layers: one for embedding and the other one for label encoding. In the embedding layer, the distribution distance of the embedded instances between the source and target domains is minimized in terms of KL-Divergence. In the label encoding layer, label information of the source domain is encoded using a softmax regression model. Moreover, to empirically explore why the proposed framework can work well for transfer learning, we propose a new effective measure based on autoencoder to compute the distribution distance between different domains. Experimental results show that the proposed new measure can better reflect the degree of transfer difficulty and has stronger correlation with the performance from supervised learning algorithms (e.g., Logistic Regression), compared with previous ones, such as KL-Divergence and Maximum Mean Discrepancy. Therefore, in our model, we have incorporated two distribution distance measures to minimize the difference between source and target domains in the embedding representations. Extensive experiments conducted on three real-world image datasets and one text dataset demonstrate the effectiveness of our proposed method compared with several state-of-the-art baseline methods.
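
A minimal PyTorch sketch of the double encoding-layer idea: a shared embedding layer, a softmax label-encoding layer trained on source labels, and a KL-divergence term pulling the source and target embedding distributions together. Loss weights and architecture are simplified relative to the full model, and the tensors below are synthetic.

```python
# Sketch: embedding layer + label-encoding layer, trained with a source
# classification loss plus KL(source || target) over average embedding activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_embed, n_cls = 100, 32, 5
embed = nn.Linear(d_in, d_embed)          # encoding layer 1: embedding
label = nn.Linear(d_embed, n_cls)         # encoding layer 2: label encoding (softmax)
opt = torch.optim.Adam(list(embed.parameters()) + list(label.parameters()), lr=1e-3)

Xs, ys = torch.randn(256, d_in), torch.randint(0, n_cls, (256,))   # labeled source domain
Xt = torch.randn(256, d_in) + 0.5                                  # unlabeled target domain

def domain_distribution(h):
    """Average softmax activation of the embeddings, treated as a distribution."""
    return F.softmax(h, dim=1).mean(dim=0)

for _ in range(200):
    hs, ht = torch.sigmoid(embed(Xs)), torch.sigmoid(embed(Xt))
    cls_loss = F.cross_entropy(label(hs), ys)             # supervised term on the source
    p, q = domain_distribution(hs), domain_distribution(ht)
    kl = torch.sum(p * torch.log(p / q))                   # KL divergence in embedding space
    loss = cls_loss + 1.0 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final KL between domain embeddings:", kl.item())
```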

Journal ArticleDOI
TL;DR: A deep multicontext network with a fine-to-coarse strategy is proposed to model fusion features of sensitive objects in images, and a novel hierarchical method is investigated to make the model more discriminative for diverse target objects.
Abstract: Adult image and video recognition is an important and challenging problem in the real world. Low-level feature cues do not produce good enough information, especially when the dataset is very large and has various data distributions. This issue raises a serious problem for conventional approaches. In this article, we tackle this problem by proposing a deep multicontext network with fine-to-coarse strategy for adult image and video recognition. We employ deep convolutional networks to model fusion features of sensitive objects in images. Global contexts and local contexts are both taken into consideration and are jointly modeled in a unified multicontext deep learning framework. To make the model more discriminative for diverse target objects, we investigate a novel hierarchical method, and a task-specific fine-to-coarse strategy is designed to make the multicontext modeling more suitable for adult object recognition. Furthermore, some recently proposed deep models are investigated. Our approach is extensively evaluated on four different datasets. One dataset is used for ablation experiments, whereas others are used for generalization experiments. Results show significant and consistent improvements over the state-of-the-art methods.

Journal ArticleDOI
TL;DR: A Multi-modal Multi-instance Deep Network (M2DN) for microblog classification is introduced, able to handle weakly labeled microblog data arising from the incompatible meanings inside microblogs; besides predicting each microblog as one of the predefined events, social tracking is employed to extract social-related auxiliary information to enrich the testing samples.
Abstract: Social media websites have become important information sharing platforms. The rapid development of social media platforms has led to increasingly large-scale social media data, which has shown remarkable societal and marketing value. There is a need to extract important events from live social media streams. However, microblog event classification is challenging due to two facts: (i) the short/conversational nature of posts and the incompatible meanings between the text and the corresponding image in social posts, and (ii) the rapidly evolving content. In this article, we propose to conduct event classification via deep learning and social tracking. First, we introduce a Multi-modal Multi-instance Deep Network (M2DN) for microblog classification, which is able to handle weakly labeled microblog data arising from the incompatible meanings inside microblogs. Besides predicting each microblog as one of the predefined events, we propose to employ social tracking to extract social-related auxiliary information to enrich the testing samples. We extract a set of candidate-relevant microblogs in a short time window by using social connections, such as related users and geographical locations. All these selected microblogs and the testing data are formulated in a Markov Random Field model. Inference on the Markov Random Field is conducted to update the classification results of the testing microblogs. This method is evaluated on the Brand-Social-Net dataset for classification of 20 events. Experimental results and comparison with the state of the art show that the proposed method can achieve better performance for the event classification task.

Journal ArticleDOI
TL;DR: This article develops a semi-decentralized multiagent-based vehicle routing approach where vehicle agents follow the local route guidance by infrastructure agents at each intersection, and infrastructure agents perform the routes guidance by solving a route assignment problem.
Abstract: Arriving on time and total travel time are two important properties for vehicle routing. Existing route guidance approaches always consider them independently, because they may conflict with each other. In this article, we develop a semi-decentralized multiagent-based vehicle routing approach where vehicle agents follow the local route guidance by infrastructure agents at each intersection, and infrastructure agents perform the route guidance by solving a route assignment problem. It integrates the two properties by expressing them as two objective terms of the route assignment problem. Regarding arriving on time, it is formulated based on the probability tail model, which aims to maximize the probability of reaching destination before deadline. Regarding total travel time, it is formulated as a weighted quadratic term, which aims to minimize the expected travel time from the current location to the destination based on the potential route assignment. The weight for total travel time is designed to be comparatively large if the deadline is loose. Additionally, we improve the proposed approach in two aspects, including travel time prediction and computational efficiency. Experimental results on real road networks justify its ability to increase the average probability of arriving on time, reduce total travel time, and enhance the overall routing performance.
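
A toy sketch of the routing objective described above: each candidate route's probability of arriving before the deadline (the probability tail model) combined with a weighted expected-travel-time term whose weight grows when the deadline is loose. Segment travel times are assumed to be independent Gaussians, and all numbers and the scoring weights are illustrative rather than the article's exact formulation.

```python
# Sketch: score candidate routes by P(arrive before deadline) minus a travel-time
# penalty whose weight increases with deadline slack, then pick the best route.
import numpy as np
from scipy.stats import norm

def route_score(seg_means, seg_vars, deadline, slack_weight=0.05):
    mu, var = sum(seg_means), sum(seg_vars)
    p_on_time = norm.cdf((deadline - mu) / np.sqrt(var))   # probability tail model
    slack = max(deadline - mu, 0.0)
    w = slack_weight * slack          # loose deadline -> expected travel time matters more
    return p_on_time - w * (mu / deadline), p_on_time, mu

routes = {
    "highway":  ([10, 12, 8],  [9, 16, 4]),    # fast but high variance (minutes, minutes^2)
    "arterial": ([12, 13, 10], [1, 2, 1]),     # slower but reliable
}
deadline = 38.0
for name, (means, vars_) in routes.items():
    score, p, mu = route_score(means, vars_, deadline)
    print(f"{name}: P(on time)={p:.2f}, E[time]={mu:.0f} min, score={score:.2f}")
best = max(routes, key=lambda r: route_score(*routes[r], deadline)[0])
print("chosen route:", best)
```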

Journal ArticleDOI
TL;DR: It is found that small, random rating manipulations on social media posts and comments created significant changes in downstream ratings, resulting in significantly different final outcomes.
Abstract: At a time when information seekers first turn to digital sources for news and opinion, it is critical that we understand the role that social media plays in human behavior. This is especially true when information consumers also act as information producers and editors through their online activity. In order to better understand the effects that editorial ratings have on online human behavior, we report the results of two large-scale in vivo experiments in social media. We find that small, random rating manipulations on social media posts and comments created significant changes in downstream ratings, resulting in significantly different final outcomes. We found positive herding effects for positive treatments on posts, increasing the final rating by 11.02% on average, but not for positive treatments on comments. Contrary to the results of related work, we found negative herding effects for negative treatments on posts and comments, decreasing the final ratings, on average, of posts by 5.15% and of comments by 37.4%. Compared to the control group, the probability of reaching a high rating (≥ 2,000) for posts is increased by 24.6% when posts receive the positive treatment and for comments it is decreased by 46.6% when comments receive the negative treatment.

Journal ArticleDOI
TL;DR: This paper proposes a new graph regularized NMF method capable of feature learning, which outperforms state-of-the-art algorithms when applied to clustering.
Abstract: Matrix factorization is a useful technique for data representation in many data mining and machine learning tasks. Particularly, for data sets with all nonnegative entries, matrix factorization often requires that factor matrices be nonnegative, leading to nonnegative matrix factorization (NMF). One important application of NMF is for clustering with reduced dimensions of the data represented in the new feature space. In this paper, we propose a new graph regularized NMF method capable of feature learning and apply it to clustering. Unlike existing NMF methods that treat all features in the original feature space equally, our method distinguishes features by incorporating a feature-wise sparse approximation error matrix in the formulation. It enables important features to be more closely approximated by the factor matrices. Meanwhile, the graph of the data is constructed using cleaner features in the feature learning process, which integrates feature learning and manifold learning procedures into a unified NMF model. This distinctly differs from applying the existing graph-based NMF models after feature selection in that, when these two procedures are independently used, they often fail to align themselves toward obtaining a compact and most expressive data representation. Comprehensive experimental results demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art algorithms when applied to clustering.
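
For background, a minimal sketch of standard graph-regularized NMF with multiplicative updates, which the proposed model extends; the feature-wise sparse approximation error matrix and the integrated feature-learning step described above are not included here.

```python
# Sketch: graph-regularized NMF, X ~= U @ V.T, with a Laplacian penalty tr(V.T L V)
# that preserves the data manifold, solved by standard multiplicative updates.
import numpy as np

def gnmf(X, W, k, lam=1.0, n_iter=300, eps=1e-9):
    """X: (features x samples) nonnegative data; W: (samples x samples) affinity graph."""
    m, n = X.shape
    D = np.diag(W.sum(axis=1))
    rng = np.random.default_rng(0)
    U, V = rng.random((m, k)), rng.random((n, k))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)
        V *= (X.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)
    return U, V

# Toy usage: cluster samples by the dominant column of V
X = np.random.default_rng(1).random((20, 50))
W = (np.corrcoef(X.T) > 0.2).astype(float)     # simple affinity graph over samples
np.fill_diagonal(W, 0)
U, V = gnmf(X, W, k=3)
labels = V.argmax(axis=1)
print("cluster sizes:", np.bincount(labels))
```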

Journal ArticleDOI
TL;DR: This article proposes a framework to measure the carefulness of users and develops a supervised learning algorithm to estimate it based on known spammers and legitimate users, and finds that the measure is also beneficial for other applications, such as link prediction.
Abstract: Microblogging Web sites, such as Twitter and Sina Weibo, have become popular platforms for socializing and sharing information in recent years. Spammers have also discovered this new opportunity to unfairly overpower normal users with unsolicited content, namely social spams. Although it is intuitive for everyone to follow legitimate users, recent studies show that both legitimate users and spammers follow spammers for different reasons. Evidence of users seeking spammers on purpose is also observed. We regard this behavior as useful information for spammer detection. In this article, we approach the problem of spammer detection by leveraging the “carefulness” of users, which indicates how careful a user is when she is about to follow a potential spammer. We propose a framework to measure the carefulness and develop a supervised learning algorithm to estimate it based on known spammers and legitimate users. We illustrate how the robustness of the detection algorithms can be improved with aid of the proposed measure. Evaluations on two real datasets from Sina Weibo and Twitter, each with millions of users, are performed, as well as an online test on Sina Weibo. The results show that our approach indeed captures the carefulness, and it is effective for detecting spammers. In addition, we find that our measure is also beneficial for other applications, such as link prediction.

Journal ArticleDOI
TL;DR: The Bayesian Comfort Model (BCM) is proposed, a personalised thermal comfort model that uses a Bayesian network to learn from a user’s feedback, allowing it to adapt to the users’ individual preferences over time and create an optimal HVAC control algorithm that minimizes energy consumption while preserving user comfort.
Abstract: In this article, we address the interrelated challenges of predicting user comfort and using this to reduce energy consumption in smart heating, ventilation, and air conditioning (HVAC) systems. At present, such systems use simple models of user comfort when deciding on a set-point temperature. Being built using broad population statistics, these models generally fail to represent individual users’ preferences, resulting in poor estimates of the users’ preferred temperatures. To address this issue, we propose the Bayesian Comfort Model (BCM). This personalised thermal comfort model uses a Bayesian network to learn from a user’s feedback, allowing it to adapt to the users’ individual preferences over time. We further propose an alternative to the ASHRAE 7-point scale used to assess user comfort. Using this model, we create an optimal HVAC control algorithm that minimizes energy consumption while preserving user comfort. Through an empirical evaluation based on the ASHRAE RP-884 dataset and data collected in a separate deployment by us, we show that our model is consistently 13.2% to 25.8% more accurate than current models and how using our alternative comfort scale can increase our model’s accuracy. Through simulations we show that using this model, our HVAC control algorithm can reduce energy consumption by 7.3% to 13.5% while decreasing user discomfort by 24.8% simultaneously.
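
A toy illustration of the personalization principle only: treating a user's preferred set-point as a Gaussian with a population-level prior and updating it after each piece of comfort feedback. The actual BCM is a richer Bayesian network over several comfort variables; the numbers below are illustrative.

```python
# Toy illustration: conjugate normal-normal updating of a user's preferred set-point,
# showing how individual feedback gradually overrides population statistics.
prior_mean, prior_var = 22.0, 4.0      # population-level prior (assumed values)
obs_var = 1.0                          # noise in a single feedback observation (assumed)

def update(mean, var, observed_comfortable_temp):
    """Posterior after observing one temperature the user reported as comfortable."""
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mean = post_var * (mean / var + observed_comfortable_temp / obs_var)
    return post_mean, post_var

mean, var = prior_mean, prior_var
for temp in [24.5, 24.0, 25.0, 24.2]:  # this user likes it warmer than average
    mean, var = update(mean, var, temp)
    print(f"after feedback {temp:.1f}°C: preferred set-point ~ N({mean:.2f}, {var:.2f})")
```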

Journal ArticleDOI
TL;DR: This article aims to exploit geotagged photos from online photo-sharing sites for the purpose of personalized Point-of-Interest (POI) recommendation by augmenting current collaborative filtering algorithms from multiple perspectives and applying the matrix factorization algorithm to integrate the disparate sources of preference and relationship information.
Abstract: As mobile device penetration increases, it has become pervasive for images to be associated with locations in the form of geotags. Geotags bridge the gap between the physical world and the cyberspace, giving rise to new opportunities to extract further insights into user preferences and behaviors. In this article, we aim to exploit geotagged photos from online photo-sharing sites for the purpose of personalized Point-of-Interest (POI) recommendation. Owing to the fact that most users have only very limited travel experiences, data sparseness poses a formidable challenge to personalized POI recommendation. To alleviate data sparseness, we propose to augment current collaborative filtering algorithms from multiple perspectives. Specifically, hybrid preference cues comprising user-uploaded and user-favored photos are harvested to study users’ tastes. Moreover, heterogeneous high-order relationship information is jointly captured from user social networks and POI multimodal contents with hypergraph models. We also build upon the matrix factorization algorithm to integrate the disparate sources of preference and relationship information, and apply our approach to directly optimize user preference rankings. Extensive experiments on a large and publicly accessible dataset well verified the potential of our approach for addressing data sparseness and offering quality recommendations to users, especially for those who have only limited travel experiences.

Journal ArticleDOI
TL;DR: Experimental results clearly demonstrate that ST-SAGE outperforms the state-of-the-art recommender systems in terms of recommendation effectiveness, model training efficiency, and online recommendation efficiency.
Abstract: With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important mobile application, especially when users travel away from home. However, this type of recommendation is very challenging compared to traditional recommender systems. A user may visit only a limited number of spatial items, leading to a very sparse user-item matrix. This matrix becomes even sparser when the user travels to a distant place, as most of the items visited by a user are usually located within a short distance from the user’s home. Moreover, user interests and behavior patterns may vary dramatically across different time and geographical regions. In light of this, we propose ST-SAGE, a spatial-temporal sparse additive generative model for spatial item recommendation in this article. ST-SAGE considers both personal interests of the users and the preferences of the crowd in the target region at the given time by exploiting both the co-occurrence patterns and content of spatial items. To further alleviate the data-sparsity issue, ST-SAGE exploits the geographical correlation by smoothing the crowd’s preferences over a well-designed spatial index structure called the spatial pyramid. To speed up the training process of ST-SAGE, we implement a parallel version of the model inference algorithm on the GraphLab framework. We conduct extensive experiments; the experimental results clearly demonstrate that ST-SAGE outperforms the state-of-the-art recommender systems in terms of recommendation effectiveness, model training efficiency, and online recommendation efficiency.

Journal ArticleDOI
TL;DR: This article examines the representation and role of combining medical knowledge automatically derived from clinical practice and research findings for inferring answers to medical questions and considers three possible representations of medical knowledge sketches that were used by four different probabilistic inference methods to pinpoint the answers from the CPTG.
Abstract: Answering medical questions related to complex medical cases, as required in modern Clinical Decision Support (CDS) systems, requires (1) access to vast medical knowledge and (2) sophisticated inference techniques. In this article, we examine the representation and role of combining medical knowledge automatically derived from (a) clinical practice and (b) research findings for inferring answers to medical questions. Knowledge from medical practice was distilled from a vast Electronic Medical Record (EMR) system, while research knowledge was processed from biomedical articles available in PubMed Central. The knowledge automatically acquired from the EMR system took into account the clinical picture and therapy recognized from each medical record to generate a probabilistic Markov network denoted as a Clinical Picture and Therapy Graph (CPTG). Moreover, we represented the background of medical questions available from the description of each complex medical case as a medical knowledge sketch. We considered three possible representations of medical knowledge sketches that were used by four different probabilistic inference methods to pinpoint the answers from the CPTG. In addition, several answer-informed relevance models were developed to provide a ranked list of biomedical articles containing the answers. Evaluations on the TREC-CDS data show which of the medical knowledge representations and inference methods perform optimally. The experiments indicate an improvement of biomedical article ranking by 49% over state-of-the-art results.

Journal ArticleDOI
TL;DR: The first data-driven, low-cost indoor white space identification system, White-space Indoor Spectrum EnhanceR (WISER), is developed to allow secondary users to identify white spaces for communication without sensing the spectrum themselves.
Abstract: It is a promising vision to exploit white spaces, that is, vacant VHF and UHF TV channels, to meet the rapidly growing demand for wireless data services in both outdoor and indoor scenarios. While most prior works have focused on outdoor white space, the indoor story is largely open for investigation. Motivated by this observation and discovering that 70% of the spectrum demand comes from indoor environments, we carry out a comprehensive study to explore indoor white spaces. We first conduct a large-scale measurement study and compare outdoor and indoor TV spectrum occupancy at 30+ diverse locations in a typical metropolis—Hong Kong. Our results show that abundant white spaces are available in different areas in Hong Kong, which account for more than 50% and 70% of the entire TV spectrum in outdoor and indoor scenarios, respectively. Although there are substantially more white spaces indoors than outdoors, there have been very few solutions for identifying indoor white space. To fill in this gap, we develop the first data-driven, low-cost indoor white space identification system, White-space Indoor Spectrum EnhanceR (WISER), to allow secondary users to identify white spaces for communication without sensing the spectrum themselves. We design the architecture and algorithms to address the inherent challenges. We build a WISER prototype and carry out real-world experiments to evaluate its performance. Our results show that WISER can identify 30%–40% more indoor white spaces with negligible false alarms, as compared to alternative baseline approaches.

Journal ArticleDOI
TL;DR: A refined-graph regularized nonnegative matrix factorization, which employs a manifold regularized least-squares regression (MRLSR) method to compute the refined graph, is proposed; experimental results on several image datasets reveal that it outperforms 11 representative methods.
Abstract: Nonnegative matrix factorization (NMF) is one of the most popular data representation methods in the field of computer vision and pattern recognition. High-dimensional data are usually assumed to be sampled from a submanifold embedded in the original high-dimensional space. To preserve the local geometric structure of the data, a k-nearest neighbor (k-NN) graph is often constructed to encode the near-neighbor layout structure. However, the k-NN graph is based on Euclidean distance, which is sensitive to noise and outliers. In this article, we propose a refined-graph regularized nonnegative matrix factorization by employing a manifold regularized least-squares regression (MRLSR) method to compute the refined graph. In particular, each sample is represented by the whole dataset regularized with an ℓ2-norm and a Laplacian regularizer. Then an MRLSR graph is constructed based on the representative coefficients of each sample. Moreover, we present two optimization schemes to generate refined graphs by employing a hard-thresholding technique. We further propose two refined-graph regularized nonnegative matrix factorization methods and use them to perform image clustering. Experimental results on several image datasets reveal that they outperform 11 representative methods.

Journal ArticleDOI
TL;DR: This article proposes a joint probabilistic latent factor model to integrate rich information into a matrix factorization-based solution to microtopic recommendation and shows that this model significantly outperforms a few competitive baseline methods, especially in the circumstance where users have few adoption behaviors.
Abstract: Microblogging services such as Sina Weibo and Twitter allow users to create tags explicitly indicated by the # symbol. In Sina Weibo, these tags are called microtopics, and in Twitter, they are called hashtags. In Sina Weibo, each microtopic has a designated page and can be directly visited or commented on. Recommending these microtopics to users based on their interests can help users efficiently acquire information. However, it is non-trivial to recommend microtopics to users to satisfy their information needs. In this article, we investigate the task of personalized microtopic recommendation, which exhibits two challenges. First, users usually do not give explicit ratings to microtopics. Second, there exists rich information about users and microtopics, for example, users' published content and biographical information, but it is not clear how to best utilize such information. To address the above two challenges, we propose a joint probabilistic latent factor model to integrate rich information into a matrix factorization-based solution to microtopic recommendation. Our model builds on top of collaborative filtering, content analysis, and feature regression. Using two real-world datasets, we evaluate our model with different kinds of content and contextual information. Experimental results show that our model significantly outperforms a few competitive baseline methods, especially in the circumstance where users have few adoption behaviors.

Journal ArticleDOI
TL;DR: The technological aspects of cyber, such as computer technologies, access to information and systems, greater connectivity among subsystems, and the combined effect of all these aspects on a growing list of diverse spheres, expose the world to unprecedented risks.
Abstract: We are living in a unique period of history, and the current technology revolution will be among the most dramatic societal transformations remembered by humanity. The important changes associated with the invention of the engine, electricity, and the printing press gradually transformed society in the western world over a period of over a hundred years. The changes accompanying the current revolution have significantly altered the lives of average citizens across the globe in less than a generation. This is unprecedented. In the past, revolutions spanned decades, enabling the establishment of processes and systems. For example, a language that supports the new revolution evolves, and leaders emerge, with the fresh perspective required by the revolutionary changes. New disciplines are created and new occupations are developed to support the changes. The present revolution is taking place at such a high speed that such enabling processes and systems have not yet been established, let alone developed or matured, and they will continue to be created well into the future. Over the past three decades, an important new vector of the technology revolution has emerged: the cyber domain. In particular, the technological aspects of cyber, such as computer technologies, access to information and systems, greater connectivity among subsystems, and the combined effect of all these aspects on a growing list of diverse spheres, expose the world to unprecedented risks. Academia and the intellectual

Journal ArticleDOI
TL;DR: A unified model is proposed to jointly regularize the source consistency and graph-constrained relatedness among tasks and is able to learn the attribute-specific and attribute-sharing features via graph-guided fused lasso penalty.
Abstract: Learning user attributes from mobile social media is a fundamental basis for many applications, such as personalized and targeting services. A large and growing body of literature has investigated the user attribute learning problem. However, far too little attention has been paid to jointly considering the dual heterogeneities of user attribute learning by harvesting multiple social media sources. In particular, user attributes are complementarily and comprehensively characterized by multiple social media sources, including footprints from Foursquare, daily updates from Twitter, professional careers from LinkedIn, and photo posts from Instagram. On the other hand, attributes are inter-correlated in a complex way rather than being independent of each other, and highly related attributes may share similar feature sets. Towards this end, we propose a unified model to jointly regularize the source consistency and graph-constrained relatedness among tasks. As a byproduct, it is able to learn the attribute-specific and attribute-sharing features via a graph-guided fused lasso penalty. In addition, we theoretically analyze its optimization. Extensive evaluations on a real-world dataset thoroughly demonstrated the effectiveness of our proposed model.

Journal ArticleDOI
TL;DR: An event simulation mechanism is introduced, which makes it possible to conduct a comprehensive performance study of the proposed SmartEdge algorithm, and the quality of the detected patterns is measured, in a systematic way, in terms of timeliness and location accuracy.
Abstract: Given a spatial field and the traffic flow between neighboring locations, the early detection of gathering events (EDGE) problem aims to discover and localize a set of most likely gathering events. It is important for city planners to identify emerging gathering events that might cause public safety or sustainability concerns. However, it is challenging to solve the EDGE problem due to the numerous candidate gathering footprints in a spatial field and the nontrivial task of balancing pattern quality and computational efficiency. Prior solutions to the EDGE problem lack the ability to describe the dynamic flow of traffic and the potential gathering destinations because they rely on static or undirected footprints. In our recent work, we modeled the footprint of a gathering event as a Gathering Graph (G-Graph), where the root of the directed acyclic G-Graph is the potential destination and the directed edges represent the most likely paths traffic takes to move toward the destination. We also proposed an efficient algorithm called SmartEdge to discover the most likely nonoverlapping G-Graphs in the given spatial field. However, it is challenging to perform a systematic performance study of the proposed algorithm due to the unavailability of ground truth for gathering events. In this article, we introduce an event simulation mechanism, which makes it possible to conduct a comprehensive performance study of the SmartEdge algorithm. We measure the quality of the detected patterns, in a systematic way, in terms of timeliness and location accuracy. The results show that, on average, the SmartEdge algorithm is able to detect patterns within one grid cell (less than 500 meters) of the simulated events and to detect patterns of the simulated events as early as 10 minutes prior to the first arrival at the gathering event.