
Showing papers in "World Wide Web in 2016"


Journal ArticleDOI
TL;DR: This paper formally defines the problem of bottleneck-aware social event arrangement (BSEA), and devises two greedy heuristic algorithms, Greedy and Random+Greedy, and a local-search-based optimization technique to solve the BSEA problem.
Abstract: With the popularity of mobile computing and social media, various kinds of online event-based social network (EBSN) platforms, such as Meetup, Plancast and Whova, are gaining in prominence. A fundamental task of managing EBSN platforms is to recommend suitable social events to potential users according to the following three factors: spatial locations of events and users, attribute similarities between events and users, and friend relationships among users. However, none of the existing approaches considers all the aforementioned influential factors when recommending users to suitable events. Furthermore, the existing recommendation strategies neglect the bottleneck cases of the global recommendation. Thus, it is impossible for the existing recommendation solutions to be fair in real-world scenarios. In this paper, we first formally define the problem of bottleneck-aware social event arrangement (BSEA), which is proven to be NP-hard. To solve the BSEA problem approximately, we devise two greedy heuristic algorithms, Greedy and Random+Greedy, and a local-search-based optimization technique. In particular, the Greedy algorithm is more effective but less efficient than the Random+Greedy algorithm in most cases. Moreover, a variant of the BSEA problem, called the Extended BSEA problem, is studied, and the above solutions can be extended to address this variant easily. Finally, we conduct extensive experiments on real and synthetic datasets which verify the efficiency and effectiveness of our proposed algorithms.
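As a rough illustration of the arrangement task, the following Python sketch shows one possible bottleneck-aware greedy assignment; it is not the paper's Greedy or Random+Greedy algorithm, and the utility scores, names and capacities are invented.

```python
# Hypothetical sketch of a bottleneck-aware greedy assignment (not the paper's
# Greedy/Random+Greedy algorithms). utility[u][e] is assumed to already combine
# spatial distance, attribute similarity and friendship.
def greedy_arrangement(users, events, capacity, utility):
    """Assign each user to one event while trying to keep the minimum utility high."""
    remaining = dict(capacity)                      # event -> remaining seats
    assignment = {}
    # Handle the most constrained users first: those whose best option is worst.
    order = sorted(users, key=lambda u: max(utility[u][e] for e in events))
    for u in order:
        feasible = [e for e in events if remaining[e] > 0]
        if not feasible:
            break
        best = max(feasible, key=lambda e: utility[u][e])
        assignment[u] = best
        remaining[best] -= 1
    return assignment

# Toy usage with invented utilities in [0, 1]
utility = {"u1": {"e1": 0.9, "e2": 0.3}, "u2": {"e1": 0.8, "e2": 0.7}}
print(greedy_arrangement(["u1", "u2"], ["e1", "e2"], {"e1": 1, "e2": 1}, utility))
```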

68 citations


Journal ArticleDOI
TL;DR: This work develops a CSP mining approach, eCSP, by using an effective CSP-tree structure, which improves the PrefixSpan tree (Pei et al., 2001), and proposes some heuristics and interestingness filtering criteria, and integrates them into the C SP-tree seamlessly to reduce the search space and to find business-interesting patterns as well.
Abstract: Data mining for client behavior analysis has become increasingly important in business; however, further analysis of transactions and sequential behaviors would be of even greater value, especially in the financial service industry, such as banking and insurance, and in government. In a real-world business application of taxation debt collection, in order to understand the internal relationship between taxpayers' sequential behaviors (payment, lodgment and actions) and compliance with their debt, we need to find the contrast sequential behavior patterns between compliant and non-compliant taxpayers. Contrast Patterns (CP) are defined as the itemsets showing the difference/discrimination between two classes/datasets (Dong and Li, 1999). However, the existing CP mining methods, which can only mine itemset patterns, are not suitable for mining sequential patterns, such as time-ordered transactions in taxpayer sequential behaviors. Little work has been conducted on Contrast Sequential Pattern (CSP) mining so far. Therefore, to address this issue, we develop a CSP mining approach, eCSP, using an effective CSP-tree structure, which improves the PrefixSpan tree (Pei et al., 2001) for mining contrast patterns. We propose some heuristics and interestingness filtering criteria, and integrate them into the CSP-tree seamlessly to reduce the search space and to find business-interesting patterns as well. The performance of the proposed approach is evaluated on three real-world datasets. In addition, we use a case study to show how to implement the approach to analyse taxpayer behaviour. The results show a very promising performance and convincing business value.
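To make the contrast notion concrete, the sketch below scores a candidate sequential pattern by its support ratio between the two classes; this is only the scoring idea, not the eCSP/CSP-tree algorithm, and the toy sequences are invented.

```python
# Illustrative sketch of the contrast idea behind CSP mining: a sequential pattern
# is "contrasting" when its support differs sharply between the compliant and
# non-compliant sequence databases. This shows only the scoring step, not the
# eCSP/CSP-tree algorithm; the toy sequences are hypothetical.
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)      # order-preserving containment

def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)

def contrast_score(pattern, db_pos, db_neg, eps=1e-9):
    """Growth-rate style score: support in one class divided by support in the other."""
    return support(pattern, db_pos) / (support(pattern, db_neg) + eps)

compliant = [["lodge", "pay"], ["lodge", "action", "pay"]]
non_compliant = [["lodge", "action"], ["action"]]
print(contrast_score(["lodge", "pay"], compliant, non_compliant))
```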

42 citations


Journal ArticleDOI
TL;DR: A non-intrusive dynamic fault-tolerant model is proposed that analyses several levels of information (environment state, execution state, and QoS criteria) to dynamically decide the best recovery strategy when a failure occurs.
Abstract: During the execution of Composite Web Services (CWS), a component Web Service (WS) can fail and can be repaired with strategies such as WS retry, substitution, compensation, roll-back, replication, or checkpointing. Each strategy behaves differently in different scenarios, impacting the CWS QoS. We propose a non-intrusive dynamic fault-tolerant model that analyses several levels of information: environment state, execution state, and QoS criteria, to dynamically decide the best recovery strategy when a failure occurs. We present an experimental study to evaluate the model and determine the impact of different recovery strategies on QoS parameters, and to evaluate the intrusiveness of our strategy during the normal execution of CWSs.
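The decision step can be pictured as a rule over the observed failure state, as in the hedged sketch below; the strategy names follow the abstract, while the state fields and rule order are hypothetical.

```python
# Hedged sketch of dynamic recovery-strategy selection for a failed component WS.
# The strategy names follow the abstract; the state fields and the rule order are
# hypothetical simplifications of the paper's decision model.
def choose_recovery(failure):
    if failure["transient"] and failure["retries_left"] > 0:
        return "retry"
    if failure["equivalent_services"]:
        return "substitution"            # swap in an equivalent WS
    if failure["checkpoint_available"] and failure["time_budget_left"] > 0:
        return "checkpointing"           # resume later from saved state
    if failure["compensable"]:
        return "compensation"            # undo completed work
    return "roll-back"

print(choose_recovery({"transient": False, "retries_left": 0,
                       "equivalent_services": ["WS_b"],
                       "checkpoint_available": False,
                       "time_budget_left": 0, "compensable": True}))
```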

30 citations


Journal ArticleDOI
TL;DR: This paper argues that the succession-view better reflects the essence of semantic change and proposes a successive framework for automatic semantic change detection, which analyzes the semantic change at both the word and individual-sense level inside a word by transforming the task into change pattern detection over time series data.
Abstract: The prevalence of creativity in the emergent online media language calls for a more effective computational approach to semantic change. Two divergent metaphysical understandings are found with the task: the juxtaposition-view of change and the succession-view of change. This paper argues that the succession-view better reflects the essence of semantic change and proposes a successive framework for automatic semantic change detection. The framework analyzes semantic change at both the word level and the individual-sense level inside a word by transforming the task into change pattern detection over time series data. At the word level, the framework models the word's semantic change with an S-shaped model and successfully correlates change patterns with classical semantic change categories such as broadening, narrowing, new word coining, metaphorical change, and metonymic change. At the sense level, the framework measures the conventionality of individual senses and distinguishes categories of temporary word usage, basic sense, novel sense and disappearing sense, again with an S-shaped model. Experiments at both levels yield an increased precision rate compared with the baseline, supporting the succession-view of semantic change.
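As an illustration of the S-shaped modelling step, the sketch below fits a logistic curve to a hypothetical usage-frequency time series; the data and parameter names are invented.

```python
# Minimal sketch of the S-shaped modelling step: fit a logistic curve to a
# relative-frequency time series of a word (or sense) and read off its growth
# parameters. The series and parameter names are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, ceiling, rate, midpoint):
    return ceiling / (1.0 + np.exp(-rate * (t - midpoint)))

t = np.arange(10)                                   # time steps, e.g. months
freq = np.array([0.01, 0.02, 0.03, 0.06, 0.12,      # hypothetical usage share
                 0.25, 0.40, 0.48, 0.50, 0.51])

(ceiling, rate, midpoint), _ = curve_fit(logistic, t, freq, p0=[0.5, 1.0, 5.0])
print(f"ceiling={ceiling:.2f}, growth rate={rate:.2f}, midpoint={midpoint:.1f}")
```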

29 citations


Journal ArticleDOI
TL;DR: A highly expressive personalized route planning query, the Personalized and Sequenced Route (PSR) Query, which considers both personalization and the sequenced constraint, is defined, and a novel framework to deal with the query is proposed.
Abstract: Online trip planning is a popular service that has greatly benefited many people. However, little attention has been paid to personalized trip planning, which is even more useful. In this paper, we define a highly expressive personalized route planning query, the Personalized and Sequenced Route (PSR) Query, which considers both personalization and the sequenced constraint, and propose a novel framework to deal with the query. The framework consists of three phases: guessing, crossover and refinement. The guessing phase strives to obtain one high quality route as the baseline to bound the search space into a circular region. The crossover phase heuristically improves the quality of multiple guessed routes via a modified genetic algorithm, which further narrows the radius of the search space. The refinement phase backwardly examines each candidate point and partial route to rule out impossible ones. The combination of these phases can efficiently and effectively narrow our search space via a few iterations. In the experiments, we first show our evaluation results for each phase separately, proving the effectiveness of each phase. Then, we present the evaluation results of their combination, which offers insight into the merits of the proposed framework.

25 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel cross-media distance metric learning framework based on sparse feature selection and multi-view matching which harmonizes different correlations with an iteration process and builds Cross-media Semantic Space for cross- media distance measure.
Abstract: With the explosion of multimedia data, different types of multimedia data often coexist in web repositories. Accordingly, it is increasingly important to explore the underlying intricate cross-media correlation instead of single-modality distance measures, so as to improve multimedia semantics understanding. Cross-media distance metric learning focuses on correlation measure between multimedia data of different modalities. However, the existence of content heterogeneity and the semantic gap makes it very challenging to measure cross-media distance. In this paper, we propose a novel cross-media distance metric learning framework based on sparse feature selection and multi-view matching. First, we employ sparse feature selection to select a subset of relevant features and remove redundant features for high-dimensional image features and audio features. Secondly, we maximize the canonical coefficient during image-audio feature dimension reduction for cross-media correlation mining. Thirdly, we further construct a Multi-modal Semantic Graph to find embedded manifold cross-media correlation. Moreover, we fuse the canonical correlation and the manifold information into multi-view matching, which harmonizes different correlations with an iterative process, and build a Cross-media Semantic Space for cross-media distance measure. The experiments are conducted on an image-audio dataset for cross-media retrieval. Experimental results are encouraging and show that our approach is effective.
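The canonical-correlation step can be sketched with off-the-shelf CCA, as below; the feature matrices are random stand-ins and the remaining stages of the framework are omitted.

```python
# Sketch of the canonical-correlation step for image-audio correlation mining
# (one stage of the framework; sparse feature selection, the Multi-modal Semantic
# Graph and multi-view matching are omitted). The feature matrices are random
# stand-ins for real image/audio features.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.normal(size=(200, 64))     # image features after feature selection
X_aud = rng.normal(size=(200, 32))     # audio features for the same 200 items

cca = CCA(n_components=10)
Z_img, Z_aud = cca.fit_transform(X_img, X_aud)

def cross_media_distance(i, j):
    """Distance between item i's image view and item j's audio view in the shared space."""
    return float(np.linalg.norm(Z_img[i] - Z_aud[j]))

print(cross_media_distance(0, 0), cross_media_distance(0, 1))
```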

23 citations


Journal ArticleDOI
TL;DR: This paper first proposes the Result-Oriented Framework (ROF), which is easy to implement and significantly improves both the precision and the recall of query intent mining, and then proposes the Topic-Oriented Framework (TOF), in order to significantly reduce the online time and memory consumption of query intent mining.
Abstract: Understanding the users' latent intents behind search queries is critical for search engines. Hence, there has been increasing attention on studying how to effectively mine the intents of search queries by analyzing search engine query logs. However, we observe that the information richness of query logs is not fully utilized so far, and this underuse of information heavily limits the performance of the existing methods. In this paper, we tackle the problem of query intent mining by taking full advantage of the information richness of the query log from a multi-dimensional perspective. Specifically, we capture the latent relations between search queries via three different dimensions: the URL dimension, the session dimension and the term dimension. We first propose the Result-Oriented Framework (ROF), which is easy to implement and significantly improves both the precision and the recall of query intent mining. We further propose the Topic-Oriented Framework (TOF), in order to significantly reduce the online time and memory consumption of query intent mining. TOF employs the Query Log Topic Model (QLTM), which derives latent topics from the query log to integrate the information of the three dimensions in a principled way. The latent topics are considered as low-dimensional descriptions of the query relations and serve as the basis of efficient online query intent mining. We conduct extensive experiments on a major commercial search engine query log. Experimental results show that the two frameworks significantly outperform the state-of-the-art methods with respect to a variety of metrics.

22 citations


Journal ArticleDOI
TL;DR: A local threshold-based deterministic algorithm and a sketch-based sampling approximate algorithm are proposed to reduce the communication cost of tracking distributed probabilistic frequent items (TDPF).
Abstract: Tracking frequent items (also called heavy hitters) is one of the most fundamental queries in real-time data due to its wide applications, such as logistics monitoring, association rule based analysis, etc. Recently, with the growing popularity of Internet of Things (IoT) and pervasive computing, a large amount of real-time data is usually collected from multiple sources in a distributed environment. Unfortunately, data collected from each source is often uncertain due to various factors: imprecise reading, data integration from multiple sources (or versions), transmission errors, etc. In addition, due to network delay and limited by the economic budget associated with large-scale data communication over a distributed network, an essential problem is to track the global frequent items from all distributed uncertain data sites with the minimum communication cost. In this paper, we focus on the problem of tracking distributed probabilistic frequent items (TDPF). Specifically, given $k$ distributed sites $S = \{S_1, \ldots, S_k\}$, each of which is associated with an uncertain database $\mathcal{D}_i$ of size $n_i$, a centralized server (or called a coordinator) $H$, a minimum support ratio $r$, and a probabilistic threshold $t$, we are required to find a set of items with minimum communication cost, each item $X$ of which satisfies $\Pr(\sup(X) \geq r \times N) > t$, where $\sup(X)$ is a random variable to describe the support of $X$ and $N = \sum_{i=1}^{k} n_i$. In order to reduce the communication cost, we propose a local threshold-based deterministic algorithm and a sketch-based sampling approximate algorithm, respectively. The effectiveness and efficiency of the proposed algorithms are verified with extensive experiments on both real and synthetic uncertain datasets.
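The probabilistic-frequency condition can be evaluated with a standard dynamic program, assuming the common tuple-independence model (an assumption not stated in the abstract); the sketch below shows only this test, not the communication protocol.

```python
# Sketch of the probabilistic-frequency test behind TDPF. Assuming the common
# tuple-independence model (an assumption, not stated above), sup(X) follows a
# Poisson-binomial distribution and Pr(sup(X) >= r*N) can be computed by dynamic
# programming. The distributed/communication logic is omitted.
import math

def prob_frequent(probs, r, N, t):
    """probs[i] = probability that uncertain tuple i contains item X (all sites pooled)."""
    dp = [1.0]                           # dp[c] = Pr(exactly c tuples so far contain X)
    for p in probs:
        new = [0.0] * (len(dp) + 1)
        for c, q in enumerate(dp):
            new[c] += q * (1.0 - p)
            new[c + 1] += q * p
        dp = new
    threshold = math.ceil(r * N)
    return sum(dp[threshold:]) > t

# X appears in 5 uncertain tuples out of N = 10; is Pr(sup(X) >= 0.3 * 10) > 0.5 ?
print(prob_frequent([0.9, 0.8, 0.7, 0.6, 0.5], r=0.3, N=10, t=0.5))
```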

22 citations


Journal ArticleDOI
TL;DR: It is shown that the seeds can greatly influence the performance of crawlers, and a new framework for automatically finding seeds is proposed, which results in higher harvest rates and an improved topic coverage by providing crawlers a seed set that is large and varied.
Abstract: Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that by providing crawlers a seed set that is large and varied, they not only obtain higher harvest rates but also an improved topic coverage.

21 citations


Journal ArticleDOI
TL;DR: This work presents an efficient way to implement prediction by partial matching as simple searches in the observation sequence, which can use a high number of states in long web page access histories and higher-order Markov chains at low complexity.
Abstract: In this work we propose a prediction by partial matching technique to anticipate and prefetch web pages and files accessed via browsers. The goal is to reduce the delays necessary to load the web pages and files visited by the users. Since the number of visited web pages can be high, tree-based and table-based implementations can be inefficient from the representation point of view. Therefore, we present an efficient way to implement prediction by partial matching as simple searches in the observation sequence. Thus, we can use a high number of states in long web page access histories and higher-order Markov chains at low complexity. The time evaluations show that the proposed PPM implementation is significantly more efficient than previous implementations. We have enhanced the predictor with a confidence mechanism, implemented as saturating counters, which dynamically classifies web pages as predictable or unpredictable. Predictions are generated selectively only from web pages classified as predictable, thus improving the accuracy. The experiments show that prediction by partial matching of order 4 with a history of 500 web pages is optimal.
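The search-based PPM idea can be sketched in a few lines, as below; this is a minimal illustration without the saturating-counter confidence mechanism.

```python
# Minimal sketch of PPM implemented as simple searches in the observation
# sequence (no tree or table): look up the last k visited pages in the history,
# predict the most frequent follower, and back off to a shorter context if the
# longer one never occurred. The confidence mechanism is omitted.
from collections import Counter

def ppm_predict(history, max_order=4):
    for k in range(max_order, 0, -1):
        context = history[-k:]
        followers = Counter(
            history[i + k]
            for i in range(len(history) - k)
            if history[i:i + k] == context
        )
        if followers:
            return followers.most_common(1)[0][0]
    return None

visits = ["home", "news", "sports", "news", "sports", "news"]
print(ppm_predict(visits, max_order=4))   # the page most likely to follow the current context
```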

19 citations


Journal ArticleDOI
TL;DR: A method, Gren, to recognize and normalize mobile phone names from domain-specific Internet forums, which first generates candidate names and then applies a Conditional Random Field-based name recognizer.
Abstract: Collecting users' feedback on products from Internet forums is challenging because users often mention a product with informal abbreviations or nicknames. In this paper, we propose a method named Gren to recognize and normalize mobile phone names from domain-specific Internet forums. Instead of directly recognizing phone names from sentences as in most named entity recognition tasks, we propose an approach to generating candidate names as the first step. The candidate names capture short forms, spelling variations, and nicknames of products, but are not noise free. To predict whether a candidate name mention in a sentence indeed refers to a specific phone model, a Conditional Random Field (CRF)-based name recognizer is developed. The CRF model is trained by using a large set of sentences obtained in a semi-automatic manner with minimal manual labeling effort. Lastly, a rule-based name normalization component maps a recognized name to its formal form. Evaluated on more than 4000 manually labeled sentences with about 1000 phone name mentions, Gren outperforms all baseline methods. Specifically, it achieves precision and recall of 0.918 and 0.875 respectively, with the best feature setting. We also provide detailed analysis of the intermediate results obtained by each of the three components in Gren.

Journal ArticleDOI
TL;DR: This work investigates how to build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding area.
Abstract: A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering. In order to excavate and make available meaningful data from the Hidden Web, previous work has focused on developing query generation techniques that aim at downloading all the content of a given Hidden Web site with the minimum cost. However, there are circumstances where only a specific part of such a site might be of interest. For example, a politics portal should not have to waste bandwidth or processing power to retrieve sports articles just because they are residing in databases also containing documents relevant to politics. In cases like this one, we need to make the best use of our resources in downloading only the portion of the Hidden Web site that we are interested in. We investigate how we can build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding area. In this regard, we present an approach that progresses iteratively and analyzes the returned results in order to extract terms that capture the essence of the topic we are interested in. We propose a number of different crawling policies and we experimentally evaluate them with data from four popular sites. Our approach is able to download most of the sought content in all cases, using a significantly smaller number of queries compared to existing approaches.

Journal ArticleDOI
TL;DR: A new unconstrained dataset, called WEB-interaction, collected from the Internet is introduced, which better represents realistic scenes and poses far more challenges than existing datasets, and the evaluation results reveal that MBHx and MBHy of the Motion Boundary Histogram are important feature descriptors for interaction recognition and MBHx carries relatively more dominant information.
Abstract: As an important task in computer vision, interaction recognition has attracted extensive attention due to its wide range of potential applications. Existing methods mainly focus on the interaction recognition problem on constrained datasets with few variations in scenes, viewpoints and background clutter, built for experimental purposes. The performance of recently proposed methods on the available constrained datasets almost approaches saturation, which makes them inadequate for further evaluating the robustness of new methods. In this paper, we introduce a new unconstrained dataset, called WEB-interaction, collected from the Internet. Our WEB-interaction better represents realistic scenes and poses far more challenges than existing datasets. Besides, we evaluate the state-of-the-art pipeline of interaction recognition on both the WEB-interaction and UT-interaction datasets. The evaluation results reveal that MBHx and MBHy of the Motion Boundary Histogram (MBH) are important feature descriptors for interaction recognition, and MBHx carries relatively more dominant information. For the fusion strategy, late fusion benefits performance more than early fusion. The effects of filming conditions are also evaluated on the WEB-interaction dataset. In addition, the best average precision (AP) result of different features on our WEB-interaction dataset is 44.2 % and the mean is around 38 %. Compared to the UT-interaction dataset, ours leaves much more room for improvement, which makes it better suited to promoting new methods.

Journal ArticleDOI
TL;DR: A probabilistic approach to model the semantic uncertainty of data services: services along with their possible semantic views are represented in a probabilistic service registry, and the correlations among service semantics are modeled through a directed probabilistic graphical model (Bayesian network).
Abstract: Currently, a good portion of the datasets on the Internet are accessed through data services, where users' queries are answered as a composition of multiple data services. Defining the semantics of data services is the first step towards automating their composition. An interesting approach to defining the semantics of data services is to describe them as semantic views over a domain ontology. However, defining such semantic views cannot always be done with certainty, especially when the service's returned data are too complex. In such cases, a data service is associated with several possible semantic views. In addition, complex correlations may be present among these possible semantic views, mainly when data services encapsulate the same data sources. In this paper, we propose a probabilistic approach to model the semantic uncertainty of data services. Services along with their possible semantic views are represented in a probabilistic service registry. The correlations among service semantics are modeled through a directed probabilistic graphical model (Bayesian network). Based on our modeling, we study the problem of composing correlated data services to answer a user query, and propose an efficient method to compute the different possible compositions and their probabilities.

Journal ArticleDOI
TL;DR: This paper addresses the challenging problem of processing top-k dominating queries in distributed networks and proposes a method for efficient top-k dominating data retrieval, which avoids redundant communication cost and latency, along with an approximate version of the proposed method, which further reduces communication cost.
Abstract: Due to the recent massive data generation, preference queries are becoming increasingly important for users because such queries retrieve only a small number of preferable data objects from a huge multi-dimensional dataset. A top-k dominating query, which retrieves the k data objects dominating the highest number of data objects in a given dataset, is particularly important in supporting multi-criteria decision making because this query can find interesting data objects in an intuitive way, exploiting the advantages of top-k and skyline queries. Although efficient algorithms for top-k dominating queries have been studied over centralized databases, there are no studies which deal with top-k dominating queries in distributed environments. Data management is becoming increasingly distributed, so it is necessary to support the processing of top-k dominating queries in distributed environments. In this paper, we address, for the first time, the challenging problem of processing top-k dominating queries in distributed networks and propose a method for efficient top-k dominating data retrieval, which avoids redundant communication cost and latency. Furthermore, we also propose an approximate version of our proposed method, which further reduces communication cost. Extensive experiments on both synthetic and real data have demonstrated the efficiency and effectiveness of our proposed methods.
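For reference, a centralized baseline for the query itself looks as follows; the distributed, communication-aware method of the paper is not reproduced, and smaller attribute values are assumed to be better.

```python
# Centralized baseline for a top-k dominating query (the distributed,
# communication-aware method of the paper is not reproduced here). A point
# dominates another if it is no worse in every dimension and strictly better in
# at least one; smaller values are assumed to be better.
import heapq

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def top_k_dominating(points, k):
    scores = [(sum(dominates(p, q) for j, q in enumerate(points) if j != i), p)
              for i, p in enumerate(points)]
    return heapq.nlargest(k, scores, key=lambda s: s[0])

data = [(1, 2), (2, 1), (3, 3), (2, 2), (4, 1)]
print(top_k_dominating(data, k=2))        # [(dominance count, point), ...]
```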

Journal ArticleDOI
TL;DR: It is concluded that Web Accessibility (WA) implementation is an important and affordable way that could be used to increase the possibilities to explore online information and to meet the Corporate Social Responsibility (CSR) demands of stakeholders.
Abstract: Web Accessibility (WA) is an attribute that must be taken increasingly into account in the design of websites. In this paper we assess whether the drivers of more accessible online information differ among banks using a structural equation modelling approach. It is concluded that Web Accessibility (WA) implementation is an important and affordable way to increase the possibilities to explore online information and to meet the Corporate Social Responsibility (CSR) demands of stakeholders. Our results show that smaller banks are more prone to WA implementation, which may help them to differentiate themselves from their competitors and create strategic advantages.

Journal ArticleDOI
TL;DR: This paper argues that the quality of Web documents varies significantly, and proposes the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges.
Abstract: Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
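The representation itself is easy to sketch, as below; the choices of n and window size are illustrative, and the class graphs and graph-similarity classification are not shown.

```python
# Sketch of the n-gram graph representation: nodes are character n-grams and an
# edge connects two n-grams that co-occur within a sliding window, weighted by
# how often they do. Class graphs and graph-similarity classification are not
# shown; n and the window size are illustrative choices.
from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[frozenset((g, grams[j]))] += 1     # undirected weighted edge
    return edges

g = ngram_graph("text classification", n=3)
print(len(g), list(g.items())[:3])
```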

Journal ArticleDOI
TL;DR: A framework AcT is presented that supports two different accuracy-aware personalized crawling techniques to attain the optimal accuracy level of retrieving the information, and a greedy strategy is proposed to discover the optimal crawling frequency and crawling schedule for the second scheme.
Abstract: News aggregation websites collect news from various online sources using crawling techniques and provide a unified view to millions of users. Since news sources update information frequently, aggregators have to recrawl them from time to time in order to durably archive the news content. The majority of recrawling techniques assume the availability of unlimited resources and zero operating cost. However, in reality, resources and budget are limited and it is impossible to crawl every news source at every point of time. To the best of our knowledge, none of the existing techniques discusses a crawling strategy that can retrieve the maximum amount of information in a resource/budget constrained environment. In this paper, we present a framework, AcT, that supports two different accuracy-aware personalized crawling techniques to attain the optimal accuracy level of retrieving the information. Given the crawling frequency as a resource constraint, the first scheme aims to find the optimal schedule that maximizes the accuracy. In the second scheme, we optimize the crawling frequency and the corresponding crawling schedule for a given accuracy level. We propose a supervised technique that monitors each news source for a particular time period and collects the news update patterns. The news update patterns are later analyzed using mixed integer programming to discover the optimal crawling schedule for the first scheme, whereas a greedy strategy is proposed to discover the optimal crawling frequency and crawling schedule for the second scheme. We develop a crawler for 87 news sources and perform a series of experiments to demonstrate the quality and efficiency of our proposed techniques against benchmark strategies.
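The budget-constrained scheduling idea can be pictured with a simple greedy sketch, as below; it stands in for, and is much simpler than, the paper's mixed integer program and greedy strategy, and the update rates are hypothetical.

```python
# Illustrative sketch of the budget-constrained scheduling idea (a simple greedy,
# not the paper's mixed integer program): given each source's expected number of
# updates per time slot, spend a limited number of crawls on the (source, slot)
# pairs expected to capture the most new content. The rates are hypothetical.
import heapq

def greedy_schedule(update_rates, budget):
    candidates = [(-rate, src, slot)
                  for src, slots in update_rates.items()
                  for slot, rate in enumerate(slots)]
    heapq.heapify(candidates)
    schedule = []
    while candidates and len(schedule) < budget:
        neg_rate, src, slot = heapq.heappop(candidates)
        schedule.append((src, slot, -neg_rate))
    return schedule

rates = {"sourceA": [5, 1, 0, 8], "sourceB": [2, 2, 2, 2]}
print(greedy_schedule(rates, budget=3))   # crawl the three most active (source, slot) pairs
```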

Journal ArticleDOI
TL;DR: This paper proposes a formal distributed network monitoring approach that analyzes the packets exchanged by the entities in order to prove that a system is acting in a trustworthy manner.
Abstract: Collaborative systems are growing in use and in popularity. The need to improve methods for interoperability is growing as well; therefore, trustworthy interactions between the different systems are a priority. The systems need to interact with users and other applications. The decision regarding with whom and how to interact with other users or applications depends on each application or system. In this paper, we focus on providing trust verdicts by evaluating the behaviors of different agents, making use of distributed network monitoring. This provides trust management systems with "soft trust" information regarding a trustee's experience. We propose a formal distributed network monitoring approach to analyze the packets exchanged by the entities, in order to prove that a system is acting in a trustworthy manner. Based on formal "trust properties", we analyze the systems' behaviors and then provide trust verdicts regarding those "trust properties". Furthermore, automated testing is performed using a suite of tools we have developed, and finally, our methodology is applied to a real industrial DNS use case scenario.

Journal ArticleDOI
TL;DR: A fast feature aggregating method for image copy detection which uses machine learning based hashing to achieve fast feature aggregation and generates binary codes that make image representation building low-complexity, efficient and scalable to large scale databases.
Abstract: Currently, research on content based image copy detection mainly focuses on robust feature extraction. However, due to the exponential growth of online images, it is necessary to consider searching among large scale image collections, which is very time-consuming and does not scale. Hence, we need to pay much attention to the efficiency of copy detection. In this paper, we propose a fast feature aggregating method for image copy detection which uses machine learning based hashing to achieve fast feature aggregation. Since the machine learning based hashing effectively preserves the neighborhood structure of the data, it yields visual words with strong discriminability. Furthermore, the generated binary codes make image representation building low-complexity, efficient and scalable to large scale databases. Experimental results show the good performance of our approach.
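The aggregation step can be sketched as follows; the paper learns its hash functions, whereas this illustration uses random projections as a stand-in and synthetic descriptors.

```python
# Sketch of hashing-based feature aggregation: map local descriptors to short
# binary codes and build a per-image histogram of the resulting "visual words".
# The paper learns its hash functions; random projections are used here only as
# a simple stand-in, and the descriptors are synthetic.
import numpy as np

rng = np.random.default_rng(42)
n_bits = 8
projection = rng.normal(size=(128, n_bits))         # for 128-D local descriptors

def aggregate(descriptors):
    codes = (descriptors @ projection > 0).astype(np.uint8)   # binary codes
    words = codes @ (1 << np.arange(n_bits))                   # pack bits into word ids
    hist = np.bincount(words, minlength=2 ** n_bits)
    return hist / max(hist.sum(), 1)                           # normalized representation

img_a = rng.normal(size=(300, 128))                  # hypothetical local features
img_b = img_a + 0.01 * rng.normal(size=(300, 128))   # near-duplicate copy
print(np.dot(aggregate(img_a), aggregate(img_b)))    # high similarity for copies
```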

Journal ArticleDOI
Jing Shan, Derong Shen, Tiezheng Nie, Yue Kou, Ge Yu
TL;DR: This work proposes an overlapping community search framework for group query, including both exact and heuristic solutions, and proposes two parameters, node degree and discovery power, to trade off the efficiency and quality of the heuristic strategies, in order to make them satisfy different application requirements.
Abstract: In most real life networks, such as social networks and biology networks, a node is often involved in multiple overlapping communities. Thus, overlapping community discovery has drawn a great deal of attention and there is a lot of research on it. However, most work has focused on community detection, which takes the whole network as input and derives all communities at one time. Community detection can only be used in offline analysis of networks and it is quite costly, not flexible and cannot support dynamically evolving networks. Online community search, which only finds the overlapping communities containing a given node, is a flexible and lightweight solution, and also supports dynamic graphs very well. However, some scenarios require overlapping community search for a group query, which means that the input is a set of nodes instead of a single node. To solve this problem, we propose an overlapping community search framework for group queries, including both exact and heuristic solutions. The heuristic solution has four strategies, some of which are adjustable and self-adaptive. We propose two parameters, node degree and discovery power, to trade off the efficiency and quality of the heuristic strategies, in order to make them satisfy different application requirements. Comprehensive experiments are conducted and demonstrate the efficiency and quality of both the exact and heuristic solutions.

Journal ArticleDOI
TL;DR: How declarative languages can simplify Web Application development and empower end-users as Web developers is discussed and a unified XForms-based framework that supports both client-side and server-side Web application development is introduced.
Abstract: Web Applications have become an omnipresent part of our daily lives. They are easy to use, but hard to develop. WYSIWYG editors, form builders, mashup editors, and markup authoring tools ease the development of Web Applications. However, more advanced Web Applications require server-side programming, which is beyond the skills of end-user developers. In this paper, we discuss how declarative languages can simplify Web Application development and empower end-users as Web developers. We first identify nine end-user Web Application development levels ranging from simple visual customization to advanced three-tier programming. Then, we propose expanding the presentation tier to support all aspects of Web Application development. We introduce a unified XForms-based framework, called XFormsDB, that supports both client-side and server-side Web Application development. Furthermore, we make a language extension proposal, called XFormsRTC, for adding true real-time communication capabilities to XForms. We also present the XFormsDB Integrated Development Environment (XIDE), which assists end-users in authoring highly interactive data-driven Web Applications. XIDE supports all Web Application development levels and, especially, promotes the transition from markup authoring and snippet programming to single and unified language programming.

Journal ArticleDOI
TL;DR: This paper focuses on subgraph query over a single large graph G, i.e., finding all embeddings of query Q in G and proposes a bitmap structure to index R2, which has the linear space complexity instead of exponential complexity in feature-based approaches.
Abstract: Due to its wide applications, subgraph query has attracted a lot of attention in the database community. In this paper, we focus on subgraph query over a single large graph G, i.e., finding all embeddings of query Q in G. Different from existing feature-based approaches, we map all edges into a two-dimensional space R2 and propose a bitmap structure to index R2. At run time, we find a set of adjacent edge pairs (AEP) or star-style patterns (SSP) to cover Q. We develop edge join (EJ) algorithms to address both AEP and SSP subqueries. Based on the bitmap index, our method can optimize I/O and CPU cost. More importantly, our index has linear space complexity, instead of the exponential complexity of feature-based approaches, which indicates that our index can scale well with respect to large data sizes. Furthermore, our index has light maintenance overhead, which has not been considered in most existing work. Extensive experiments show that our method significantly outperforms existing ones in both online and offline processing with respect to query response time, index building time, index size and index maintenance overhead.

Journal ArticleDOI
TL;DR: This paper assumes two application scenarios with respect to the availability of human labels and proposes a clustering-based and a classification-based approach, both of which outperform the strong baselines.
Abstract: Semi-supervised learning is a machine learning paradigm that can be applied to create pseudo labels from unlabeled data for learning a ranking model, when there are only limited or no training examples available. However, the effectiveness of semi-supervised learning in information retrieval (IR) can be hindered by low-quality pseudo labels, hence the need for training query filtering that removes low-quality queries. In this paper, we assume two application scenarios with respect to the availability of human labels. First, for applications without any labeled data available, a clustering-based approach is proposed to select high-quality training queries. This approach selects the training queries following the empirical observation that the relevant documents of high-quality training queries are highly coherent. Second, for applications with limited labeled data available, a classification-based approach is proposed. This approach learns a weak classifier to predict the retrieval performance gain of a given training query by making use of query features. The queries with high performance gains are selected for the following transduction process to create the pseudo labels for learning to rank algorithms. Experimental results on the standard LETOR dataset show that our proposed approaches outperform the strong baselines.
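The coherence intuition behind the clustering-based approach can be sketched as a mean pairwise similarity score, as below; the documents and threshold are hypothetical and the paper's actual clustering procedure is not reproduced.

```python
# Sketch of the coherence heuristic behind the clustering-based selection: score a
# training query by how similar its relevant documents are to each other and keep
# only the most coherent queries. The documents and the 0.2 threshold are
# hypothetical; the paper's actual clustering procedure is not reproduced.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coherence(docs):
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf)
    n = len(docs)
    return (sims.sum() - n) / (n * (n - 1))          # mean off-diagonal similarity

queries = {
    "q1": ["learning to rank for retrieval", "learning to rank models"],
    "q2": ["jaguar speed record", "jaguar car dealership prices"],
}
selected = [q for q, docs in queries.items() if coherence(docs) > 0.2]
print(selected)                                      # keeps only the coherent query q1
```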

Journal ArticleDOI
TL;DR: This paper provides a privacy policy-driven model using some enhanced contextual concepts of the extended Role Based Access Control model, namely the purpose, the accuracy and the consent principles, and uses the provisional context to model security rules whose activation depends on the history of previously performed actions.
Abstract: Radio Frequency IDentification (RFID) technology offers a new way of automating the identification and storing of information in RFID tags. The emerging opportunities for the use of RFID technology in human-centric applications like monitoring and indoor guidance systems indicate how important this topic is in terms of privacy. Addressing privacy issues from the early stages of RFID data collection helps to master the data view before translating it into business events and storing it in databases. An RFID middleware is the entity that sits between tag readers and database applications. It is in charge of collecting, filtering and aggregating the requested events from heterogeneous RFID environments. Thus, the system, at this point, is likely to suffer from parameter manipulation and eavesdropping, raising privacy concerns. In this paper, we propose an access and privacy controller module that adds a security level to the RFID middleware standardized by the EPCglobal consortium. We provide a privacy policy-driven model using some enhanced contextual concepts of the extended Role Based Access Control model, namely the purpose, the accuracy and the consent principles. We also use the provisional context to model security rules whose activation depends on the history of previously performed actions. To show the feasibility of our privacy enforcement model, we first provide a proof-of-concept prototype integrated into the middleware of the Fosstrak platform, then evaluate the performance of the integrated module in terms of execution time.

Journal ArticleDOI
TL;DR: Four multiple criteria mathematical programming models for advertisement clicking problems are proposed and studies show that the MCLP and KMCP models have better performance stability and can be used to effectively handle behavioral targeting application for online advertisement problems.
Abstract: In the online advertisement industry, it is important to predict potentially profitable users who will click target ads (i.e., behavioral targeting). The task selects the potential users that are likely to click the ads by analyzing users' clicking/web browsing information and displaying the most relevant ads to them. This paper proposes four multiple criteria mathematical programming models for advertisement clicking problems. The first two are multi-criteria linear regression (MCLR) and kernel-based multiple criteria regression (KMCR) algorithms for click-through rate (CTR) prediction. The second two are multi-criteria linear programming (MCLP) and kernel-based multiple criteria programming (KMCP) algorithms, which are used to predict ad-clicking events, such as identifying clicked ads in a set of ads. Using the experimental datasets from KDD Cup 2012, the paper first conducts a comparison of the proposed MCLR and KMCR with support vector regression (SVR) and logistic regression (LR), which shows that both MCLR and KMCR are good alternatives. Then the paper further studies the performance of the proposed MCLP and KMCP algorithms against known algorithms, including support vector machines (SVM), LR, radial basis function network (RBFN), the k-nearest neighbor algorithm (KNN) and Naive Bayes (NB), in both prediction and selection processes. The studies show that the MCLP and KMCP models have better performance stability and can be used to effectively handle behavioral targeting applications for online advertisement problems.

Journal ArticleDOI
TL;DR: A new model to merge Web page content into semantic blocks by simulating human perception is proposed, and its efficiency is compared to that of a state-of-the-art algorithm, VIPS.
Abstract: Semantic block identification is an approach to retrieving information from Web pages and applications. As Website design evolves, however, traditional methodologies cannot perform well any more. This paper proposes a new model to merge Web page content into semantic blocks by simulating human perception. A "layer tree" is constructed to remove hierarchical inconsistencies between the DOM tree representation and the visual layout of the Web page. Subsequently, the Gestalt laws of grouping are interpreted as the rules for semantic block detection. During interpretation, the normalized Hausdorff distance, the CIE-Lab color difference, the normalized compression distance, and a series of visual information cues are proposed to operationalize these Gestalt laws. Finally, a classifier is trained to combine each operationalized law into a unified rule for identifying semantic blocks from the Web page. Experiments are conducted to compare the efficiency of the model with a state-of-the-art algorithm, VIPS. The comparison results of the first experiment show that the GLM model generates more "true positives" and fewer "false negatives" than VIPS. The next experiment, on a large-scale test set, produces an average precision of 90.53 % and a recall rate of 90.85 %, which is approximately 25 % better than that of VIPS.
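One of the operationalized cues, the normalized Hausdorff distance, can be sketched as follows; the normalization by the page diagonal and the block geometry are assumptions, not the paper's exact formulation.

```python
# Sketch of one operationalized Gestalt cue: a normalized Hausdorff distance
# between the corner points of two layout blocks, usable as a proximity signal.
# The normalization by the page diagonal and the block geometry are assumptions;
# the trained classifier that combines the cues is not shown.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def block_corners(x, y, w, h):
    return np.array([(x, y), (x + w, y), (x, y + h), (x + w, y + h)], dtype=float)

def normalized_hausdorff(block_a, block_b, page_diagonal):
    d_ab = directed_hausdorff(block_a, block_b)[0]
    d_ba = directed_hausdorff(block_b, block_a)[0]
    return max(d_ab, d_ba) / page_diagonal           # 0 = overlapping, ~1 = far apart

header = block_corners(0, 0, 960, 80)
nav_bar = block_corners(0, 80, 960, 40)
print(normalized_hausdorff(header, nav_bar, np.hypot(960, 2000)))
```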

Journal ArticleDOI
TL;DR: This work proposes a data fusion approach that integrates historical and tag data into complex networks, resorts to a diffusion kernel to measure the strength of associations between users and objects, and adopts Fisher's combined probability test to obtain the statistical significance of such associations for personalized recommendations.
Abstract: The last few years have witnessed an explosion of information caused by the exponential growth of the Internet and World Wide Web, which has confronted us with information overload and brought about an era of big data, calling for efficient personalized recommender systems to assist in screening useful information from various sources. For a recommender system with more than the fundamental object-user rating information, accessory information such as tags can be exploited and integrated into final ranking lists to improve recommendation performance. However, although existing studies have demonstrated that tags, as an additional yet useful resource, can be designed to improve recommendation performance, most network-based approaches treat users, objects and tags as two bipartite graphs, or a tripartite graph, and therefore overlook either the important information among homogeneous nodes in each sub-graph, or the bipartite relations between users, objects or tags. Moreover, recent studies have suggested that the filtration of weak relationships in networks may reasonably enhance the recommendation performance of collaborative filtering methods, and it has also been demonstrated that approaches based on diffusion processes can more effectively capture relationships between objects and users, hence exhibiting higher performance than a typical collaborative filtering method. Based on these understandings, we propose a data fusion approach that integrates historical and tag data towards personalized recommendations. Our method converts historical and tag data into complex networks, resorts to a diffusion kernel to measure the strength of associations between users and objects, and adopts Fisher's combined probability test to obtain the statistical significance of such associations for personalized recommendations. We validate our approach via 10-fold cross-validation experiments. Results show that our method outperforms existing methods not only in recommendation accuracy and diversity, but also in retrieval performance. We further show the robustness of our method to the related parameters.
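The two named building blocks, the diffusion kernel and Fisher's combined probability test, can be sketched as follows; the graph, the diffusion parameter and the p-values are hypothetical, and the full fusion pipeline is omitted.

```python
# Sketch of the two building blocks named above: a diffusion kernel over a small
# user-object network and Fisher's combined probability test. The adjacency
# matrix, beta and the p-values are hypothetical; the full fusion pipeline is omitted.
import numpy as np
from scipy.linalg import expm
from scipy.stats import chi2

# Diffusion kernel K = exp(-beta * L), with L = D - A the graph Laplacian
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
K = expm(-0.5 * L)                     # beta = 0.5; K[u, o] = association strength
print(K[0, 3])

# Fisher's method: combine p-values from independent evidence channels
p_values = [0.04, 0.20, 0.07]          # e.g. historical-data and tag-data channels
statistic = -2.0 * np.sum(np.log(p_values))
print(chi2.sf(statistic, df=2 * len(p_values)))      # combined significance
```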

Journal ArticleDOI
TL;DR: The smallest k-compact tree set is defined as the keyword query result, where no shared graph node exists between any two compact trees, and a progressive A* based scalable solution using MapReduce is developed to compute this set.
Abstract: Keyword search is integrated into many applications on account of its convenience in conveying users' query intention. Most existing works on keyword search over graphs model the query results as individual minimal connected trees or connected graphs that contain the keywords. We observe that significant overlap may exist among those query results, which affects result diversification. Besides, most solutions require accessing graph data and pre-built indexes in memory, which is not suitable for processing big datasets. In this paper, we define the smallest k-compact tree set as the keyword query result, where no shared graph node exists between any two compact trees. We then develop a progressive A* based scalable solution using MapReduce to compute the smallest k-compact tree set, where the computation process can be stopped once the generated compact tree set is sufficient to compute the keyword query result. We conduct experiments to show the efficiency of our proposed algorithm.

Journal ArticleDOI
TL;DR: This paper proposes a new natural hazard sub-event discovery model, SED (Sub-Events Discovery), which adopts multifarious features to detect sub-events, and introduces a novel SER algorithm for time-stamped social media data that makes use of automatically obtained messages from external search engines.
Abstract: Social media sites contain a considerable amount of data on natural calamity events, such as earthquakes, snowstorms, and mud-rock flows. With the increasing amount of social media data, an important task is to discover and retrieve sub-events over time. Especially in emergency situations, rescue and relief activities can be enhanced by identifying and retrieving the sub-events of a natural hazard event. However, the existing event detection techniques for news-related reports cannot work effectively for social media data due to the unstructured nature of social network data. In this paper, we propose a new natural hazard sub-event discovery model, SED (Sub-Events Discovery), which adopts multifarious features to detect sub-events. Moreover, in order to retrieve the sub-events of a specific event, we introduce a novel SER (Sub-Event Retrieval) algorithm for time-stamped social media data. Our approach SER makes use of automatically obtained messages from external search engines in the entire process. For the purpose of determining the periodical convergence time for a natural hazard event, our method provides online sub-event retrieval and sub-event discovery to meet further needs. Finally, improved estimation standards with timestamps are utilized in our experiments to verify the effectiveness and efficiency of the SED model and the SER algorithm.