
Showing papers on "Graph database published in 2009"


Journal ArticleDOI
01 Jul 2009
TL;DR: If useful graph operations can be decomposed into MapReduce cycles, there is strong incentive to seriously consider cloud computing; the approach also offers a way to handle a large graph on a single machine that can't hold the entire graph, and it enables streaming graph processing.
Abstract: As the size of graphs for analysis continues to grow, methods of graph processing that scale well have become increasingly important. One way to handle large datasets is to disperse them across an array of networked computers, each of which implements simple sorting and accumulating, or MapReduce, operations. This cloud computing approach offers many attractive features. If useful graph operations can be decomposed into MapReduce cycles, there is strong incentive to seriously consider cloud computing. Moreover, it offers a way to handle a large graph on a single machine that can't hold the entire graph, and it enables streaming graph processing. This article examines this possibility.
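The article itself gives no code; the following is a minimal sketch (all names are illustrative) of what one MapReduce cycle of a graph operation looks like, using breadth-first search as the example. Each cycle expands the BFS frontier by one hop; repeating the cycle until no distance changes completes the traversal.

```python
from collections import defaultdict

def map_phase(node, dist, neighbors):
    """Emit the node's own record plus tentative distances for its neighbors."""
    yield node, ('node', dist, neighbors)
    if dist is not None:
        for n in neighbors:
            yield n, ('dist', dist + 1)

def reduce_phase(node, values):
    """Keep the adjacency list and the smallest distance seen for the node."""
    best, adj = None, []
    for v in values:
        if v[0] == 'node':
            if v[1] is not None and (best is None or v[1] < best):
                best = v[1]
            adj = v[2]
        elif best is None or v[1] < best:
            best = v[1]
    return node, best, adj

def bfs_cycle(graph):
    """graph: {node: (dist_or_None, [neighbors])} -> same shape, one hop later."""
    grouped = defaultdict(list)          # shuffle: group map output by key
    for node, (dist, adj) in graph.items():
        for key, val in map_phase(node, dist, adj):
            grouped[key].append(val)
    out = {}
    for node, values in grouped.items():
        n, d, a = reduce_phase(node, values)
        out[n] = (d, a)
    return out

g = {'a': (0, ['b']), 'b': (None, ['c']), 'c': (None, [])}
g = bfs_cycle(g)   # 'b' reached at distance 1
g = bfs_cycle(g)   # 'c' reached at distance 2
```

In a real deployment the shuffle and grouping would be done by the MapReduce framework across machines; the per-node records stream through the map and reduce phases, which is what makes the single-machine streaming case possible.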

441 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: Three novel methods to compute upper and lower bounds for the edit distance between two graphs in polynomial time are introduced; results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs.
Abstract: Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision, etc. Unfortunately, the problem of graph edit distance computation is NP-Hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms AppFull and AppSub are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.
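The paper's specific bounding methods are not reproduced here, but the flavor of a polynomial-time lower bound can be illustrated with a classic, much simpler one: comparing the vertex-label multisets of the two graphs. Each edit operation (relabel, insert, delete) fixes at most one unmatched label on each side, so the larger of the two unmatched counts lower-bounds the true edit distance.

```python
from collections import Counter

def label_lower_bound(labels_g1, labels_g2):
    """Cheap lower bound on graph edit distance from vertex-label multisets."""
    c1, c2 = Counter(labels_g1), Counter(labels_g2)
    only_in_g1 = sum((c1 - c2).values())   # labels g2 lacks
    only_in_g2 = sum((c2 - c1).values())   # labels g1 lacks
    # One edit operation resolves at most one unmatched label per side.
    return max(only_in_g1, only_in_g2)

# Two small molecule-like label sets: C2ON vs. CO2 differ in >= 2 edits.
lb = label_lower_bound(['C', 'C', 'O', 'N'], ['C', 'O', 'O'])
```

A bound like this is what makes the filtering step in AppFull/AppSub-style search cheap: any database graph whose bound already exceeds the distance threshold can be discarded without the NP-hard exact computation.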

413 citations


Journal ArticleDOI
F. van Ham, Adam Perer
TL;DR: It is shown how Furnas' original degree of interest function can be adapted from trees to graphs and how it can be used to extract useful contextual subgraphs, control the complexity of the generated visualization and direct users to interesting datapoints in the context.
Abstract: A common goal in graph visualization research is the design of novel techniques for displaying an overview of an entire graph. However, there are many situations where such an overview is not relevant or practical for users, as analyzing the global structure may not be related to the main task of the users that have semi-specific information needs. Furthermore, users accessing large graph databases through an online connection or users running on less powerful (mobile) hardware simply do not have the resources needed to compute these overviews. In this paper, we advocate an interaction model that allows users to remotely browse the immediate context graph around a specific node of interest. We show how Furnas' original degree of interest function can be adapted from trees to graphs and how we can use this metric to extract useful contextual subgraphs, control the complexity of the generated visualization and direct users to interesting datapoints in the context. We demonstrate the effectiveness of our approach with an exploration of a dense online database containing over 3 million legal citations.
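Furnas' original formulation scores a node by an a priori interest term minus its distance to the focus; the abstract describes adapting this from trees to graphs. A minimal sketch of that idea (the scoring weights and names here are illustrative, not the paper's exact formulation) replaces tree depth with hop distance:

```python
from collections import deque

def hop_distances(adj, focus):
    """BFS hop distance from the focus node to every reachable node."""
    dist = {focus: 0}
    q = deque([focus])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def doi_subgraph(adj, api, focus, budget):
    """DOI(x | focus) = API(x) - dist(x, focus); keep the `budget` best nodes."""
    dist = hop_distances(adj, focus)
    doi = {n: api.get(n, 0) - d for n, d in dist.items()}
    return sorted(doi, key=doi.get, reverse=True)[:budget]

adj = {'f': ['a', 'b'], 'a': ['f', 'c'], 'b': ['f'], 'c': ['a']}
api = {'f': 5, 'a': 1, 'b': 0, 'c': 3}   # a priori interest scores
ctx = doi_subgraph(adj, api, 'f', 3)     # context subgraph around focus 'f'
```

The budget parameter is what bounds the complexity of the generated visualization: only the highest-DOI nodes around the focus are fetched and drawn, regardless of the size of the underlying database.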

237 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: This paper first transforms the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space, and proposes several pruning strategies and a join order selection method to perform the join processing efficiently.
Abstract: The growing popularity of graph databases has generated interesting data management problems, such as subgraph search, shortest-path query, reachability verification, and pattern match. Among these, a pattern match query is more flexible compared to a subgraph search and more informative compared to a shortest-path or reachability query. In this paper, we address pattern match problems over a large data graph G. Specifically, given a pattern graph (i.e., query Q), we want to find all matches (in G) that have similar connections to those in Q. In order to reduce the search space significantly, we first transform the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space. We also propose several pruning strategies and a join order selection method to perform the join processing efficiently. Extensive experiments on both real and synthetic datasets show that our method outperforms existing ones by orders of magnitude.
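The paper's actual embedding and join algorithms are more involved, but the core idea of a distance-preserving graph embedding can be sketched as follows (a standard landmark construction, used here for illustration only): map each vertex to its vector of hop distances to a few landmark vertices. By the triangle inequality, the Chebyshev distance between two vectors lower-bounds the true shortest-path distance, which is exactly the kind of guarantee that enables distance-based pruning in the join.

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src to all nodes (graph assumed connected)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def embed(adj, landmarks):
    """Vertex -> tuple of hop distances to each landmark."""
    tables = [bfs_dist(adj, l) for l in landmarks]
    return {v: tuple(t[v] for t in tables) for v in adj}

def dist_lower_bound(p, q):
    # |d(u,l) - d(v,l)| <= d(u,v) for every landmark l, so take the max.
    return max(abs(a - b) for a, b in zip(p, q))

adj = {'u': ['v'], 'v': ['u', 'w'], 'w': ['v', 'x'], 'x': ['w']}
emb = embed(adj, ['u', 'x'])
lb = dist_lower_bound(emb['u'], emb['x'])   # true distance u-x is 3
```

Candidate vertex pairs whose vector-space bound already violates the pattern's connection constraints can be pruned before any shortest-path computation on G.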

231 citations


Proceedings ArticleDOI
29 Mar 2009
TL;DR: This work proposes a highly scalable technique, called GraphSig, to mine significant subgraphs from large graph databases, and develops a classifier using patterns mined by GraphSig, which achieves superior performance over state-of-the-art classifiers.
Abstract: Graphs are being increasingly used to model a wide range of scientific data. Such widespread usage of graphs has generated considerable interest in mining patterns from graph databases. While an array of techniques exists to mine frequent patterns, we still lack a scalable approach to mine statistically significant patterns, specifically patterns with low p-values, that occur at low frequencies. We propose a highly scalable technique, called GraphSig, to mine significant subgraphs from large graph databases. We convert each graph into a set of feature vectors where each vector represents a region within the graph. Domain knowledge is used to select a meaningful feature set. Prior probabilities of features are computed empirically to evaluate statistical significance of patterns in the feature space. Following analysis in the feature space, only a small portion of the exponential search space is accessed for further analysis. This enables the use of existing frequent subgraph mining techniques to mine significant patterns in a scalable manner even when they are infrequent. Extensive experiments are carried out on the proposed techniques, and empirical results demonstrate that GraphSig is effective and efficient for mining significant patterns. To further demonstrate the power of significant patterns, we develop a classifier using patterns mined by GraphSig. Experimental results show that the proposed classifier achieves superior performance, both in terms of quality and computation cost, over state-of-the-art classifiers.

165 citations


Journal ArticleDOI
TL;DR: A novel graph-based learning framework in the setting of semi-supervised learning with multiple labels is proposed, characterized by simultaneously exploiting the inherent correlations among multiple labels and the label consistency over the graph.

156 citations


Journal ArticleDOI
TL;DR: This work presents a fully implemented graph visualization system, called CGV (Coordinated Graph Visualization), which places particular emphasis on interaction and has been designed as a dual-use system that can run as a stand-alone application or as an applet in a web browser.

119 citations


Patent
19 Mar 2009
TL;DR: A graph query engine may receive a graph-based query and convert it into a relational query language for execution by the relational database.
Abstract: In one example, information may be stored in a relational database. The information in the database may define a graph, in the sense that the information may define a set of entities and relations between the entities. A user may want to query the information using a graph-based query language. A graph query engine may receive the query, and may convert the query into a relational query language, for execution by the relational database. The relational database may calculate views of the underlying tables. Each view corresponds to a particular relation, and the rows in each view are pairs of entities to which the relation applies. Since the views correspond very closely to the specification of a graph, the graph-based query may be translated into a relational query that performs relational algebraic operations on the views in order to answer the graph-based query.

79 citations


Book ChapterDOI
06 Nov 2009
TL;DR: This paper proposes the DOGMA index for fast subgraph matching on disk and develops a basic algorithm to answer queries over this index, which is then significantly sped up via an optimized algorithm that uses efficient (but correct) pruning strategies when combined with two different extensions of the index.
Abstract: RDF is an increasingly important paradigm for the representation of information on the Web. As RDF databases increase in size to approach tens of millions of triples, and as sophisticated graph matching queries expressible in languages like SPARQL become increasingly important, scalability becomes an issue. To date, there is no graph-based indexing method for RDF data where the index was designed in a way that makes it disk-resident. There is therefore a growing need for indexes that can operate efficiently when the index itself resides on disk. In this paper, we first propose the DOGMA index for fast subgraph matching on disk and then develop a basic algorithm to answer queries over this index. This algorithm is then significantly sped up via an optimized algorithm that uses efficient (but correct) pruning strategies when combined with two different extensions of the index. We have implemented a preliminary system and tested it against four existing RDF database systems developed by others. Our experiments show that our algorithm performs very well compared to these systems, with orders of magnitude improvements for complex graph queries.

76 citations


Proceedings ArticleDOI
02 Nov 2009
TL;DR: The frequent subgraph pattern mining problem is formalized by designing a new measure called expected support, and an approximate mining algorithm is proposed to find an approximate set of frequent subgraph patterns by allowing an error tolerance on the expected supports of the discovered subgraph patterns.
Abstract: Graph data are subject to uncertainties in many applications due to incompleteness and imprecision of data. Mining uncertain graph data is semantically different from and computationally more challenging than mining exact graph data. This paper investigates the problem of mining frequent subgraph patterns from uncertain graph data. The frequent subgraph pattern mining problem is formalized by designing a new measure called expected support. An approximate mining algorithm is proposed to find an approximate set of frequent subgraph patterns by allowing an error tolerance on the expected supports of the discovered subgraph patterns. The algorithm uses an efficient approximation algorithm to determine whether a subgraph pattern can be output or not. The analytical and experimental results show that the algorithm is very efficient, accurate and scalable for large uncertain graph databases.
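To make "expected support" concrete, here is a deliberately simplified illustration (not the paper's algorithm, which must handle multiple overlapping embeddings, where computing the exact existence probability is hard): assume independent edge probabilities and a single known embedding of the pattern per uncertain graph. The pattern's existence probability in a graph is then the product of the embedded edges' probabilities, and its expected support is the average of these over the database.

```python
def existence_prob(edge_probs):
    """P(pattern exists) for one embedding with independent edge probabilities."""
    p = 1.0
    for ep in edge_probs:
        p *= ep
    return p

def expected_support(embeddings_per_graph):
    """embeddings_per_graph: one list per database graph holding the edge
    probabilities the pattern maps onto (empty list = no occurrence)."""
    total = sum(existence_prob(e) if e else 0.0 for e in embeddings_per_graph)
    return total / len(embeddings_per_graph)

# Pattern occurs in graph 1 via edges with probabilities 0.9 and 0.8,
# in graph 2 via a single 0.5-probability edge, and not at all in graph 3.
es = expected_support([[0.9, 0.8], [0.5], []])
```

The error tolerance in the paper's algorithm relaxes exactly this quantity: a pattern is reported if its expected support is provably within the tolerance of the threshold, avoiding exact probability computation.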

66 citations


Journal ArticleDOI
TL;DR: An efficient index, FG*-index, is proposed, including a FAQ-index that is dynamically constructed from the set of Frequently Asked non-FG-Queries (FAQs); using the FAQ-index, verification is not required for processing FAQs, and only a small number of candidates need to be verified for processing non-FG-queries that are not frequently asked.
Abstract: We study the problem of processing subgraph queries on a database that consists of a set of graphs. The answer to a subgraph query is the set of graphs in the database that are supergraphs of the query. In this article, we propose an efficient index, FG*-index, to solve this problem. The cost of processing a subgraph query using most existing indexes mainly consists of two parts: the index probing cost and the candidate verification cost. Index probing finds the query in the index, or finds the graphs from which we can generate a candidate answer set for the query. Candidate verification tests whether each graph in the candidate set is indeed a supergraph of the query. We design FG*-index to minimize these two costs as follows. FG*-index consists of three components: the FG-index, the feature-index, and the FAQ-index. First, the FG-index employs the concept of Frequent subGraph (FG) to allow the set of queries that are FGs to be answered without candidate verification. We call this set of queries FG-queries. We can enlarge the set of FG-queries so that more queries can be answered without candidate verification; however, a larger set of FG-queries implies a larger FG-index, and hence the index probing cost also increases. We propose the feature-index to reduce the index probing cost. The feature-index uses features to filter false results that are matched in the FG-index, so that we can quickly find the truly matching graphs for a query. For processing non-FG-queries, we propose the FAQ-index, which is dynamically constructed from the set of Frequently Asked non-FG-Queries (FAQs). Using the FAQ-index, verification is not required for processing FAQs, and only a small number of candidates need to be verified for processing non-FG-queries that are not frequently asked.
Finally, a comprehensive set of experiments verifies that query processing using FG*-index is up to orders of magnitude more efficient than with state-of-the-art indexes, and that it is also more scalable.

Patent
12 Mar 2009
TL;DR: In this article, a large open database of information has entries for commonly understood data, such as people, places and objects, which are referred to as topics, which have a type system and contains attributes and relationships between topics.
Abstract: A large open database of information has entries for commonly understood data, such as people, places and objects, which are referred to as topics. The database has a type system and contains attributes and relationships between topics. The invention also comprises a powerful query language, an open API to access the data, and a website where contributors can update the data or add new topics and relationships. The elements of the invention comprise a scalable graph database, a dynamic user-contributed schema representation, a tree-based object/property query language, a series of new Web service APIs, and a set of AJAX dynamic HTML technologies.

Proceedings ArticleDOI
24 Mar 2009
TL;DR: An optimal compact method for organizing graph databases is proposed, a novel algorithm of testing subgraph isomorphisms from multiple graphs to one graph is presented, and a query processing method is proposed based on these techniques.
Abstract: In recent years, large amounts of data modeled by graphs, namely graph data, have been collected in various domains. Efficiently processing queries on graph databases has attracted a lot of research attention. The supergraph query is a new and practically important kind of query: a supergraph query, q, on a graph database D retrieves all graphs in D such that q is a supergraph of them. Because the number of graphs in databases is large and subgraph isomorphism testing is NP-complete, efficiently processing such queries is a big challenge. This paper first proposes an optimal compact method for organizing graph databases. Common subgraphs of the graphs in a database are stored only once in the compact organization, in order to reduce the overall cost of subgraph isomorphism tests from stored graphs to queries during query processing. Then, an exact algorithm and an approximate algorithm for generating a significant feature set with an optimal order are proposed to construct indices on graph databases; the optimal order on the feature set reduces the number of subgraph isomorphism tests during query processing. Based on the compact organization of graph databases, a novel algorithm for testing subgraph isomorphisms from multiple graphs to one graph is presented. Finally, based on all these techniques, a query processing method is proposed. Analytical and experimental results show that the proposed algorithms outperform existing similar algorithms by one to two orders of magnitude.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: This work proposes a Three-way Triple Tree (TripleT) secondary memory indexing technique to facilitate flexible and efficient join evaluation on RDF data and shows that TripleT exhibits multiple orders of magnitude improvement over the state-of-the-art, in terms of both storage and query processing costs.
Abstract: Current approaches to RDF graph indexing suffer from weak data locality, i.e., information regarding a piece of data appears in multiple locations, spanning multiple data structures. Weak data locality negatively impacts storage and query processing costs. Towards stronger data locality, we propose a Three-way Triple Tree (TripleT) secondary memory indexing technique to facilitate flexible and efficient join evaluation on RDF data. The novelty of TripleT is that the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and, the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state-of-the-art, in terms of both storage and query processing costs.

Proceedings ArticleDOI
20 Apr 2009
TL;DR: A scalable algorithm is presented that enumerates all maximal bipartite cliques (bicliques) from a click-through graph and computes an equivalence set of queries (i.e., a query cluster) from the maximal bicliques.
Abstract: In this paper we describe the problem of discovering query clusters from a click-through graph of web search logs. The graph consists of a set of web search queries, a set of pages selected for the queries, and a set of directed edges, each connecting a query node to a page node clicked by a user for the query. The proposed method extracts all maximal bipartite cliques (bicliques) from the click-through graph and computes an equivalence set of queries (i.e., a query cluster) from the maximal bicliques. A cluster of queries is formed from the queries in a biclique. We present a scalable algorithm that enumerates all maximal bicliques from the click-through graph. We have conducted experiments on Yahoo web search queries and the results are promising.
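The special case that makes the biclique-to-cluster step intuitive can be sketched directly (this is a simplification of the paper's general biclique enumeration, and the queries and pages are made up): queries clicked through to exactly the same set of pages form one side of a biclique, so grouping queries by their clicked-page set yields equivalence clusters.

```python
from collections import defaultdict

def query_clusters(clicks):
    """clicks: {query: set_of_clicked_pages} -> list of query clusters.

    Queries sharing an identical clicked-page set are equivalent: together
    with those pages they form a complete bipartite subgraph (biclique)."""
    groups = defaultdict(list)
    for query, pages in clicks.items():
        groups[frozenset(pages)].append(query)
    return [sorted(qs) for qs in groups.values() if len(qs) > 1]

clicks = {
    'cheap flights': {'kayak.com', 'expedia.com'},
    'flight deals':  {'kayak.com', 'expedia.com'},
    'weather':       {'weather.com'},
}
clusters = query_clusters(clicks)
```

The hard part the paper addresses is the general case, where a maximal biclique's query side need not have identical click sets overall, and enumerating all maximal bicliques scalably is non-trivial.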

Proceedings ArticleDOI
29 Mar 2009
TL;DR: A light-weight yet effective feature structure called the Node-Neighbor Tree is proposed to filter false candidate query-stream pairs, together with two methods to efficiently check dominant relationships in the projected space.
Abstract: Search over graph databases has attracted much attention recently due to its usefulness in many fields, such as the analysis of chemical compounds, intrusion detection in network traffic data, and pattern matching over users' visiting logs. However, most of the existing work focuses on search over static graph databases while in many real applications graphs are changing over time. In this paper we investigate a new problem on continuous subgraph pattern search under the situation where multiple target graphs are constantly changing in a stream style, namely the subgraph pattern search over graph streams. Obviously the proposed problem is a continuous join between query patterns and graph streams where the join predicate is the existence of subgraph isomorphism. Due to the NP-completeness of subgraph isomorphism checking, to achieve the real time monitoring of the existence of certain subgraph patterns, we would like to avoid using subgraph isomorphism verification to find the exact query-stream subgraph isomorphic pairs but to offer an approximate answer that could report all probable pairs without missing any of the actual answer pairs. In this paper we propose a light-weight yet effective feature structure called Node-Neighbor Tree to filter false candidate query-stream pairs. To reduce the computational cost, we further project the feature structures into a numerical vector space and conduct dominant relationship checking in the projected space. We propose two methods to efficiently check dominant relationships and substantiate our methods with extensive experiments.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: A data model and a query language for facilitating the analysis of networks, providing a closure property in which the output of every query can be stored in the database and used for further querying.
Abstract: With more and more large networks becoming available, mining and querying such networks are increasingly important tasks that are not supported by existing database models and query languages. This paper aims to alleviate this situation by proposing a data model and a query language for facilitating the analysis of networks. Key features include support for executing external tools on the networks, flexible contexts on the network (each resulting in a different graph), and primitives for querying subgraphs (including paths) and transforming graphs. The data model provides a closure property, in which the output of every query can be stored in the database and used for further querying.

Proceedings ArticleDOI
06 Dec 2009
TL;DR: This work shows that for a specific parametric graph model, the Kronecker graph model, one can construct an estimator of the true parameter in a way that both satisfies the rigorous requirements of differential privacy and is asymptotically efficient in the statistical sense.
Abstract: We consider the problem of making graph databases such as social network structures available to researchers for knowledge discovery while providing privacy to the participating entities. We show that for a specific parametric graph model, the Kronecker graph model, one can construct an estimator of the true parameter in a way that both satisfies the rigorous requirements of differential privacy and is asymptotically efficient in the statistical sense. The estimator, which may then be published, defines a probability distribution on graphs. Sampling such a distribution yields a synthetic graph that mimics important properties of the original sensitive graph and, consequently, could be useful for knowledge discovery.

Proceedings ArticleDOI
29 Mar 2009
TL;DR: This work partitions the graph into a set of communities based on the concept of modularity, where each community naturally becomes the context of the nodes within it, and extends the connection to the inter-community level by utilizing the community hierarchy relation.
Abstract: Given a large graph and a set of objects, the task of object connection discovery is to find a subgraph that retains the best connection between the objects. Object connection discovery is useful to many important applications such as discovering the connection between different terrorist groups for counter-terrorism operations. Existing work considers only the connection between individual objects; however, in many real problems the objects usually have a context (e.g., a terrorist belongs to a terrorist group). We identify the context for the nodes in a large graph. We partition the graph into a set of communities based on the concept of modularity, where each community naturally becomes the context of the nodes within it. By considering the context, we also significantly improve the efficiency of object connection discovery, since we break down the big graph into much smaller communities. We first compute the best intra-community connection by maximizing the amount of information flow in the answer graph. Then, we extend the connection to the inter-community level by utilizing the community hierarchy relation, while the quality of the inter-community connection is also ensured by modularity. Our experiments show that our algorithm is three orders of magnitude faster than the state-of-the-art algorithm, while the quality of the query answer is comparable.
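The modularity score the partitioning is based on can be stated compactly. For an undirected graph with m edges, Q = Σ_c (e_c/m − (d_c/(2m))²), where e_c is the number of intra-community edges and d_c the total degree inside community c. A direct implementation of that formula (illustrative; the paper's partitioning algorithm itself is not shown) looks like:

```python
def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: {node: community_id}."""
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        degree[community[u]] = degree.get(community[u], 0) + 1
        degree[community[v]] = degree.get(community[v], 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    return sum(intra.get(c, 0) / m - (degree.get(c, 0) / (2 * m)) ** 2
               for c in set(community.values()))

# Two triangles joined by one bridge edge, split along the bridge.
edges = [('a', 'b'), ('b', 'c'), ('c', 'a'),
         ('c', 'd'),
         ('d', 'e'), ('e', 'f'), ('f', 'd')]
community = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
q = modularity(edges, community)   # positive: a good split
```

A high Q means edges concentrate inside communities relative to a random graph with the same degrees, which is why each community serves as a natural context for its nodes.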

Patent
04 Jun 2009
TL;DR: In this article, the authors present methods and systems to simultaneously create software applications and documents compatible to all electronic and print platforms from a single database source using graph database technology, including a graph data structure having two or more points of data interconnected by a semantic relationship.
Abstract: The present invention provides methods and systems to simultaneously create software applications and documents compatible to all electronic and print platforms from a single database source using graph database technology. The methods and systems may include a graph data structure having two or more points of data interconnected by a semantic relationship, and a transform for the conversion of the graph data structure into a platform data structure specific to a native platform application. The semantic relationship enables processing of the graph data structure into a plurality of distinct media. The platform data structure may include information to represent the data on the platform. Changes to the data may be made in the source graph data structure in order to update one or more platform data structures.

Proceedings ArticleDOI
24 Mar 2009
TL;DR: A graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing, and a hash table is utilized to support efficient storage and fast search of the extracted local features.
Abstract: Structured data, including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others. Most of the current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contain the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions have (i) high computational complexity and (ii) non-trivial difficulty being indexed in a graph database. Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and its neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification.
Most importantly, the new similarity measurement and the index structure are scalable to large databases, with smaller indexing size, faster index construction time, and faster query processing time as compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
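A minimal sketch in the spirit of G-hash (the feature choice below is illustrative, not the paper's exact feature set): describe each node by its own label plus the sorted multiset of its neighbors' labels, collect those descriptors in a hash table, and score two graphs by how many hashed features they share.

```python
from collections import Counter

def node_features(labels, adj):
    """One (label, sorted-neighbor-labels) descriptor per node, as a multiset.
    labels: {node: label}; adj: {node: [neighbors]}."""
    feats = []
    for v, lab in labels.items():
        neigh = tuple(sorted(labels[u] for u in adj[v]))
        feats.append((lab, neigh))
    return Counter(feats)            # the Counter plays the hash-table role

def kernel(g1, g2):
    """Size of the intersection of the two feature multisets."""
    return sum((node_features(*g1) & node_features(*g2)).values())

# Two isomorphic two-node 'molecules': a C-O bond under different node ids.
g1 = ({'1': 'C', '2': 'O'}, {'1': ['2'], '2': ['1']})
g2 = ({'a': 'C', 'b': 'O'}, {'a': ['b'], 'b': ['a']})
sim = kernel(g1, g2)
```

Because each node contributes a hashable key, a query graph's k nearest neighbors can be found by probing the hash table per feature instead of comparing the query against every database graph.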

Proceedings ArticleDOI
28 Jun 2009
TL;DR: The discovered graph-rewriting rules show how biological networks change over time, and the transformation rules show the repeated patterns in the structural changes.
Abstract: Our dynamic graph-based relational mining approach has been developed to learn structural patterns in biological networks as they change over time. The analysis of dynamic networks is important not only to understand life at the system-level, but also to discover novel patterns in other structural data. Most current graph-based data mining approaches overlook dynamic features of biological networks, because they are focused on only static graphs. Our approach analyzes a sequence of graphs and discovers rules that capture the changes that occur between pairs of graphs in the sequence. These rules represent the graph rewrite rules that the first graph must go through to be isomorphic to the second graph. Then, our approach feeds the graph rewrite rules into a machine learning system that learns general transformation rules describing the types of changes that occur for a class of dynamic biological networks. The discovered graph-rewriting rules show how biological networks change over time, and the transformation rules show the repeated patterns in the structural changes. In this paper, we apply our approach to biological networks to evaluate our approach and to understand how the biosystems change over time. We evaluate our results using coverage and prediction metrics, and compare to biological literature.

Book ChapterDOI
16 Mar 2009
TL;DR: This paper proposes a novel, decomposition-based and selectivity-aware SQL translation mechanism for sub-graph search queries on relational database management systems, and carefully exploits existing database functionality such as partitioned B-tree indexes and selectivity annotations that influence the relational query optimizer, so as to reduce secondary-storage access costs to a minimum.
Abstract: Graphs are widely used for modelling complicated data such as chemical compounds, protein interactions, XML documents and multimedia. Retrieving related graphs containing a query graph from a large graph database is a key issue in many graph-based applications such as drug discovery and structural pattern recognition. Relational database management systems (RDBMSs) have repeatedly been shown to be able to efficiently host types of data which were not formerly anticipated to reside within relational databases, such as complex objects and XML data. The key advantages of relational database systems are their well-known maturity and their ability to scale to handle vast amounts of data very efficiently. RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model, such as sortedness, proper join ordering and powerful indexing mechanisms. In this paper, we study the problem of indexing and querying graph databases using the relational infrastructure. We propose a novel, decomposition-based and selectivity-aware SQL translation mechanism for sub-graph search queries. Moreover, we carefully exploit existing database functionality such as partitioned B-tree indexes, and we influence the relational query optimizer through selectivity annotations to reduce the access costs of the secondary storage to a minimum. Finally, our experiments utilise an IBM DB2 RDBMS as a concrete example to confirm that relational database systems can be used as an efficient and very scalable processor for sub-graph queries.

Proceedings Article
01 Jan 2009
TL;DR: An efficient algorithm is proposed, TopCor, which mines the top-k correlative graphs by exploring only the candidate graphs in the projected database of a query graph; three key techniques are developed: an effective correlation-checking mechanism, a powerful pruning criterion, and a set of useful rules for candidate exploration.
Abstract: Correlation mining has been widely studied due to its ability to discover underlying occurrence dependencies between objects. However, correlation mining in graph databases is expensive due to the complexity of graph data. In this paper, we study the problem of mining the top-k correlative subgraphs in a database, i.e., those that share similar occurrence distributions with a given query graph. The search space of the problem is prohibitively large since every subgraph in the database is a candidate. We propose an efficient algorithm, TopCor, which mines the top-k correlative graphs by exploring only the candidate graphs in the projected database of a query graph. We develop three key techniques for TopCor: an effective correlation-checking mechanism, a powerful pruning criterion, and a set of useful rules for candidate exploration. These three techniques are very effective in directing the search to highly correlative candidate graphs. Experiments confirm their effectiveness and show that TopCor is more than an order of magnitude faster than CGSearch, the state-of-the-art threshold-based correlative graph mining algorithm.
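As a concrete illustration of what "similar occurrence distributions" means, the following sketch scores a candidate subgraph against a query by the Pearson correlation of their binary occurrence vectors over the database. Pearson correlation is a common choice in this line of work, though not necessarily the exact measure TopCor uses.

```python
# Illustrative only: correlation between the occurrence vectors of a query
# graph and a candidate subgraph over a database of N graphs. Each vector
# entry is 1 if the (sub)graph occurs in that database graph, else 0.
from math import sqrt

def occurrence_correlation(occ_q, occ_g):
    """occ_q, occ_g: equal-length lists of 0/1 occurrence flags."""
    n = len(occ_q)
    mq, mg = sum(occ_q) / n, sum(occ_g) / n
    cov = sum((a - mq) * (b - mg) for a, b in zip(occ_q, occ_g)) / n
    var_q = sum((a - mq) ** 2 for a in occ_q) / n
    var_g = sum((b - mg) ** 2 for b in occ_g) / n
    return cov / sqrt(var_q * var_g)

# Subgraphs that occur in exactly the same database graphs correlate maximally.
print(occurrence_correlation([1, 1, 0, 1, 0], [1, 1, 0, 1, 0]))  # near 1.0
```

The expensive part, which TopCor's pruning attacks, is that computing such a score naively requires a subgraph-isomorphism test per candidate per database graph.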

Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper compares the results achieved on public graph databases for the classification of symbols and letters using this graph signature with those obtained using the graph edit distance.
Abstract: In this article we present a new approach for the classification of structured data using graphs. We propose to address the complexity of measuring distances between graphs by using a new graph signature. We present an extension of the vector representation based on pattern frequency that integrates labelling information. In this paper, we compare the results achieved on public graph databases for the classification of symbols and letters using this graph signature with those obtained using the graph edit distance.
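A toy version of a frequency-based signature illustrates the idea: count node labels and labelled edge patterns, then compare graphs by the L1 distance between the resulting count vectors, which is far cheaper than computing a graph edit distance. The pattern set and distance here are invented for illustration and differ from the paper's actual signature.

```python
# Toy frequency-based graph signature: node-label counts plus counts of
# (sorted) label pairs for each edge. Not the paper's signature, just the idea.
from collections import Counter

def signature(node_labels, edges):
    """node_labels: {node: label}; edges: iterable of (u, v) node pairs."""
    sig = Counter(node_labels.values())  # node-label frequencies
    for u, v in edges:
        # Labelled edge pattern, order-normalised for undirected graphs.
        sig[tuple(sorted((node_labels[u], node_labels[v])))] += 1
    return sig

def l1_distance(sig_a, sig_b):
    keys = set(sig_a) | set(sig_b)
    return sum(abs(sig_a[k] - sig_b[k]) for k in keys)

g1 = signature({1: "C", 2: "O", 3: "C"}, [(1, 2), (2, 3)])
g2 = signature({1: "C", 2: "O", 3: "N"}, [(1, 2), (2, 3)])
print(l1_distance(g1, g1))  # 0: identical graphs have identical signatures
print(l1_distance(g1, g2))
```

Unlike the edit distance, such a vector comparison runs in time linear in graph size, at the cost of being a lossy summary: distinct graphs can share a signature.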

Proceedings ArticleDOI
15 May 2009
TL;DR: This work presents an empirical comparison of the major approaches for graph classification introduced in the literature, namely SubdueCL, frequent subgraph mining in conjunction with SVMs (FSG+SVM), walk-based graph kernels, frequent subgraph mining in conjunction with AdaBoost (FSG+AdaBoost), and DT-CLGBI, and finds that FSG+SVM, FSG+AdaBoost, and DT-CLGBI have comparable performance in most cases.
Abstract: The graph classification problem is learning to classify separate, individual graphs in a graph database into two or more categories. A number of algorithms have been introduced for this problem. We present an empirical comparison of the major approaches for graph classification introduced in the literature, namely SubdueCL, frequent subgraph mining in conjunction with SVMs, walk-based graph kernels, frequent subgraph mining in conjunction with AdaBoost, and DT-CLGBI. Experiments are performed on five real-world data sets from the Mutagenesis and Predictive Toxicology domains, which are considered benchmark data sets for the graph classification problem. Additionally, experiments are performed on a corpus of artificial data sets constructed to investigate the performance of the algorithms across a variety of parameters of interest. Our conclusions are as follows. On data sets where the underlying concept has a high average degree, walk-based graph kernels perform poorly compared to the other approaches: the kernel's hypothesis space is walks, which is insufficient for capturing concepts involving significant structure. On data sets where the underlying concept is disconnected, SubdueCL performs poorly compared to the other approaches: its hypothesis space is connected graphs, which is insufficient for capturing concepts that consist of a disconnected graph. FSG+SVM, FSG+AdaBoost, and DT-CLGBI have comparable performance in most cases.
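To make the walk-kernel criticism concrete, here is a minimal fixed-length walk kernel: each graph is represented by the multiset of node-label sequences along its walks of k edges, and the kernel value is the dot product of the two count vectors. Real walk kernels sum over all lengths with decay factors; this sketch only illustrates why a hypothesis space of walks can miss larger structure.

```python
# Minimal fixed-length walk kernel sketch (not any paper's exact kernel).
from collections import Counter

def walk_features(adj, labels, k):
    """adj: {node: set of neighbours}; labels: {node: label}; walks of k edges."""
    walks = [[n] for n in adj]
    for _ in range(k):
        walks = [w + [nxt] for w in walks for nxt in adj[w[-1]]]
    # A walk is summarised only by its label sequence; structure beyond
    # that (e.g. whether the walk revisits a node) is lost.
    return Counter(tuple(labels[n] for n in w) for w in walks)

def walk_kernel(g1, g2, k=2):
    f1, f2 = walk_features(*g1, k), walk_features(*g2, k)
    return sum(f1[s] * f2[s] for s in f1)

# A labelled path graph a-b-a, compared with itself.
path = ({1: {2}, 2: {1, 3}, 3: {2}}, {1: "a", 2: "b", 3: "a"})
print(walk_kernel(path, path))
```

Note that a walk may traverse the same edge back and forth, so a small cycle and a path can generate overlapping label sequences; this "tottering" is one reason walks under-represent structural concepts.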

Proceedings ArticleDOI
Marc Najork1
29 Jun 2009
TL;DR: The Scalable Hyperlink Store is described, a distributed in-memory "database" for storing large portions of the web graph that is fast, scalable, fault-tolerant, and incrementally updateable.
Abstract: This paper describes the Scalable Hyperlink Store, a distributed in-memory "database" for storing large portions of the web graph. SHS is an enabler for research on structural properties of the web graph as well as new link-based ranking algorithms. Previous work on specialized hyperlink databases focused on finding efficient compression algorithms for web graphs. By contrast, this work focuses on the systems issues of building such a database. Specifically, it describes how to build a hyperlink database that is fast, scalable, fault-tolerant, and incrementally updateable.
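The compression work the abstract contrasts itself with typically stores each page's outlinks as a sorted list of integer IDs, encodes the gaps between consecutive IDs, and packs the gaps into variable-length bytes. A minimal gap-plus-varint codec, far simpler than SHS's actual encoding, looks like this:

```python
# Sketch of gap + varint compression for adjacency lists, the classic idea
# behind compact web-graph stores. Not SHS's actual format.

def encode_links(link_ids):
    """Encode a list of non-negative integer link IDs as gap-coded varints."""
    out, prev = bytearray(), 0
    for lid in sorted(link_ids):
        gap = lid - prev          # gaps are small when IDs cluster
        prev = lid
        while gap >= 0x80:        # varint: 7 payload bits per byte,
            out.append((gap & 0x7F) | 0x80)  # MSB set means "more bytes follow"
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_links(data):
    ids, prev, shift, gap = [], 0, 0, 0
    for b in data:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7            # continuation byte
        else:
            prev += gap           # undo the gap coding
            ids.append(prev)
            gap, shift = 0, 0
    return ids

links = [1048576, 3, 1048580, 42]
print(len(encode_links(links)), "bytes for", len(links), "links")
```

Small gaps compress to single bytes, which is why sorting the IDs first matters; the systems contribution of SHS is everything around such a codec: distribution, fault tolerance, and incremental updates.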

Journal ArticleDOI
TL;DR: The framework uses the introduced fuzzy conceptual model, which provides modeling of complex and semantically rich content and knowledge of video data, including uncertainty, and it supports various flexible queries, including (fuzzy) semantic, temporal and spatial queries, based on the video data model.

Book ChapterDOI
04 Sep 2009
TL;DR: After comparing existing vulnerability databases, a new method is proposed for automatically extracting vulnerability information from textual descriptions, and a prototype was implemented to prove the applicability of the proposed method for attack graph construction.
Abstract: Attack graphs are an effective method to model, analyze, and evaluate the security of complicated computer systems or networks. The attack graph workflow consists of three parts: information gathering, attack graph construction, and visualization. To construct an attack graph, runtime information on the target system or network environment must be monitored, gathered, and later evaluated against existing descriptions of known vulnerabilities. The output is visualized as a graph structure for further measurement. The information gatherer, the vulnerability repository, and the visualization module are three important components of an attack graph constructor. However, high-quality attack graph construction relies on up-to-date vulnerability information. Several such databases already exist, maintained by security companies, communities, or governments. They cannot be used directly for generating attack graphs because the information they provide is not unified. This paper addresses the automatic extraction of meaningful information from various existing vulnerability databases. After comparing existing vulnerability databases, a new method is proposed for automatically extracting vulnerability information from textual descriptions. Finally, a prototype was implemented to prove the applicability of the proposed method for attack graph construction.
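The kind of extraction the paper proposes can be illustrated with a few regular expressions over a free-text description. The patterns, field names, and sample description below are all invented for illustration and are not those of the paper's prototype.

```python
# Hypothetical sketch: pulling structured fields out of a free-text
# vulnerability description with regular expressions.
import re

DESCRIPTION = ("Buffer overflow in ExampleHTTPd 2.4.1 allows remote "
               "attackers to execute arbitrary code via a long Host header.")

PATTERNS = {
    "vuln_type": r"^(Buffer overflow|SQL injection|Cross-site scripting)",
    "product":   r"in ([A-Z]\w+) ([\d.]+)",          # name + version
    "attacker":  r"allows (remote|local) attackers",
    "impact":    r"attackers to ([a-z ]+?) via",      # non-greedy: stop at "via"
}

def extract(text):
    """Return whichever fields the patterns can find in the description."""
    record = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            record[field] = m.groups() if field == "product" else m.group(1)
    return record

print(extract(DESCRIPTION))
```

In practice, descriptions in real databases are far less regular than this sample, which is exactly why the paper argues for a dedicated extraction method rather than ad hoc patterns.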

Journal ArticleDOI
TL;DR: A querying framework, along with a number of graph-theoretic algorithms from simple neighborhood queries to shortest paths to feedback loops, that is applicable to all sorts of graphs and structures present in graph-based pathway databases, from PPIs (protein-protein interactions) to metabolic and signaling pathways.
Abstract: Graph-based pathway ontologies and databases are widely used to represent data about cellular processes. This representation makes it possible to programmatically integrate cellular networks and to investigate them using the well-understood concepts of graph theory in order to predict their structural and dynamic properties. An extension of this graph representation, namely hierarchically structured or compound graphs, in which a member of a biological network may recursively contain a sub-network of a logically similar group of biological objects, provides many additional benefits for the analysis of biological pathways, including reduction of complexity by decomposition into distinct components or modules. In this regard, it is essential to effectively query such integrated large compound networks to extract the sub-networks of interest with the help of efficient algorithms and software tools. Towards this goal, we developed a querying framework, along with a number of graph-theoretic algorithms ranging from simple neighborhood queries to shortest paths and feedback loops, that is applicable to all sorts of graph-based pathway databases, from PPIs (protein-protein interactions) to metabolic and signaling pathways. The framework is unique in that it can account for compound or nested structures and ubiquitous entities present in the pathway data. In addition, the queries may be related to each other through "AND" and "OR" operators, and can be recursively organized into a tree, in which the result of one query might be a source and/or target for another, to form more complex queries. The algorithms were implemented within the querying component of a new version of the software tool PATIKA web (Pathway Analysis Tool for Integration and Knowledge Acquisition) and have proven useful for answering a number of biologically significant questions for large graph-based pathway databases. The PATIKA Project Web site is http://www.patika.org. PATIKA web version 2.1 is available at http://web.patika.org.
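Two of the simpler query types mentioned above, neighborhood queries and shortest paths, can be sketched as breadth-first searches over a plain adjacency-list graph. The compound (nested) structure and ubiquitous-entity handling that make PATIKA's versions non-trivial are omitted here, and the pathway below is a toy example.

```python
# BFS-based neighborhood and shortest-path queries on a directed
# adjacency-list graph. Compound-graph handling is deliberately omitted.
from collections import deque

def neighborhood(adj, source, radius):
    """All nodes within `radius` hops of `source`."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand past the radius
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def shortest_path(adj, source, target):
    """One shortest path from source to target, or None if unreachable."""
    parent, queue = {source: None}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:  # walk parents back to the source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

# Toy signaling cascade.
pathway = {"EGF": ["EGFR"], "EGFR": ["RAS"], "RAS": ["RAF"], "RAF": ["MEK"]}
print(neighborhood(pathway, "EGF", 2))
print(shortest_path(pathway, "EGF", "MEK"))  # ['EGF', 'EGFR', 'RAS', 'RAF', 'MEK']
```

A feedback-loop query can be built on the same machinery by searching for a path from a node back to itself; the paper's framework additionally lets such queries feed into one another via "AND"/"OR" composition.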