Topic

Semi-structured data

About: Semi-structured data is a research topic. Over its lifetime, 743 publications have been published within this topic, receiving 14,401 citations. The topic is also known as semi structured data and semistructured data.


Papers
Book Chapter
08 Jan 1997
TL;DR: The main purpose of the paper is to isolate the essential aspects of semistructured data, and survey some proposals of models and query languages for semi-structured data.
Abstract: The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases. As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research. The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data.

878 citations

Journal Article
TL;DR: Issues such as limitations, advantages, concerns and doubts regarding NoSQL databases are discussed.
Abstract: Many organizations collect vast amounts of customer, scientific, sales, and other data for future analysis. Traditionally, most of these organizations have stored structured data in relational databases for subsequent access and analysis. However, a growing number of developers and users have begun turning to various types of nonrelational databases, now frequently called NoSQL databases. Nonrelational databases, including hierarchical, graph, and object-oriented databases, have been around since the late 1960s. However, new types of NoSQL databases are being developed, and only now are they beginning to gain market traction. Different NoSQL databases take different approaches. What they have in common is that they're not relational. Their primary advantage is that, unlike relational databases, they handle unstructured data such as word-processing files, e-mail, multimedia, and social media efficiently. This paper discusses issues such as limitations, advantages, concerns and doubts regarding NoSQL databases.

544 citations

Journal Article
TL;DR: This paper presents SoftMealy, a novel wrapper representation formalism based on a finite-state transducer and contextual rules that can wrap a wide range of semistructured Web pages because FSTs can encode each different attribute permutation as a path.

476 citations

Proceedings Article
26 Feb 2002
TL;DR: This work is motivated by the support for change control in the context of the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data, and offers a diff algorithm for XML data that runs on average in linear time vs. quadratic time for previous algorithms.
Abstract: We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of the context, our algorithm has to be very efficient in terms of speed and memory space even at the cost of some loss of quality. Also, it considers, besides insertions, deletions and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. It also uses XML specific information such as ID attributes. We provide a performance analysis of the algorithm. We show that it runs on average in linear time vs. quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP-hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the optimal in terms of quality. Finally we present experiments on a small sample of XML pages found on the Web.
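The idea of using signatures to match unchanged subtrees can be illustrated with a minimal sketch. This is not the Xyleme algorithm itself: the function names `signature` and `match_unchanged`, the choice of SHA-1, and the decision to ignore tail text are all assumptions made for illustration. The sketch covers only the exact-match step and omits propagation to ancestors and descendants, ID attributes, and move detection.

```python
import hashlib
import xml.etree.ElementTree as ET


def signature(node, table):
    """Hash the subtree rooted at `node` and record it in `table`.

    The signature covers the tag, the sorted attributes, the stripped
    text, and the ordered child signatures (tail text is ignored in
    this sketch).
    """
    h = hashlib.sha1()
    h.update(node.tag.encode())
    for key in sorted(node.attrib):
        h.update(f"|{key}={node.attrib[key]}".encode())
    h.update(b"|" + (node.text or "").strip().encode())
    for child in node:
        h.update(b"|" + signature(child, table).encode())
    digest = h.hexdigest()
    table.setdefault(digest, []).append(node)
    return digest


def match_unchanged(old_xml, new_xml):
    """Pair up subtrees whose signatures are identical in both versions."""
    old_sigs, new_sigs = {}, {}
    signature(ET.fromstring(old_xml), old_sigs)
    signature(ET.fromstring(new_xml), new_sigs)
    matches = []
    for digest, old_nodes in old_sigs.items():
        # Equal digests are treated as identical subtrees.
        for old_node, new_node in zip(old_nodes, new_sigs.get(digest, [])):
            matches.append((old_node, new_node))
    return matches
```

Because each signature is computed once per node in a single bottom-up pass, the matching step is linear in the size of the two documents, which is consistent with the efficiency goal stated in the abstract. A subtree that has moved but is otherwise unchanged keeps its signature, so it still matches.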

474 citations

Patent
25 Feb 2002
TL;DR: This patent presents a method for encoding XML tree data that includes the step of encoding the semi-structured data into strings of arbitrary length in a way that maintains non-structural and structural information about the XML data, and enables indexing the encoded XML data in a manner that facilitates efficient search and browsing.
Abstract: A method for encoding XML tree data that includes the step of encoding the semi-structured data into strings of arbitrary length in a way that maintains non-structural and structural information about the XML data, and enables indexing the encoded XML data in a way that facilitates efficient search and browsing.
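One way to picture encoding tree data as strings that keep both structure and content is a root-to-leaf path encoding. The sketch below is an illustrative assumption, not the encoding claimed in the patent: the `path=value` string layout and the helper names `encode_paths` and `prefix_search` are hypothetical. Each text-bearing element becomes a string whose prefix carries the structure and whose suffix carries the value, and a sorted list of these strings serves as a simple index.

```python
import bisect
import xml.etree.ElementTree as ET


def encode_paths(xml_text):
    """Encode each text-bearing element as 'root/.../tag=text', sorted."""
    root = ET.fromstring(xml_text)
    encoded = []

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            encoded.append(f"{path}={text}")
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(encoded)


def prefix_search(index, prefix):
    """Return all encoded strings that start with `prefix`."""
    # In a sorted list, entries sharing a prefix are contiguous.
    start = bisect.bisect_left(index, prefix)
    hits = []
    for entry in index[start:]:
        if not entry.startswith(prefix):
            break
        hits.append(entry)
    return hits
```

With this layout, browsing a subtree corresponds to a prefix query such as `prefix_search(index, "/lib/book/")`, and the sorted order lets the lookup start with a binary search rather than a full scan.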

384 citations


Network Information
Related Topics (5)
Ontology (information science): 57K papers, 869.1K citations, 78% related
Web service: 57.6K papers, 989K citations, 77% related
Server: 79.5K papers, 1.4M citations, 76% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Web page: 50.3K papers, 975.1K citations, 75% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    16
2021    10
2020    25
2019    14
2018    23