
Showing papers on "XML" published in 2010


Proceedings Article
23 Aug 2010
TL;DR: LTP (Language Technology Platform) is an integrated Chinese processing platform which includes a suite of high-performance natural language processing modules and relevant corpora; its syntactic and semantic parsing modules achieved good results in relevant evaluations such as CoNLL and SemEval.
Abstract: LTP (Language Technology Platform) is an integrated Chinese processing platform which includes a suite of high-performance natural language processing (NLP) modules and relevant corpora. Especially for the syntactic and semantic parsing modules, we achieved good results in relevant evaluations such as CoNLL and SemEval. Based on an XML internal data representation, users can easily use these modules and corpora by invoking DLL (Dynamic Link Library) or Web service APIs (Application Program Interfaces), and view the processing results directly in the visualization tool.

428 citations
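
A hypothetical sketch of how such a Web service API might be invoked from Python: the endpoint URL, query parameters, and element/attribute names below are illustrative placeholders, not LTP's actual interface.

```python
# Illustrative client for an LTP-style NLP web service that returns XML.
# The URL and the <sentence>/<word cont=".." pos=".."/> layout are assumptions.
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

def analyze(text, url="http://example.org/ltp/api"):
    params = urllib.parse.urlencode({"text": text, "task": "pos"})
    with urllib.request.urlopen(f"{url}?{params}") as resp:
        root = ET.fromstring(resp.read())
    # Collect (word, part-of-speech) pairs from the assumed result layout.
    return [(w.get("cont"), w.get("pos")) for w in root.iter("word")]
```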


Patent
30 Dec 2010
TL;DR: In this article, the authors describe a system in which a software rendering application is interposed in the data communication path between a browser running on a user computer and the internet data sources (for example, internet-accessible server computers) from which the user's browser wants to receive information.
Abstract: A computer network communication method and system wherein a software rendering application is interposed in the data communication path between a browser running on a user computer and the internet data sources (for example, internet-accessible server computers) from which the user's browser wants to receive information. The software rendering application gets data from internet data sources, but this data may contain malware. To provide enhanced security, the software rendering application renders this data to form a new browser-readable code set (for example, an XML page with CSS layers), and this new and safe browser-readable code set is sent along to the browser on the user computer for appropriate presentation to the user. As part of the rendering process, dedicated and distinct virtual machines may be used to render certain portions of the data, such as executable code. These virtual machines may be watched, and quickly destroyed if it is detected that they have encountered some type of malware.

179 citations


Patent
12 Mar 2010
TL;DR: In this article, a language-neutral speech grammar extensible markup language (GRXML) document and a localized response document are used to build a localized GRXML document.
Abstract: A language-neutral speech grammar extensible markup language (GRXML) document and a localized response document are used to build a localized GRXML document. The language-neutral GRXML document specifies an initial grammar rule element. The initial grammar rule element specifies a given response type identifier and a given action. The localized response document contains a given response entry that specifies the given response type identifier and a given response in a given language. The localized GRXML document specifies a new grammar rule element. The new grammar rule element specifies the given response in the given language and the given action. The localized GRXML document is installed in an interactive voice response (IVR) system. The localized GRXML document configures the IVR system to perform the given action when a user of the IVR system speaks the given response to the IVR system.

152 citations
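
To make the merge concrete, here is a minimal Python sketch that combines a language-neutral rule (response-type identifier plus action) with a localized response table to emit a localized GRXML rule. The grammar namespace is the standard SRGS/GRXML one; the element layout and function names are simplified assumptions, not the patent's exact schema.

```python
# Minimal sketch: build a localized GRXML <rule> from a response-type id,
# a language-neutral action, and a per-language response table.
import xml.etree.ElementTree as ET

GRXML_NS = "http://www.w3.org/2001/06/grammar"  # standard SRGS/GRXML namespace

def localize_rule(response_type, action, responses):
    """responses maps response-type ids to phrases in the target language."""
    ET.register_namespace("", GRXML_NS)
    rule = ET.Element(f"{{{GRXML_NS}}}rule", {"id": response_type})
    item = ET.SubElement(rule, f"{{{GRXML_NS}}}item")
    item.text = responses[response_type]   # localized spoken response
    tag = ET.SubElement(rule, f"{{{GRXML_NS}}}tag")
    tag.text = action                      # action stays language-neutral
    return ET.tostring(rule, encoding="unicode")

print(localize_rule("confirm_yes", "out.action='confirm';",
                    {"confirm_yes": "oui"}))
```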


Proceedings ArticleDOI
23 Aug 2010
TL;DR: PAGE is described, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content.
Abstract: There is a plethora of established and proposed document representation formats but none that can adequately support individual stages within an entire sequence of document image analysis methods (from document image enhancement to layout analysis to OCR) and their evaluation. This paper describes PAGE, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content. The suitability of the framework to the evaluation of entire workflows as well as individual stages has been extensively validated by using it in high-profile applications such as in public contemporary and historical ground-truthed datasets and in the ICDAR Page Segmentation competition series.

145 citations
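
As a rough illustration of consuming such a representation, the Python sketch below walks the text regions and their coordinates in a PAGE-style file. The element names follow the published PAGE format, but the namespace URI varies by schema version, so it is read off the root element; handling both coordinate styles is an assumption about version differences.

```python
# Sketch: list (region id, polygon points) pairs from a PAGE-style XML file.
import xml.etree.ElementTree as ET

def text_regions(path):
    root = ET.parse(path).getroot()
    ns = root.tag.split("}")[0].strip("{")   # namespace of this PAGE version
    for region in root.iter(f"{{{ns}}}TextRegion"):
        coords = region.find(f"{{{ns}}}Coords")
        pts = ""
        if coords is not None:
            pts = coords.get("points")       # newer style: "x1,y1 x2,y2 ..."
            if pts is None:                  # older style: <Point x=".." y=".."/>
                pts = " ".join(f'{p.get("x")},{p.get("y")}'
                               for p in coords.findall(f"{{{ns}}}Point"))
        yield region.get("id"), pts
```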


Proceedings Article
Kuansan Wang, Chris Thrasher, Evelyne Viegas, Xiaolong Li, Bo-June (Paul) Hsu
02 Jun 2010
TL;DR: The properties and some applications of the Microsoft Web N-gram corpus are described and the usages of the corpus are demonstrated here in two NLP tasks: phrase segmentation and word breaking.
Abstract: This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distribution of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service so that it can be updated as deemed necessary by the user community to include new words and phrases constantly being added to the Web. Secondly, the corpus makes available various sections of a Web document, specifically the body, title, and anchor text, as separate models, as text contents in these sections are found to possess significantly different statistical properties and are therefore treated as distinct languages from the language modeling point of view. The usages of the corpus are demonstrated here in two NLP tasks: phrase segmentation and word breaking.

122 citations
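
The general shape of a client for such an XML Web Service is sketched below; the base URL, parameters, and response format are placeholders (the real service also required a user token), so this is illustrative only.

```python
# Illustrative n-gram service client. Endpoint, parameters, and the
# "one log10 probability per response" format are assumptions.
import urllib.request
import urllib.parse

def joint_logprob(phrase, model="title",
                  base="http://example.org/webngram/rest"):
    qs = urllib.parse.urlencode({"p": phrase, "model": model})
    with urllib.request.urlopen(f"{base}/lookup?{qs}") as resp:
        return float(resp.read())   # assumed: a single log-probability value

# e.g. compare the same phrase under body vs. title statistics
# joint_logprob("xml web service", model="body")
```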


Patent
03 Nov 2010
TL;DR: In this paper, the authors propose a dynamic composition system that allows the creation of interoperable combinations of content by the publisher, determined to be an optimal combination, and offer such combinations to client devices in an interoperable way to allow simple selection by devices without complex programming, web pages, etc.
Abstract: The subject disclosure relates to dynamic composition including the ability to create interoperable combinations of content by the publisher, e.g., determined to be an optimal combination, and offer such combinations to client devices in an interoperable way to allow simple selection by devices without complex programming, web pages, etc. specific to each device. Compositions are dynamic in that new audio, video, subtitle, etc. tracks can be added to a given composition without changing any of the other tracks, e.g., by updating the composition's extensible markup language (XML), and new compositions can be created or removed at any time without changing any audio or video files. Interoperable and scalable “discovery” is also enabled whereby random devices can contact a Web server, find and play a composition matched to the given devices and users, e.g., optimal composition for a given device and user. Using the content identification and description format of compositions, devices can search, sort, browse, display, etc. content that is available, determine if it is compatible at the device, decode, and determine digital rights management (DRM) level, and content level.

102 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work provides a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries and shows that these strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.
Abstract: A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which can be used by scientists to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges) allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage based optimizations outperform an in-memory and standard database implementation by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.

95 citations
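
The "closed" semantics is the distinctive point: answers to lineage queries are sets of edges, so an answer can itself be queried again. The toy Python sketch below illustrates that idea on a bare edge-set model of a provenance graph; it is not the paper's query language.

```python
# Toy "closed" lineage query: return every dependency edge on some path
# leading into a target node. The graph model is deliberately simplified.
def lineage(edges, targets):
    """edges: set of (src, dst) dependencies; targets: data nodes of interest."""
    result, frontier = set(), set(targets)
    changed = True
    while changed:
        changed = False
        for (src, dst) in edges:
            if dst in frontier and (src, dst) not in result:
                result.add((src, dst))   # keep the edge itself, not just the node
                frontier.add(src)
                changed = True
    return result                        # a set of edges: further queryable

deps = {("a", "b"), ("b", "c"), ("x", "c"), ("c", "d")}
print(lineage(deps, {"c"}))              # edges feeding node c, transitively
```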


Proceedings ArticleDOI
01 Mar 2010
TL;DR: A series of join-based algorithms that combine semantic pruning and top-K processing to support top-K keyword search in XML databases are proposed, and extensive experimental evaluations show their performance advantages.
Abstract: Keyword search is considered to be an effective information discovery method for both structured and semi-structured data. In XML keyword search, query semantics is based on the concept of Lowest Common Ancestor (LCA). However, naive LCA-based semantics leads to exponential computation and result size. In the literature, LCA-based semantic variants (e.g., ELCA and SLCA) were proposed, which define a subset of all the LCAs as the results. While most existing work focuses on algorithmic efficiency, top-K processing for XML keyword search is an important issue that has received very little attention. Existing algorithms focusing on efficiency are designed to optimize the semantic pruning and are incapable of supporting top-K processing. On the other hand, straightforward applications of top-K techniques from other areas (e.g., relational databases) generate LCAs that may not be among the results and unnecessarily expend effort on semantic pruning. In this paper, we propose a series of join-based algorithms that combine semantic pruning and top-K processing to support top-K keyword search in XML databases. The algorithms essentially reduce the keyword query evaluation to relational joins, and incorporate the idea of the top-K join from relational databases. Extensive experimental evaluations show the performance advantages of our algorithms.

94 citations
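
For background on the SLCA semantics the paper builds on, here is the textbook Dewey-label formulation in Python: the LCA of two nodes is the longest common prefix of their labels, and an SLCA is an LCA with no other LCA below it. This illustrates the semantics only, not the paper's join-based top-K algorithms.

```python
# SLCA over Dewey labels (tuples): exhaustive pairwise LCAs for clarity,
# then keep only LCAs that are not ancestors of another LCA.
def lca(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def slca(list1, list2):
    lcas = {lca(a, b) for a in list1 for b in list2}
    return [x for x in lcas
            if not any(y != x and y[:len(x)] == x for y in lcas)]

# Dewey labels: node 1.2.1 is written (1, 2, 1)
k1 = [(1, 1, 1), (1, 2, 1)]   # nodes containing keyword 1
k2 = [(1, 2, 2), (1, 3)]      # nodes containing keyword 2
print(slca(k1, k2))            # [(1, 2)] -- deepest subtree covering both
```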


Patent
19 Jan 2010
TL;DR: A fact extraction tool set (FEX), as discussed by the authors, includes a pattern matching language which is used to find and match patterns of attributes that correspond to targeted pieces of information in the text, and extract that information.
Abstract: A fact extraction tool set (“FEX”) finds and extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction. Text annotation tools break a text, such as a document, into its base tokens and annotate those tokens or patterns of tokens with orthographic, syntactic, semantic, pragmatic and other attributes. A user-defined “Annotation Configuration” controls which annotation tools are used in a given application. XML is used as the basis for representing the annotated text. A tag uncrossing tool resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual annotators. The fact extraction tool is a pattern matching language which is used to write scripts that find and match patterns of attributes that correspond to targeted pieces of information in the text, and extract that information.

88 citations


Journal ArticleDOI
TL;DR: The NETCONF protocol and a recently introduced NETCONF data modeling language called YANG are described; YANG allows data modelers to define the syntax and semantics of device configurations, and supports translations to several XML schema languages.
Abstract: The Internet Engineering Task Force has standardized a new network configuration management protocol called NETCONF, which provides mechanisms to install, manipulate, and delete the configuration of network devices. This article describes the NETCONF protocol and a recently introduced NETCONF data modeling language called YANG. The YANG language allows data modelers to define the syntax and semantics of device configurations, and supports translations to several XML schema languages.

86 citations
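
To give a flavor of the protocol, the sketch below assembles a NETCONF <edit-config> request in Python using the base namespace from the NETCONF RFCs. A real client would send this over an SSH session (for example, with a library such as ncclient); the interface configuration fragment is invented for illustration.

```python
# Assemble a NETCONF <edit-config> RPC payload (transport not shown).
import xml.etree.ElementTree as ET

BASE = "urn:ietf:params:xml:ns:netconf:base:1.0"  # RFC 4741/6241 base namespace

def edit_config_rpc(message_id, config_element):
    ET.register_namespace("", BASE)
    rpc = ET.Element(f"{{{BASE}}}rpc", {"message-id": str(message_id)})
    edit = ET.SubElement(rpc, f"{{{BASE}}}edit-config")
    target = ET.SubElement(edit, f"{{{BASE}}}target")
    ET.SubElement(target, f"{{{BASE}}}candidate")  # edit the candidate datastore
    config = ET.SubElement(edit, f"{{{BASE}}}config")
    config.append(config_element)                  # device config, modeled in YANG
    return ET.tostring(rpc, encoding="unicode")

# Hypothetical configuration fragment, for illustration only.
iface = ET.fromstring("<interface><name>eth0</name><mtu>1500</mtu></interface>")
print(edit_config_rpc(101, iface))
```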



Proceedings Article
04 Nov 2010
TL;DR: The rationale and design of a text mining analysis platform called TXM, compatible with XML-TEI encoded corpora, are described; its design is based on a synthesis of the best available algorithms in existing textometry software.
Abstract: This paper describes the rationale and design of TXM, a text mining analysis platform compatible with XML-TEI encoded corpora. The design of this platform is based on a synthesis of the best available algorithms in existing textometry software. It also relies on identifying the most relevant open-source technologies for processing textual resources encoded in XML and Unicode, for efficient full-text search on annotated corpora and for statistical data analysis. The architecture is based on a Java toolbox articulating a full-text search engine component with a statistical computing environment and with an original import environment able to process a large variety of data sources, including XML-TEI, and to apply embedded NLP tools to them. The platform is distributed as an open-source Eclipse project for developers and in the form of two demonstrator applications for end users: a standard application to install on a workstation and an online web application framework.

Journal ArticleDOI
30 Jun 2010 - ZooKeys
TL;DR: The concept of semantic tagging and its potential for semantic enhancements to taxonomic papers is outlined and illustrated by four exemplar papers published in the present issue of ZooKeys, the first taxonomic journal to provide a complete XML-based editorial, publication and dissemination workflow implemented as a routine and cost-efficient practice.
Abstract: The concept of semantic tagging and its potential for semantic enhancements to taxonomic papers is outlined and illustrated by four exemplar papers published in the present issue of ZooKeys. The four papers were created in different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads and submitted as XML-tagged manuscripts (doi: 10.3897/zookeys.50.505 and doi: 10.3897/zookeys.50.506); (iii) generated from an author's database (doi: 10.3897/zookeys.50.485) and submitted as XML-tagged manuscript. XML tagging and semantic enhancements were implemented during the editorial process of ZooKeys using the Pensoft Mark Up Tool (PMT), specially designed for this purpose. The XML schema used was TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine Journal Archiving and Interchange Tag Suite (NLM). The following innovative methods of tagging, layout, publishing and disseminating the content were tested and implemented within the ZooKeys editorial workflow: (1) highly automated, fine-grained XML tagging based on TaxPub; (2) final XML output of the paper validated against the NLM DTD for archiving in PubMedCentral; (3) bibliographic metadata embedded in the PDF through XMP (Extensible Metadata Platform); (4) PDF uploaded after publication to the Biodiversity Heritage Library (BHL); (5) taxon treatments supplied through XML to Plazi; (6) semantically enhanced HTML version of the paper encompassing numerous internal and external links and linkouts, such as: (i) visualisation of main tag elements within the text (e.g., taxon names, taxon treatments, localities, etc.); (ii) internal cross-linking between paper sections, citations, references, tables, and figures; (iii) mapping of localities listed in the whole paper or within separate taxon treatments; (iv) taxon names autotagged, dynamically mapped and linked through the Pensoft Taxon Profile (PTP) to large international database services and indexers such as Global Biodiversity Information Facility (GBIF), National Center for Biotechnology Information (NCBI), Barcode of Life (BOLD), Encyclopedia of Life (EOL), ZooBank, Wikipedia, Wikispecies, Wikimedia, and others; (v) GenBank accession numbers autotagged and linked to NCBI; (vi) external links of taxon names to references in PubMed, Google Scholar, Biodiversity Heritage Library and other sources. With the launching of the working example, ZooKeys becomes the first taxonomic journal to provide a complete XML-based editorial, publication and dissemination workflow implemented as a routine and cost-efficient practice. It is anticipated that XML-based workflow will also soon be implemented in botany through PhytoKeys, a forthcoming partner journal of ZooKeys. The semantic markup and enhancements are expected to greatly extend and accelerate the way taxonomic information is published, disseminated and used.

Proceedings ArticleDOI
01 Oct 2010
TL;DR: The model was developed as part of the FP7 ICT Integrated Project SLA@SOI, and has been applied to a range of industrial use-cases, including; ERP hosting, Enterprise IT, live-media streaming and health-care provision.
Abstract: This paper describes SLA★, a domain-independent syntax for machine-readable Service Level Agreements (SLAs) and SLA templates. Historically, SLA★ was developed as a generalisation and refinement of the web-service specific XML standards: WS-Agreement, WSLA, and WSDL. Instead of web-services, however, SLA★ deals with services in general, and instead of XML, it is language independent. SLA★ provides a specification of SLA(T) content at a fine-grained level of detail, which is both richly expressive and inherently extensible: supporting controlled customisation to arbitrary domain-specific requirements. The model was developed as part of the FP7 ICT Integrated Project SLA@SOI, and has been applied to a range of industrial use-cases, including: ERP hosting, Enterprise IT, live-media streaming and health-care provision. At the time of writing, the abstract syntax has been realised in concrete form as a Java API, XML-Schema, and BNF Grammar.

Proceedings ArticleDOI
05 May 2010
TL;DR: A lightweight XML-based context representation schema called ContextML is presented, in which context information is categorized into scopes and related to different types of entities (e.g. user, device), and which is also applied to encoding management messages to allow for a flexible framework supporting gradual plug & play extendibility and mobility.
Abstract: Context representation is a fundamental process in developing context-aware systems for the pervasive world. We present a lightweight XML-based context representation schema called ContextML in which context information is categorized into scopes and related to different types of entities (e.g. user, device). The schema is also applied to encoding management messages in order to allow for a flexible framework supporting gradual plug & play extendibility and mobility. ContextML is tailored to be used for REST-based communication between the framework components. Explanation of the schema is provided with the help of real-world examples. Moreover, the European C-CAST testbed is introduced, embracing a variety of context providers and application domains.
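
A hedged sketch of the scope/entity idea: a context update is scoped (e.g., "position") and attached to a typed entity (e.g., a user). The element and attribute names below are modeled on the paper's description, not copied from the actual ContextML schema.

```python
# Build a ContextML-like update: typed entity + scope + parameter list.
import xml.etree.ElementTree as ET

def context_update(entity_id, entity_type, scope, params):
    ctx = ET.Element("ctxEl")
    ET.SubElement(ctx, "entity", {"id": entity_id, "type": entity_type})
    ET.SubElement(ctx, "scope").text = scope
    data = ET.SubElement(ctx, "dataPart")
    for name, value in params.items():
        ET.SubElement(data, "par", {"n": name}).text = str(value)
    return ET.tostring(ctx, encoding="unicode")

print(context_update("alice", "user", "position",
                     {"lat": 45.06, "lon": 7.66}))
```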

Patent
28 May 2010
TL;DR: In this article, the authors propose a system and methods for the delivery of user-controlled resources in cloud environments via a resource specification language wrapper, such as an XML (extensible markup language) wrapper.
Abstract: Embodiments relate to systems and methods for the delivery of user-controlled resources in cloud environments via a resource specification language wrapper. In embodiments, the user of a client machine may wish to contribute resources from that machine to a cloud-based network via a network connection over a limited or defined period. To expose the user-controlled resources to one or more clouds for use, the user may transmit a contribution request encoding the user-controlled resources in a specification language wrapper, such as an XML (extensible markup language) wrapper. The specification language wrapper can embed the set of user-controlled resources, such as processor time, memory, and/or other resources, in an XML or other format to transmit to a marketplace engine which can place the set of user-controlled resources into a resource pool, for selection by marketplace clouds. The specification language wrapper can indicate access controls or restrictions on the contributed resources.

Journal ArticleDOI
TL;DR: The query-by-keyword paradigm has emerged due to the desire to search multimedia content in terms of semantic concepts using keywords or sentences rather than low-level multimedia descriptors.
Abstract: Early prototype multimedia database management systems used the query-by-example paradigm to respond to user queries. Users needed to formulate their queries by providing examples or sketches. The query-by-keyword paradigm, on the other hand, has emerged due to the desire to search multimedia content in terms of semantic concepts using keywords or sentences rather than low-level multimedia descriptors. After all, it's much easier to formulate some queries by using keywords. However, some queries are still easier to formulate by examples or sketches-for example, the trajectory of a moving object.

Journal ArticleDOI
TL;DR: This paper classifies, reviews, and experimentally compares major methods of element similarity measures and their combinations, and aims at presenting a unified view which is useful when developing a new element similarity measure, when implementing an XML schema matching component, when using an XML schema matching system, and when comparing XML schema matching systems.
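
One concrete instance of an element similarity measure of the kind such surveys compare is a character-trigram similarity between element names; the Python sketch below is a generic baseline, not a measure taken from the paper.

```python
# Character-trigram Jaccard similarity between XML element names.
def trigrams(name):
    s = f"##{name.lower()}##"                 # pad so short names still work
    return {s[i:i + 3] for i in range(len(s) - 2)}

def name_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(name_similarity("shipAddress", "shippingAddress"))  # high
print(name_similarity("shipAddress", "price"))            # low
```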

Journal ArticleDOI
TL;DR: The Systems Biology Results Markup Language (SBRML) is proposed, an XML-based language that associates a model with several datasets and provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes.
Abstract: MOTIVATION: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. RESULTS: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results.

Journal ArticleDOI
01 Sep 2010
TL;DR: A query compilation technique is reported that enables the construction of alternative efficient query providers for Microsoft's Language Integrated Query (LINQ) framework, faithfully preserving list order and nesting, both core features of the LINQ data model.
Abstract: We report on a query compilation technique that enables the construction of alternative efficient query providers for Microsoft's Language Integrated Query (LINQ) framework. LINQ programs are mapped into an intermediate algebraic form, suitable for execution on any SQL:1999-capable relational database system. This compilation technique leads to query providers that (1) faithfully preserve list order and nesting, both being core features of the LINQ data model, (2) support the complete family of LINQ's Standard Query Operators, (3) bring database support to LINQ to XML, where the original provider performs in-memory query evaluation, and, most importantly, (4) emit SQL statement sequences whose size is determined only by the input query's result type (and is thus independent of the database size). A sample query scenario uses this LINQ provider to marry database-resident TPC-H and XMark data, resulting in a unique query experience that exhibits quite promising performance characteristics, especially for large data instances.

Journal ArticleDOI
TL;DR: A systematic survey of the state-of-the-art MPEG-7 based multimedia ontologies is presented, and issues that hinder interoperability as well as possible directions towards their harmonisation are highlighted.
Abstract: Machine-understandable metadata forms the main prerequisite for the intelligent services envisaged in a Web which, going beyond mere data exchange, provides for effective content access, sharing and reuse. MPEG-7, despite providing a comprehensive set of tools for the standardised description of audiovisual content, is largely compromised by the use of XML, which leaves the largest part of the intended semantics implicit. Aspiring to formalise MPEG-7 descriptions and enhance multimedia metadata interoperability, a number of multimedia ontologies have been proposed. Though sharing a common vision, the developed ontologies are characterised by substantial conceptual differences, reflected both in the modelling of MPEG-7 description tools and in the linking with domain ontologies. Delving into the principles underlying their engineering, we present a systematic survey of the state-of-the-art MPEG-7 based multimedia ontologies, and highlight issues that hinder interoperability as well as possible directions towards their harmonisation.

Journal ArticleDOI
TL;DR: The first task of the arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection, as well as methods for coping with the eccentricities that TeX encourages.
Abstract: We describe an experiment transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over the collection, in part or entirely, while collecting statistics about missing bindings and other errors. This guides debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and software test harness. We have now processed the complete arXiv collection through 2006, consisting of more than 400,000 documents (a complete run is a processor-year-size undertaking), continuously improving our success rate. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no or minor warnings. While the remaining 30% can also be converted, their quality is doubtful, due to unsupported macros or conversion errors.
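
A minimal sketch of the kind of build driver described: run LaTeXML over a batch of files, record per-file status, and summarize. It assumes the latexml command-line tool is installed; warning detection here is deliberately crude, and the real arXMLiv system distributes this work across many hosts.

```python
# Batch-convert .tex files with LaTeXML and tally outcomes.
import subprocess
from pathlib import Path

def convert_batch(tex_files, outdir):
    stats = {"ok": 0, "warn": 0, "fail": 0}
    for tex in tex_files:
        dest = Path(outdir) / (Path(tex).stem + ".xml")
        proc = subprocess.run(
            ["latexml", f"--destination={dest}", str(tex)],
            capture_output=True, text=True)
        if proc.returncode != 0:
            stats["fail"] += 1
        elif "Warning" in proc.stderr:   # crude: scan the log for warnings
            stats["warn"] += 1
        else:
            stats["ok"] += 1
    return stats

# convert_batch(Path("papers").glob("*.tex"), "converted")
```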

Journal ArticleDOI
TL;DR: An IR-style approach which utilizes the statistics of underlying XML data to address keyword search challenges; it designs novel formulae to identify the search-for nodes and search-via nodes of a query, and presents a novel XML TF*IDF ranking strategy to rank the individual matches of all possible search intentions.
Abstract: Inspired by the great success of information retrieval (IR) style keyword search on the web, keyword search on XML has emerged recently. The difference between text databases and XML databases results in three new challenges: 1) Identify the user's search intention, i.e., identify the XML node types that the user wants to search for and search via. 2) Resolve keyword ambiguity problems: a keyword can appear as both a tag name and a text value of some node; a keyword can appear as the text values of different XML node types and carry different meanings; a keyword can appear as the tag name of different XML node types with different meanings. 3) As the search results are subtrees of the XML document, a new scoring function is needed to estimate their relevance to a given query. However, existing methods cannot resolve these challenges, and thus return low result quality in terms of query relevance. In this paper, we propose an IR-style approach which utilizes the statistics of underlying XML data to address these challenges. We first propose specific guidelines that a search engine should meet in both search intention identification and relevance-oriented ranking for search results. Then, based on these guidelines, we design novel formulae to identify the search-for nodes and search-via nodes of a query, and present a novel XML TF*IDF ranking strategy to rank the individual matches of all possible search intentions. To complement our result ranking framework, we also take popularity into consideration for results that have comparable relevance scores. Lastly, extensive experiments have been conducted to show the effectiveness of our approach.
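
The intuition behind an XML-aware TF*IDF can be sketched as computing term statistics per node type rather than per document, so the same keyword is weighted differently under different tags. The paper's actual formulae differ; the Python below uses a generic smoothed tf*idf for illustration.

```python
# Per-node-type TF*IDF: "xml" under <title> is scored against other titles,
# not against <author> text. Generic smoothed tf * log(1 + N/df) weighting.
import math
from collections import Counter, defaultdict

def xml_tfidf(nodes):
    """nodes: list of (node_type, text); returns one weight dict per node."""
    by_type = defaultdict(list)
    for ntype, text in nodes:
        by_type[ntype].append(text.lower().split())
    df = {ntype: Counter(t for ws in texts for t in set(ws))
          for ntype, texts in by_type.items()}
    n = {ntype: len(texts) for ntype, texts in by_type.items()}
    out = []
    for ntype, text in nodes:
        tf = Counter(text.lower().split())
        out.append({t: f * math.log(1 + n[ntype] / df[ntype][t])
                    for t, f in tf.items()})
    return out

docs = [("title", "xml keyword search"), ("title", "xml query processing"),
        ("author", "li")]
print(xml_tfidf(docs)[0])   # "keyword"/"search" outweigh the common "xml"
```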

Journal Article
TL;DR: A table of contents for the INEX 2009 proceedings, covering invited talks and the Ad Hoc, Book, Efficiency, Entity Ranking, Interactive, Link the Wiki, Question Answering, and XML Mining tracks.
Abstract: Contents of the INEX 2009 proceedings.
Invited: Is There Something Quantum-Like about the Human Mental Lexicon?; Supporting for Real-World Tasks: Producing Summaries of Scientific Articles Tailored to the Citation Context; Semantic Document Processing Using Wikipedia as a Knowledge Base.
Ad Hoc Track: Overview of the INEX 2009 Ad Hoc Track; Analysis of the INEX 2009 Ad Hoc Track Results; ENSM-SE at INEX 2009: Scoring with Proximity and Semantic Tag Information; LIP6 at INEX'09: OWPC for Ad Hoc Track; A Methodology for Producing Improved Focused Elements; ListBM: A Learning-to-Rank Method for XML Keyword Search; UJM at INEX 2009 Ad Hoc Track; Language Models for XML Element Retrieval; Use of Language Model, Phrases and Wikipedia Forward Links for INEX 2009; Parameter Tuning in Pivoted Normalization for XML Retrieval: ISI@INEX09 Adhoc Focused Task; Combining Language Models with NLP and Interactive Query Expansion; Exploiting Semantic Tags in XML Retrieval.
Book Track: Overview of the INEX 2009 Book Track; XRCE Participation to the 2009 Book Structure Task; The Book Structure Extraction Competition with the Resurgence Software at Caen University; Ranking and Fusion Approaches for XML Book Retrieval; OUC's Participation in the 2009 INEX Book Track.
Efficiency Track: Overview of the INEX 2009 Efficiency Track; Index Tuning for Efficient Proximity-Enhanced Query Processing; TopX 2.0 at the INEX 2009 Ad-Hoc and Efficiency Tracks; Fast and Effective Focused Retrieval; Achieving High Precisions with Peer-to-Peer Is Possible!
Entity Ranking Track: Overview of the INEX 2009 Entity Ranking Track; Combining Term-Based and Category-Based Representations for Entity Search; Focused Search in Books and Wikipedia: Categories, Links and Relevance Feedback; A Recursive Approach to Entity Ranking and List Completion Using Entity Determining Terms, Qualifiers and Prominent n-Grams.
Interactive Track: Overview of the INEX 2009 Interactive Track.
Link the Wiki Track: Overview of the INEX 2009 Link the Wiki Track; An Exploration of Learning to Link with Wikipedia: Features, Methods and Training Collection; University of Waterloo at INEX 2009: Ad Hoc, Book, Entity Ranking, and Link-the-Wiki Tracks; A Machine Learning Approach to Link Prediction for Interlinked Documents.
Question Answering Track: Overview of the 2009 QA Track: Towards a Common Task for QA, Focused IR and Automatic Summarization Systems.
XML Mining Track: Overview of the INEX 2009 XML Mining Track: Clustering and Classification of XML Documents; Exploiting Index Pruning Methods for Clustering XML Collections; Multi-label Wikipedia Classification with Textual and Link Features; Link-Based Text Classification Using Bayesian Networks; Clustering with Random Indexing K-tree and XML Structure; Utilising Semantic Tags in XML Clustering; UJM at INEX 2009 XML Mining Track; BUAP: Performance of K-Star at the INEX'09 Clustering Task; Extended VSM for XML Document Classification Using Frequent Subtrees; Supervised Encoding of Graph-of-Graphs for Classification and Regression Problems.

Proceedings Article
01 May 2010
TL;DR: A GATE resource called the OwlExporter is developed that allows existing NLP analysis pipelines to be easily mapped to OWL ontologies, thereby allowing language engineers to create ontology population systems without requiring extensive knowledge of ontology APIs.
Abstract: Ontology population from text is becoming increasingly important for NLP applications. Ontologies in OWL format provide for a standardized means of modeling, querying, and reasoning over large knowledge bases. Populated from natural language texts, they offer significant advantages over traditional export formats, such as plain XML. The development of text analysis systems has been greatly facilitated by modern NLP frameworks, such as the General Architecture for Text Engineering (GATE). However, ontology population is not currently supported by a standard component. We developed a GATE resource called the OwlExporter that allows existing NLP analysis pipelines to be easily mapped to OWL ontologies, thereby allowing language engineers to create ontology population systems without requiring extensive knowledge of ontology APIs. A particular feature of our approach is the concurrent population and linking of a domain- and NLP-ontology, including NLP-specific features such as safe reasoning over coreference chains.
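
As a rough analogue of the ontology-population step in Python (using rdflib rather than GATE's Java API), the sketch below turns (text, class) annotation pairs into OWL individuals; it mirrors the idea of the OwlExporter, not its implementation. The namespace URI is a placeholder.

```python
# Populate an OWL ontology from NLP annotations using rdflib.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/nlp-ontology#")  # hypothetical ontology

def populate(annotations):
    g = Graph()
    g.bind("ex", EX)
    for i, (text, cls) in enumerate(annotations):
        ind = EX[f"entity_{i}"]
        g.add((ind, RDF.type, OWL.NamedIndividual))
        g.add((ind, RDF.type, EX[cls]))        # domain class from the annotation
        g.add((ind, RDFS.label, Literal(text)))
    return g

g = populate([("Ottawa", "City"), ("GATE", "Software")])
print(g.serialize(format="xml"))               # OWL as RDF/XML
```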

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size (with respect to the minimal transducer) from which the translation can be inferred.
Abstract: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size (with respect to the minimal transducer) from which the translation can be inferred. Until now, similar results were known only for string transducers and for simple relabeling tree transducers. Learning of deterministic top-down tree transducers (dtops) is far more involved because a dtop can copy, delete, and permute its input subtrees. Thus, complex dependencies of labeled input to output paths need to be maintained by the algorithm. First, a Myhill-Nerode theorem is presented for dtops, which is interesting in its own right. This theorem is then used to construct a learning algorithm for dtops. Finally, it is shown how our result can be applied to XML transformations (e.g., XSLT programs). For this, a new DTD-based encoding of unranked trees by ranked ones is presented. Over such encodings, dtops can realize many practically interesting XML transformations which cannot be realized on first-child/next-sibling encodings.
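
To make the machine model concrete, here is a toy deterministic top-down tree transducer in Python: rules are keyed by (state, input symbol), and right-hand sides may copy, delete, or permute input subtrees. This illustrates dtops themselves, not the learning algorithm.

```python
# Toy dtop: trees are (label, children) tuples; rules map (state, symbol)
# to an output template whose ("call", state, i) leaves recurse on child i.
def run(rules, state, tree):
    label, children = tree
    rhs = rules[(state, label)]                  # deterministic rule lookup

    def build(t):
        if isinstance(t, str):                   # bare output leaf symbol
            return (t, [])
        if t[0] == "call":                       # recurse: ("call", state', i)
            _, q, i = t
            return run(rules, q, children[i])
        sym, subs = t                            # output node with subtemplates
        return (sym, [build(s) for s in subs])

    return build(rhs)

# Swap the two children of every "pair" node (a permuting dtop).
rules = {
    ("q", "pair"): ("pair", [("call", "q", 1), ("call", "q", 0)]),
    ("q", "a"): "a",
    ("q", "b"): "b",
}
t = ("pair", [("a", []), ("pair", [("a", []), ("b", [])])])
print(run(rules, "q", t))
# ('pair', [('pair', [('b', []), ('a', [])]), ('a', [])])
```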

Patent
08 Nov 2010
TL;DR: In this article, a unified web-based voice messaging system provides voice application control between a web browser and an application server via a hypertext transport protocol (HTTP) connection on an Internet Protocol (IP) network.
Abstract: A unified web-based voice messaging system provides voice application control between a web browser and an application server via a hypertext transport protocol (HTTP) connection on an Internet Protocol (IP) network. The application server, configured for executing a voice application defined by XML documents, selects an XML document for execution of a corresponding voice application operation based on a determined presence of a user-specific XML document that specifies the corresponding voice application operation. The application server, upon receiving a voice application operation request from a browser serving a user, determines whether a personalized, user-specific XML document exists for the user and for the corresponding voice application operation. If the application server determines the presence of the personalized XML document for a user-specific execution of the corresponding voice application operation, the application server dynamically generates a personalized HTML page having media content and control tags for personalized execution of the voice application operation; however, if the application server determines an absence of the personalized XML document for the user-specific execution of the corresponding voice application operation, the application server dynamically generates a generic HTML page for generic execution of the voice application operation. Hence, a user can personalize any number of voice application operations, enabling a web-based voice application to be completely customized or merely partially customized.

Journal ArticleDOI
TL;DR: Experiments with the Amazon E-Commerce Service demonstrate the advantages of using a model-based approach for the runtime testing and monitoring of Web applications.
Abstract: Asynchronous JavaScript and XML (Ajax) is a collection of technologies used to develop rich and interactive Web applications. A typical Ajax client runs locally in the user's Web browser and refreshes its interface on the fly in response to user input. Applying our model-based testing method to the Amazon Web Services E-Commerce Service (AWS-ECS) let us automatically generate test sequences and detect two deviations of the service implementation with respect to the online documentation provided, in less than three minutes of testing. We also provide a framework that allows the runtime monitoring of both client and server contract constraints with minimal modification to existing Ajax application code. Experiments with the Amazon E-Commerce Service demonstrate the advantages of using a model-based approach for the runtime testing and monitoring of Web applications.

Patent
15 Sep 2010
TL;DR: In this article, the authors propose a method for finding hot spots in the massive amount of information on the Internet: a network server fetches a series of RSS seeds and parses them as XML, and the HTML of web pages is fetched by web crawler technology and processed with an information extraction algorithm.
Abstract: The invention provides a method for finding hot spots in the massive amount of information on the Internet. A network server fetches a series of RSS seeds and parses them as XML, and the HTML of web pages is fetched by web crawler technology and processed with an information extraction algorithm to obtain the structured field information of the pages. An update frequency is set for each website, and a hot spot score for each page is computed from parameters such as the website's own update rate, authority indexes, the position of the page among outgoing links, publication time and click count; the pages are then ranked and recommended. The invention helps users find hot spots in the massive amount of information on the Internet, significantly improving the efficiency with which users find information of interest.
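
The patent names the signals (site update rate, authority, link position, publication time, clicks) but not an exact formula, so the weights and time decay in this Python sketch are invented for illustration.

```python
# Hypothetical hot spot score combining the signals named in the patent.
import math
import time

def hotspot_score(update_rate, authority, link_position, publish_ts,
                  clicks, now=None):
    now = now or time.time()
    age_hours = max((now - publish_ts) / 3600.0, 0.0)
    freshness = math.exp(-age_hours / 24.0)        # decay over roughly a day
    position_boost = 1.0 / (1.0 + link_position)   # earlier links score higher
    return (0.3 * update_rate + 0.3 * authority
            + 0.2 * position_boost + 0.2 * math.log1p(clicks)) * freshness

# A page published 6 hours ago on an authoritative, frequently updated site:
print(hotspot_score(0.8, 0.9, 2, time.time() - 6 * 3600, 1500))
```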

Proceedings ArticleDOI
05 Aug 2010
TL;DR: The results indicate that the cosine similarity measure is superior to the other measures tested, such as the Jaccard and Euclidean measures, and is particularly well suited to text documents.
Abstract: This paper presents the results of an experimental study of some similarity measures used for both information retrieval and document clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures, and is particularly well suited to text documents. Previously, these measures were compared on conventional text datasets; the proposed system instead collects its datasets with the help of an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures on these web document datasets.
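
For reference, the three measures compared, computed on simple term-frequency vectors in Python; cosine normalizes for document length, which is one reason it tends to do better on text.

```python
# Cosine, Jaccard, and Euclidean measures over term-frequency vectors.
import math
from collections import Counter

def tf(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())

def euclidean(a, b):
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))

d1, d2 = tf("xml keyword search"), tf("keyword search in xml databases")
print(cosine(d1, d2), jaccard(d1, d2), euclidean(d1, d2))
```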