
Showing papers on "Metadata repository published in 2015"


Journal ArticleDOI
TL;DR: The overall workflow architecture of CERMINE is outlined, details of the individual steps' implementations are provided, and an evaluation of the extraction workflow carried out on a large dataset showed good performance for most metadata types.
Abstract: CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm, and facilitates future expansion of the architecture. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. The evaluation of the extraction workflow, carried out on a large dataset, showed good performance for most metadata types, with an average F score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementations of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.

164 citations


Journal ArticleDOI
TL;DR: The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse.

90 citations


Patent
31 Mar 2015
TL;DR: Metadata management convergence platforms, systems, and methods are proposed to organize a community of users' data records and to manage metadata records related to content housed in unique, disparate, or federated holdings in centralized or distributed environments, including vehicle fleet information systems, government document holdings, insurance and underwriting information holdings, academic library collections, and entertainment archives.
Abstract: Metadata management convergence platforms, systems, and methods to organize a community of users' data records. More specifically, methods managing metadata records related to content housed in unique, disparate or federated holdings in centralized or distributed environments. Also systems and methods for creating and managing metadata records using domain specific language, vocabulary and metadata schema accepted by a community of users of unique, disparate or federated databases in centralized or distributed environments. Such environments can include content repositories including but not limited to: vehicle fleet information systems; government document holdings; insurance and underwriting information holdings; academic library collections; and entertainment archives.

84 citations


Proceedings ArticleDOI
26 Aug 2015
TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing, and querying personal data; it allows third-party plugins so that unstructured data can be analyzed and queried.
Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.

80 citations


Patent
05 Feb 2015
TL;DR: In this article, a secondary indexing technique cooperates with primary indices of an indexing arrangement to enable efficient storage and access of metadata used to retrieve packets persistently stored in data files of a data repository.
Abstract: A secondary indexing technique cooperates with primary indices of an indexing arrangement to enable efficient storage and access of metadata used to retrieve packets persistently stored in data files of a data repository. Efficient storage and access of the metadata used to retrieve the persistently stored packets may be based on a target value of the packets over a search time window. The metadata is illustratively organized as a metadata repository of primary index files that store the primary indices containing hash values of network flows of the packets, as well as offsets and paths to those packets stored in the data files. The technique includes one or more secondary indices having a plurality of present bits arranged in a binary format (i.e., a bit array) to indicate the presence of the target value in one or more packets stored in the data files over the search time window. Notably, the present bits may be used to reduce (i.e., “prune”) a relatively large search space of the stored packets (e.g., defined by the hash values) to a pruned search space of only those data files in which packets having the target value are stored.
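The present-bit pruning idea in this abstract can be sketched in a few lines. This is a hypothetical illustration, not the patented implementation: the class and method names are invented, and a real system would persist the bit array alongside the primary index files rather than keep it in memory.

```python
# Hypothetical sketch: a per-data-file "present bit" marks whether a target
# value appears in any packet stored in that file, letting a search skip
# ("prune") files whose bit is unset. All names are illustrative.

class SecondaryIndex:
    def __init__(self, num_files):
        self.present = [False] * num_files  # one present bit per data file

    def record(self, file_id, values, target):
        # Set the present bit when the target value is seen in this file.
        if target in values:
            self.present[file_id] = True

    def pruned_files(self):
        # Only files whose bit is set need to be searched.
        return [i for i, bit in enumerate(self.present) if bit]

idx = SecondaryIndex(num_files=4)
idx.record(0, {"10.0.0.1", "10.0.0.7"}, target="10.0.0.7")
idx.record(1, {"192.168.1.2"}, target="10.0.0.7")
idx.record(3, {"10.0.0.7"}, target="10.0.0.7")
# The search space shrinks from 4 data files to the 2 whose bits are set.
```
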

77 citations


Patent
23 Feb 2015
TL;DR: In this paper, the authors propose a context generator that utilizes contextual metadata to identify relationships between data and enable the proactive presentation of data relevant to a user's current context, and a unified activity feed that comprises correlated data groupings identified by correlation engines.
Abstract: A unified experience environment supports mechanisms that collect and utilize contextual metadata to associate information in accordance with its relevance to a user's current context. An ambient data collector obtains contextual and activity information coincident with a user's creation, editing, or consumption of data and associates it with such data as contextual metadata. A context generator utilizes contextual metadata to identify relationships between data and enables the proactive presentation of data relevant to a user's current context. Proactive presentation includes a context panel that is alternatively displayable and hideable in an application-independent manner and a unified activity feed that comprises correlated data groupings identified by correlation engines, including a universal, cross-application correlation engine and individual, application-specific correlation engines that exchange information through data correlation interfaces. The context panel and unified activity feed enable users to access data more efficiently and increase their interaction performance with a computing device.

76 citations


Patent
14 Apr 2015
TL;DR: A computer-implemented method and a secure relational file system (SRFS) for storing and managing data for backup and restore are presented; the SRFS generates first metadata including file-to-sector mapping information, splits the data into fixed sized data chunks (FSDCs), generates second metadata including the logical boundaries used for splitting, creates variable sized data blocks (VSDBs), and stores unique variable sized data chunks (UVSDCs) in chunk files.
Abstract: A computer implemented method and a secure relational file system (SRFS) for storing and managing data for backup and restore are provided. The SRFS receives data, generates first metadata including file-to-sector mapping information, splits the data into fixed sized data chunks (FSDCs), generates second metadata including logical boundaries used for splitting, creates fixed sized data blocks by prepending the second metadata to the FSDCs, splits each FSDC into variable sized data chunks (VSDCs), generates third metadata including unique identifiers (UIDs) for the VSDCs, creates variable sized data blocks (VSDBs) by prepending the third metadata and the second metadata to each VSDC, identifies unique variable sized data chunks (UVSDCs) of the VSDBs using the UIDs, stores the UVSDCs in chunk files, and stores the first metadata, the second metadata extracted from the VSDBs, and storage locations of the UVSDCs with the third metadata of the UVSDCs and duplicate VSDCs in databases.
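The deduplication step described above, identifying unique variable sized data chunks by their unique identifiers, can be illustrated with a toy sketch. This is a simplification under stated assumptions, not the SRFS implementation: it assumes a SHA-256 digest serves as the unique identifier and uses a plain dict as the chunk store.

```python
# Illustrative sketch of chunk deduplication (not the patented SRFS):
# each variable-sized chunk gets a unique identifier (here a SHA-256
# digest), and only chunks with previously unseen identifiers are stored.

import hashlib

def dedup_chunks(chunks, store):
    """Store only unique chunks; return the ordered list of chunk IDs."""
    ids = []
    for chunk in chunks:
        uid = hashlib.sha256(chunk).hexdigest()  # stand-in unique identifier
        if uid not in store:
            store[uid] = chunk  # persist only unique chunks
        ids.append(uid)  # IDs alone suffice to reconstruct the stream
    return ids

store = {}
ids = dedup_chunks([b"alpha", b"beta", b"alpha"], store)
# Three chunks in, two unique chunks stored, three IDs kept for restore.
```
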

59 citations


Patent
31 Mar 2015
TL;DR: In this paper, a client request, formatted in accordance with a file system interface, is received at an access subsystem of a distributed multi-tenant storage service, and an atomic metadata operation comprising a group of file system metadata modifications is initiated.
Abstract: A client request, formatted in accordance with a file system interface, is received at an access subsystem of a distributed multi-tenant storage service. After the request is authenticated at the access subsystem, an atomic metadata operation comprising a group of file system metadata modifications is initiated, including a first metadata modification at a first node of a metadata subsystem of the storage service and a second metadata modification at a second node of the metadata subsystem. A plurality of replicas of at least one data modification corresponding to the request are saved at respective storage nodes of the service.

56 citations


Patent
13 Apr 2015
TL;DR: In this article, an initial backup of a volume is created at a backup server, where creating the initial backup includes retrieving an original metadata file from a metadata server and retrieving a copy of all data of the volume based on the original metadata file.
Abstract: Disclosed are systems, computer-readable mediums, and methods for incremental block level backup. An initial backup of a volume is created at a backup server, where creating the initial backup includes retrieving an original metadata file from a metadata server, and retrieving a copy of all data of the volume based on the original metadata file. A first incremental backup of the volume is then created at the backup server, where creating the first incremental backup includes retrieving a first metadata file, where the first metadata file was created separately from the original metadata file. A block identifier of the first metadata file is compared to a corresponding block identifier of the original metadata file to determine a difference between the first and original block identifiers, and a copy of a changed data block of the volume is retrieved based on the comparison of the first and original block identifiers.
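The block-identifier comparison at the heart of this incremental backup scheme can be sketched as follows. All names are hypothetical, and the real metadata files would be structured on-disk formats rather than in-memory dicts.

```python
# Minimal sketch of incremental-backup block diffing: each metadata file
# maps block numbers to block identifiers; blocks whose identifier changed
# (or is new) since the original backup must be copied again.

def changed_blocks(original_meta, current_meta):
    """Return block numbers whose identifiers differ from the original backup."""
    return sorted(
        block
        for block, ident in current_meta.items()
        if original_meta.get(block) != ident
    )

original = {0: "id-a", 1: "id-b", 2: "id-c"}
current  = {0: "id-a", 1: "id-x", 2: "id-c", 3: "id-d"}
# Only block 1 (modified) and block 3 (new) need copying incrementally.
```
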

48 citations


Patent
16 Jul 2015
TL;DR: In this paper, a system that generates a visualization user interface is presented; the system parses a visualization template for metadata and replaces the metadata with a binding between a visualization component and the data source.
Abstract: A system that generates a visualization user interface. The system receives a selection of a data source, and receives a selection of a visualization template that includes metadata. The system further receives a selection of data attributes corresponding to the data source. The system parses the visualization template for the metadata, and replaces the metadata with a binding between a visualization component and the data source. The system then generates the visualization user interface using the visualization component.
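The parse-and-bind step can be illustrated with a minimal sketch. The template placeholder syntax, function names, and data here are invented for illustration; the patent does not specify a concrete template format.

```python
# Hedged sketch of parsing a visualization template for metadata
# placeholders and replacing them with bindings to the selected data
# source and attributes. The {{metadata:...}} syntax is an assumption.

import re

def bind_template(template, data_source, attributes):
    """Replace {{metadata:...}} placeholders with concrete bindings."""
    bindings = {
        "source": data_source,
        "fields": ",".join(attributes),
    }
    return re.sub(
        r"\{\{metadata:(\w+)\}\}",
        lambda m: bindings[m.group(1)],
        template,
    )

template = '<chart source="{{metadata:source}}" fields="{{metadata:fields}}"/>'
ui = bind_template(template, "sales_db", ["region", "revenue"])
```
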

48 citations


Journal ArticleDOI
TL;DR: A natural language processing method, namely Labeled Latent Dirichlet Allocation (LLDA), is employed, and a regression model is trained via a human-participants experiment, to address the topic heterogeneity introduced by multiple metadata standards and the lack of established semantic search in Linked-Data-driven geoportals.
Abstract: Geoportals provide integrated access to geospatial resources, and enable both authorities and the general public to contribute and share data and services. An essential goal of geoportals is to facilitate the discovery of the available resources. This process relies heavily on the quality of metadata. While multiple metadata standards have been established, data contributors may adopt different standards when sharing their data via the same geoportal. This is especially the case for user-generated content, where various terms and topics can be introduced to describe similar datasets. While this heterogeneity provides a wealth of perspectives, it also complicates resource discovery. With the fast development of Semantic Web technologies, Linked-Data-driven portals are on the rise. Although these novel portals open up new ways of organizing metadata and retrieving resources, they lack effective semantic search methods. This paper addresses the two challenges discussed above, namely the topic heterogeneity brought by multiple metadata standards and the lack of established semantic search in Linked-Data-driven geoportals. To harmonize the metadata topics, we employ a natural language processing method, namely Labeled Latent Dirichlet Allocation (LLDA), and train it using standardized metadata from Data.gov. With respect to semantic search, we construct thematic and geographic matching features from the textual metadata descriptions, and train a regression model via a human-participants experiment. We evaluate our methods by examining their performance in addressing the two issues. Finally, we implement a semantics-enabled and Linked-Data-driven prototypical geoportal using a sample dataset from Esri's ArcGIS Online.


Proceedings ArticleDOI
15 Nov 2015
TL;DR: A programmable storage system that lets the designer inject custom balancing logic is introduced; the flexibility and transparency of this approach are shown by replicating the strategy of a state-of-the-art metadata balancer, and that strategy is then compared to other custom balancers on the same system.
Abstract: Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer and conclude by comparing this strategy to other custom balancers on the same system.

Patent
16 May 2015
TL;DR: In this article, a data warehouse for heterogeneous imagery data of all varieties, from any configured sources, is maintained to a Data Warehouse for expedient access and convenient search processing.
Abstract: Heterogeneous imagery data of all varieties, from any configured sources, is maintained to a data warehouse for expedient access and convenient search processing. Imagery content maintained is processed for deriving associated search schema including multiple types of metadata, cross reference information for conclusively associating metadata, and diagnostics information for associating metadata with potential correlation. Collection processing governs contents of the warehouse, and is fully configurable to adapt to small customized installations as well as meeting scale requirements of a world population. Client processing provides a variety of useful searches, many options for processing imagery objects, and enables clients to contribute to objects collected for enhancing a collaborative social experience for the benefit of all users.

Journal ArticleDOI
01 Dec 2015
TL;DR: The metadata schema was extensively revised based on the evaluation results, and the new element definitions from the revised schema are presented in this article.
Abstract: Despite increasing interest in and acknowledgment of the significance of video games, current descriptive practices are not sufficiently robust to support searching, browsing, and other access behaviors from diverse user groups. To address this issue, the Game Metadata Research Group at the University of Washington Information School, in collaboration with the Seattle Interactive Media Museum, worked to create a standardized metadata schema. This metadata schema was empirically evaluated using multiple approaches: collaborative review, schema testing, semi-structured user interviews, and a large-scale survey. Reviewing and testing the schema revealed issues and challenges in sourcing the metadata for particular elements, determining the level of granularity for data description, and describing digitally distributed games. The findings from user studies suggest that users value various subject and visual metadata, information about how games are related to each other, and data regarding game expansions/alterations such as additional content and networked features. The metadata schema was extensively revised based on the evaluation results, and we present the new element definitions from the revised schema in this article. This work will serve as a platform and catalyst for advances in the design and use of video game metadata.

Journal ArticleDOI
TL;DR: This paper studies uncertain features in the generation and application of metadata; two types of uncertainty (incomplete and imprecise) are described using a semantic quantitative measurement method, semantic relationship quantitative measurement based on possibilistic logic and probability statistics (SRQ-PP).
Abstract: Metadata are the information about and description of data. In Digital Earth, metadata become variant and heterogeneous, with many uncertainties. This paper studies uncertain features in the generation and application of metadata, and two types of uncertainty (incomplete and imprecise) are described using a semantic quantitative measurement method, semantic relationship quantitative measurement based on possibilistic logic and probability statistics (SRQ-PP). Moreover, in the case study, we apply two types of quantitative measurements based on SRQ-PP to describe incomplete (uncertain) knowledge and imprecise (vague) information separately in spatial data service retrieval, which in turn helps identify additional potential data resources and provides a quantitative analysis of the results.

Patent
28 Jul 2015
TL;DR: In this paper, a computing device operates to determine one or more filters for a set of metadata, and metadata from the set is selected based on the one or more filters.
Abstract: A computing device operates to receive, from at least a first peer device, a set of metadata that includes one or more identifiers to media playback resources. The computing device operates to determine one or more filters for the set of metadata. Metadata from the set is selected based on the one or more filters. A search request is provided to a network service for a media playback resource based on the selected metadata.

Patent
22 Dec 2015
TL;DR: In this article, the volume layer of a storage I/O stack executing on one or more nodes of a cluster is represented as mappings from addresses, i.e., logical block addresses (LBAs), of a logical unit (LUN) accessible by a host to durable extent keys maintained by an extent store layer.
Abstract: The embodiments described herein are directed to an organization of metadata managed by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. The metadata managed by the volume layer, i.e., the volume metadata, is illustratively embodied as mappings from addresses, i.e., logical block addresses (LBAs), of a logical unit (LUN) accessible by a host to durable extent keys maintained by an extent store layer of the storage I/O stack. In an embodiment, the volume layer organizes the volume metadata as a mapping data structure, i.e., a dense tree metadata structure, which represents successive points in time to enable efficient access to the metadata.
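The volume metadata described above maps logical block addresses (LBAs) of a LUN to extent keys. As an illustrative simplification only, a sorted list with binary search can stand in for the patent's dense tree metadata structure; all names below are invented.

```python
# Sketch of LBA -> extent-key volume metadata (not the patented dense tree):
# entries are (start_lba, length, extent_key) ranges kept sorted by start
# address, and lookup binary-searches for the covering range.

import bisect

class VolumeMap:
    def __init__(self):
        self.entries = []  # sorted list of (start_lba, length, extent_key)

    def insert(self, start, length, key):
        bisect.insort(self.entries, (start, length, key))

    def lookup(self, lba):
        # Find the rightmost entry starting at or before lba.
        i = bisect.bisect_right(self.entries, (lba, float("inf"), "")) - 1
        if i >= 0:
            start, length, key = self.entries[i]
            if start <= lba < start + length:
                return key
        return None  # LBA not mapped to any extent

vm = VolumeMap()
vm.insert(0, 8, "extent-A")   # LBAs 0..7
vm.insert(16, 4, "extent-B")  # LBAs 16..19
```
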

Journal ArticleDOI
TL;DR: This work proposes a methodology that learns the annotation from well-annotated collections of metadata records to automatically annotate poorly annotated ones, and presents two variants of an algorithm for automatic tag recommendation.
Abstract: Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have not only increased in amount, but also in diversity and heterogeneity of the data sources that spread throughout the world. This heterogeneity poses a huge challenge for scientists who have to manually search for desired data. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories, and makes them searchable. However, harvested metadata records sometimes are poorly annotated or lacking meaningful keywords, which could impede effective retrieval. We propose a methodology that learns the annotation from well-annotated collections of metadata records to automatically annotate poorly annotated ones. The problem is first transformed into the tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. The experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the natures of different datasets. We also discuss relevant topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
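The tag-recommendation framing can be illustrated with a deliberately naive sketch: borrow tags from the most textually similar well-annotated record. This is not the paper's algorithm (the paper presents two dedicated variants); the similarity measure, names, and data here are invented.

```python
# Toy sketch of tag recommendation for poorly annotated metadata records:
# recommend the tags of the most similar well-annotated record, with
# similarity measured by Jaccard overlap of word sets.

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_tags(record_text, annotated):
    """annotated: list of (text, tags) pairs drawn from a controlled tag library."""
    best_text, best_tags = max(annotated, key=lambda p: jaccard(record_text, p[0]))
    return best_tags

library = [
    ("stream temperature sensor readings", ["hydrology", "temperature"]),
    ("bird population survey counts", ["ecology", "ornithology"]),
]
tags = recommend_tags("daily temperature readings from a stream gauge", library)
```
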

Journal ArticleDOI
TL;DR: The paper defines a metadata and data description format, called "Togo Metabolome Data" (TogoMD), with an ID system that is required for unique access to each level of the tree-structured metadata, such as study purpose, sample, analytical method, and data analysis.
Abstract: Metabolomics - technology for comprehensive detection of small molecules in an organism - lags behind the other "omics" in terms of publication and dissemination of experimental data. Among the reasons for this are difficulty precisely recording information about complicated analytical experiments (metadata), existence of various databases with their own metadata descriptions, and low reusability of the published data, resulting in submitters (the researchers who generate the data) being insufficiently motivated. To tackle these issues, we developed Metabolonote, a Semantic MediaWiki-based database designed specifically for managing metabolomic metadata. We also defined a metadata and data description format, called "Togo Metabolome Data" (TogoMD), with an ID system that is required for unique access to each level of the tree-structured metadata such as study purpose, sample, analytical method, and data analysis. Separation of the management of metadata from that of data and permission to attach related information to the metadata provide advantages for submitters, readers, and database developers. The metadata are enriched with information such as links to comparable data, thereby functioning as a hub of related data resources. They also enhance not only readers' understanding and use of data but also submitters' motivation to publish the data. The metadata are computationally shared among other systems via APIs, which facilitate the construction of novel databases by database developers. A permission system that allows publication of immature metadata and feedback from readers also helps submitters to improve their metadata. Hence, this aspect of Metabolonote, as a metadata preparation tool, is complementary to high-quality and persistent data repositories such as MetaboLights. A total of 808 metadata for analyzed data obtained from 35 biological species are published currently. 
Metabolonote and related tools are available free of cost at http://metabolonote.kazusa.or.jp/.

Book ChapterDOI
31 May 2015
TL;DR: A module of lemon named LIME (Linguistic Metadata) is developed; it extends VoID, which by itself is unable to represent the more specific metadata relevant to lemon, with a vocabulary of metadata about the ontology-lexicon interface.
Abstract: The OntoLex W3C Community Group has been working for more than three years on a shared lexicon model for ontologies, called lemon. The lemon model consists of a core model that is complemented by a number of modules accounting for specific aspects in the modeling of lexical information within ontologies. In many usage scenarios, the discovery and exploitation of linguistically grounded ontologies may benefit from summarizing information about their linguistic expressivity and lexical coverage by means of metadata. That situation is compounded by the fact that lemon allows the independent publication of ontologies, lexica and lexicalizations linking them. While the VoID vocabulary already addresses the need for general metadata about interlinked datasets, it is unable by itself to represent the more specific metadata relevant to lemon. To solve this problem, we developed a module of lemon, named LIME (Linguistic Metadata), which extends VoID with a vocabulary of metadata about the ontology-lexicon interface.

Patent
13 Feb 2015
TL;DR: In this article, the file system metadata associated with file system objects that have been created and/or modified since the last backup is used to generate metadata files for the incremental backup.
Abstract: Metadata generation for incremental backup is disclosed. A subset of blocks used to store file system metadata is identified in a set of blocks changed since a last backup. File system metadata stored in the subset of blocks is used to obtain file system metadata associated with file system objects that have been created and/or modified since the last backup. That metadata is then used to generate file system metadata files for the incremental backup.

Patent
24 Sep 2015
TL;DR: In this article, the authors describe a data enrichment system that enables declarative external data source importation and exportation, where a user can specify via a user interface input for identifying different data sources from which to obtain input data.
Abstract: Techniques are disclosed for a data enrichment system that enables declarative external data source importation and exportation. A user can specify, via user interface input, different data sources from which to obtain input data. The data enrichment system is configured to import and export various types of sources storing resources, such as URL-based resources and HDFS-based resources, for high-speed bi-directional metadata and data interchange. Connection metadata (e.g., credentials, access paths, etc.) can be managed by the data enrichment system in a declarative format for managing and visualizing the connection metadata.

Patent
27 Feb 2015
TL;DR: In this paper, the authors propose a platform for data management that leverages a metadata repository, which tracks and manages all aspects of the data lifecycle, including status information (load dates, quality exceptions, access rights, etc.), definitions (business meaning, technical formats, etc.).
Abstract: An analytical computing environment for large data sets comprises a software platform for data management. The platform provides various automation and self-service features to enable those users to rapidly provision and manage an agile analytics environment. The platform leverages a metadata repository, which tracks and manages all aspects of the data lifecycle. The repository maintains various types of platform metadata including, for example, status information (load dates, quality exceptions, access rights, etc.), definitions (business meaning, technical formats, etc.), lineage (data sources and processes creating a data set, etc.), and user data (user rights, access history, user comments, etc.). Within the platform, the metadata is integrated with all platform services, such as load processing, quality controls and system use. As the system is used, the metadata gets richer and more valuable, supporting additional automation and quality controls.

Book ChapterDOI
31 May 2015
TL;DR: The development of the META-SHARE ontology is presented; it transforms the metadata schema used by META-SHARE into an ontology in the Web Ontology Language (OWL) that can better handle the diversity of metadata found in legacy and crowd-sourced resources.
Abstract: META-SHARE is an infrastructure for sharing Language Resources (LRs) where significant effort has been made in providing carefully curated metadata about LRs. However, in the face of the flood of data that is used in computational linguistics, a manual approach cannot suffice. We present the development of the META-SHARE ontology, which transforms the metadata schema used by META-SHARE into an ontology in the Web Ontology Language (OWL) that can better handle the diversity of metadata found in legacy and crowd-sourced resources. We show how this model can interface with other more general purpose vocabularies for online datasets and licensing, and apply this model to the CLARIN VLO, a large source of legacy metadata about LRs. Furthermore, we demonstrate the usefulness of this approach in two public metadata portals for information about language resources.

Patent
09 Feb 2015
TL;DR: In this article, the authors present vulnerability assessment techniques for highlighting an organization's information technology (IT) infrastructure security vulnerabilities, where the application metadata includes unique software identifiers for each of a plurality of executable applications.
Abstract: Presented herein are vulnerability assessment techniques for highlighting an organization's information technology (IT) infrastructure security vulnerabilities. For example, a vulnerability assessment system obtains application metadata for each of a plurality of executable applications observed at one or more devices forming part of an organization's IT infrastructure. The application metadata includes unique software identifiers for each of the plurality of executable applications. The vulnerability assessment system obtains global security risk metadata for executable applications observed at the one or more devices. The vulnerability assessment system maps one or more unique software identifiers in the application metadata to global security risk metadata that corresponds to applications identified by the one or more unique software identifiers, thereby generating a vulnerable application dataset.

Patent
02 Apr 2015
TL;DR: In this article, the authors present a system and method tracking music or other audio metadata from a number of sources in real-time on an electronic device and displaying this information as a unified music feed using a graphical and textual interface.
Abstract: The present invention relates to a system and method tracking music or other audio metadata from a number of sources in real-time on an electronic device and displaying this information as a unified music feed using a graphical and textual interface. In one embodiment the invention provides a system and method for sharing such information within a social network or other conveyance system in order to aggregate crowd sourced, location-based and real-time information by combining the location, timestamp and metadata of user's listening history on such an electronic device.

Book ChapterDOI
31 May 2015
TL;DR: This work proposes a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles and applies several techniques in order to check the validity of the metadata provided and to generate descriptive and statistical information for a particular dataset or for an entire data portal.
Abstract: Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. In order to benefit from this mine of data, one needs access to descriptive information about each dataset, or metadata. This information can be used to delay data entropy, enhance dataset discovery, exploration and reuse as well as helping data portal administrators in detecting and eliminating spam. However, such metadata information is currently very limited to a few data portals where they are usually provided manually, thus being often incomplete and inconsistent in terms of quality. To address these issues, we propose a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. This approach applies several techniques in order to check the validity of the metadata provided and to generate descriptive and statistical information for a particular dataset or for an entire data portal.
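The validate-and-profile step can be pictured as checking a dataset's metadata record against a set of required fields and emitting a small descriptive profile. The sketch below is illustrative only; the required-field set and checks are assumptions, not the authors' actual rule set:

```python
# Minimal sketch: validate a dataset metadata record and produce a
# descriptive profile. REQUIRED_FIELDS is a hypothetical rule set.

REQUIRED_FIELDS = {"title", "description", "license", "triples"}

def profile_dataset(metadata):
    missing = sorted(REQUIRED_FIELDS - metadata.keys())
    return {
        "valid": not missing,
        "missing_fields": missing,
        "completeness": 1 - len(missing) / len(REQUIRED_FIELDS),
    }

record = {"title": "DBpedia", "license": "CC-BY", "triples": 3_000_000}
print(profile_dataset(record))
# {'valid': False, 'missing_fields': ['description'], 'completeness': 0.75}
```

Running the same routine over every record in a data portal's catalog yields portal-wide statistics such as the average completeness score, which is the kind of per-portal descriptive information the abstract refers to.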

Patent
22 Oct 2015
TL;DR: In this paper, the metadata is stored in a track in a self-describing structure and the metadata track may be decoded using an identifier reference table that is substantially smaller than typical fourCC identifier tables.
Abstract: Apparatus and methods for combining metadata with video into a video stream using a 32-bit aligned payload that is storage-efficient and human-discernible. The metadata is stored in a track in a self-describing structure. The metadata track may be decoded using an identifier reference table that is substantially smaller than typical fourCC identifier tables. The combined metadata/video stream is compatible with a standard video stream convention and may be played using conventional media player applications that read media files compliant with the MP4/MOV container format. The proposed format may enable decoding of metadata during streaming and partitioning of the combined video stream without loss of metadata. The proposed format and/or metadata protocol provides for temporal synchronization of metadata with video frames.
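The "32-bit aligned payload" idea is that each metadata sample is written as a fourCC key plus a length, with the payload padded out to the next 32-bit boundary so samples stay word-aligned. A sketch of such packing, assuming a hypothetical key/length layout (the patent's exact wire format is not given here):

```python
import struct

# Hypothetical sketch: pack one metadata sample as a fourCC key plus a
# 32-bit payload length, padding the payload to a 32-bit boundary.
# The exact field layout is an assumption, not the patented format.

def pack_sample(fourcc, payload):
    assert len(fourcc) == 4, "fourCC keys are exactly four characters"
    pad = (-len(payload)) % 4  # bytes needed to reach a 32-bit boundary
    header = struct.pack(">4sI", fourcc.encode("ascii"), len(payload))
    return header + payload + b"\x00" * pad

sample = pack_sample("GPS5", b"\x01\x02\x03")
print(len(sample), sample[:4])  # 12 bytes total, key b'GPS5'
```

Keeping every sample a multiple of four bytes lets a decoder skip unknown keys by reading the length field and jumping ahead, which is what makes the track self-describing and stream-decodable.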

Journal ArticleDOI
TL;DR: This paper proposes an approach for finding semantic associations which would not emerge without considering the structure of the data groups, based on the introduction of a new metadata model that extends the directed, labelled graph with the possibility of having nodes with a hierarchical structure.
Abstract: Most of the activities usually performed by Web users are today effectively supported by using appropriate metadata that make the Web practically readable by software agents operating as users' assistants. While the original use of metadata mostly focused on improving queries on Web knowledge bases, as in the case of SPARQL-based applications on RDF data, other approaches have been proposed to exploit the semantic information contained in metadata for performing more sophisticated knowledge discovery tasks. Finding semantic associations between Web data seems a promising framework in this context, since it allows novel, potentially interesting information to emerge from the Web's sea of data by deeply exploiting the semantic relationships represented by metadata. However, the approaches for finding semantic associations proposed in the past do not seem to consider how Web entities are logically collected into groups, which often have a complex hierarchical structure. In this paper, we focus on the importance of taking this additional information into account, and we propose an approach for finding semantic associations which would not emerge without considering the structure of the data groups. Our approach is based on the introduction of a new metadata model, an extension of the directed, labelled graph that allows nodes to have a hierarchical structure. To evaluate our approach, we have implemented it on top of an existing recommender system for Web users, experimentally analyzing the advantages introduced in terms of effectiveness of the recommendation activity.
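The key idea, a directed labelled graph extended with hierarchical node groups, can be approximated by searching over both the explicit edges and implicit membership links derived from a parent map. The sketch below is an illustration of that intuition under assumed data structures, not the authors' actual model or algorithm:

```python
from collections import deque

# Hypothetical sketch: find a semantic association (a path) in a
# directed labelled graph (triples) extended with a hierarchical
# grouping of nodes (parent map). Membership links are traversed in
# both directions so associations can emerge through shared groups.

def find_association(triples, parent, a, b):
    def neighbours(n):
        for src, _label, dst in triples:
            if src == n:
                yield dst
        if n in parent:                                       # climb into the group
            yield parent[n]
        yield from (c for c, p in parent.items() if p == n)  # descend to members

    seen, queue = {a}, deque([[a]])
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

triples = [("alice", "listened", "song1")]
parent = {"song1": "albumX", "song2": "albumX"}
print(find_association(triples, parent, "alice", "song2"))
# a path via the shared group: ['alice', 'song1', 'albumX', 'song2']
```

Without the parent map, "alice" and "song2" would be disconnected; the association surfaces only because both songs belong to the same group, which is exactly the class of associations the paper argues plain-graph approaches miss.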