
The limits of Web metadata, and beyond
Massimo Marchiori
The World Wide Web Consortium (W3C), MIT Laboratory for Computer Science,
545 Technology Square, Cambridge, MA 02139, U.S.A.
max@lcs.mit.edu
Abstract
The World Wide Web currently has a huge amount of data, with practically no classification information,
and this makes it extremely difficult to handle effectively. It has been realized recently that the only feasible
way to radically improve the situation is to add to Web objects a metadata classification, to help search
engines and Web-based digital libraries to properly classify and structure the information present in the
WWW. However, having a few standard metadata sets is insufficient in order to have a fully classified
World Wide Web. The first major problem is that it will take some time before a reasonable number of
people start using metadata to provide a better Web classification. The second major problem is that no one
can guarantee that a majority of the Web objects will be ever properly classified via metadata. In this paper,
we address the problem of how to cope with such intrinsic limits of Web metadata, proposing a method that
is able to partially solve the above two problems, and showing concrete evidence of its effectiveness. In
addition, we examine the important problem of what is the required "critical mass" in the World Wide Web
for metadata in order for it to be really useful.
Keywords
Metadata; Automatic classification; Information retrieval; Search engines; Filtering
1. Introduction
The World Wide Web currently has a huge amount of data, with practically no classification information. It is well
known that this makes it extremely difficult to effectively handle the enormous amount of information present in
the WWW, as witnessed by everybody's personal experience, and by every recent study on the difficulties of
information retrieval on the Web. It has been realized recently that the only feasible way to radically improve the
situation is to add to Web objects a metadata classification, that is to say partially passing the task of classifying
the content of Web objects from search engines and repositories to the users who are building and maintaining
such objects. Currently, there are lots of preliminary proposals on how to build suitable metadata sets that can be
effectively incorporated into HTML and that can help search engines and Web-based digital libraries to properly
classify and structure the information present in the WWW (see e.g. [2,3,4,5,6,7,9,11]).
However, metadata usage presents a big problem: having a few standard metadata sets does not suffice to obtain a fully classified World Wide Web. The first major problem is that a long time will pass before a reasonable number of people start using metadata to provide a better Web classification. The second major problem is that no one can guarantee that a majority of the Web objects will ever be properly classified via metadata, since by their nature metadata:
are an optional feature;
make the authoring of Web objects more burdensome;
cannot be imposed on authors (this is necessary to ensure HTML backward compatibility).
Thus, even in the most optimistic view, there will be a long transition period during which very few people will employ metadata to improve the informative content of their Web objects, and after such a period there will still be a significant part of the WWW that does not make proper use of the metadata facilities.
In this paper, we address the problem of how to cope with such intrinsic limits of Web metadata. The main idea is to start from those Web objects that use some form of metadata classification, and to extend this information throughout the whole Web, via a suitable propagation of the metadata information. The proposed propagation technique relies only on the connectivity structure of the Web, and is fully automatic. This means that:
1. it acts on top of any specific metadata format, and
2. it does not require any form of ad hoc semantic analysis (which can nevertheless be smoothly added on top of the propagation mechanisms), and is therefore completely language independent.
Another important feature is that the propagation is performed by fuzzifying the metadata attributes. This means that the obtained metadata information can go beyond the capabilities of the usual crisp metadata sets, providing more information about the strength of the various metadata attributes, and also resolving possible expressive deficiencies of the original metadata sets.
The method is applied to a variety of tests. Although, of course, they cannot provide a complete and thorough
evaluation of the method (for which a greater number of tests would be needed), we think they give a meaningful
indication of its potential, providing a reasonably good first study of the subject.
In the first general test, we show how the effectiveness of a metadata classification can be enhanced significantly.
We also study the relationship between the use of metadata on the Web (how many Web objects have been classified) and the global usefulness of the corresponding classification. This also enables us to answer questions such as what the "critical mass" for general WWW metadata is (that is to say, how much metadata is needed in an area of the Web in order for it to be really useful).
Then, we present another practical application of the method, this time not in conjunction with a static metadata
description, but with a dynamic metadata generator. We show how even a poorly performing automatic classifier
can be enhanced significantly using the propagation method.
Finally, we apply the method to an important practical case study, namely the case of PICS-compliant metadata, which allow parents to filter out Web pages that can be offensive or dangerous for kids. The method provides striking performance in this case, showing that it is already directly applicable, with success, in real-life situations.
2. Fuzzy metadata
As stated in the introduction, existing WWW metadata sets are crisp, in the sense that they only have attributes, without any capability of expressing "fuzziness": either they assign an attribute to an object or they do not. Instead, we fuzzify attributes, associating to each attribute a fuzzy measure of its relevance for the Web object, namely a number ranging from 0 to 1 (cf. [1]). An associated value of 0 means that the attribute is not pertinent at all to the Web object, while on the other hand a value of 1 means that the attribute is fully pertinent. All the intermediate values in the interval 0–1 can be used to give a measure of how much the attribute is relevant to the Web object. For instance, an attribute may be only roughly 20% relevant (in which case the value would be 0.2), or 50% relevant (value 0.5), and so on. This allows greater flexibility, since a metadata categorization is by its very nature an approximation of a large and complex variety of cases into a summarization and simplification of concepts. This also has the beneficial effect that one can keep the complexity of the basic categorization quite small, and need not artificially enlarge it in order to cope with the various intermediate categorizations (as is well known, a crucial factor in the success of categorizations, especially of those that aim at mass usage, is simplicity). Indeed, the number of basic concepts can be kept reasonably small, and fuzzification can then be employed to obtain a more complete and detailed variety of descriptions. This also allows us to cope with inherent deficiencies of already existing categorizations, in case they are not flexible enough.
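To make this concrete, here is a minimal sketch (in Python, with names of our own choosing, not part of any metadata standard) of how such fuzzified attributes could be represented and validated:

```python
# A fuzzified metadata record maps each attribute (category) to a relevance
# value in [0, 1]: 0 = not pertinent at all, 1 = fully pertinent,
# intermediate values express partial relevance.

def make_fuzzy_metadata(**attributes: float) -> dict[str, float]:
    """Build a fuzzy metadata record, checking that every value lies in [0, 1]."""
    for name, value in attributes.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"relevance of {name!r} must be in [0, 1], got {value}")
    return dict(attributes)

# A page that is roughly 20% about sports and 50% about travel:
page_metadata = make_fuzzy_metadata(Sports=0.2, Travel=0.5)  # {'Sports': 0.2, 'Travel': 0.5}
```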
3. Back-propagation

Suppose a certain Web object O has the associated metadatum A:v, indicating that the attribute A has fuzzy value v. If there is another Web object O' with a hyperlink to O, then we can "back-propagate" this metadata information from O to O'. The intuition is that the information contained in O (classified as A:v) is also reachable from O', since we can just activate the hyperlink. However, the situation is not the same as being already in O, since in order to reach the information we have to activate the hyperlink and then wait for O to be fetched. So, the relevance of O' with respect to the attribute A is not the same as that of O (namely v), but is in a sense faded, since the information in O is only potentially reachable from O', and not directly contained therein. The solution to this problem is to fade the value v of the attribute by multiplying it by a "fading factor" f (with 0 < f < 1). So, in the above example O' could be classified as A:v·f. The same reasoning is then applied recursively. So, if we have another Web object O'' with a hyperlink to O', we can back-propagate the obtained metadatum A:v·f in exactly the same way, obtaining that O'' has the corresponding metadatum A:v·f·f.
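A minimal sketch of this fading step, assuming the dictionary representation of fuzzy metadata used in the previous sketch (the function name and the fading value 0.7 are illustrative assumptions, not prescribed here):

```python
def back_propagate(metadata: dict, fading: float) -> dict:
    """Fade a fuzzy metadata record by one hyperlink hop: every value v
    becomes v * fading (with 0 < fading < 1)."""
    return {attribute: value * fading for attribute, value in metadata.items()}

# O is classified as A:0.8; O' links to O, and O'' links to O'.
o_meta  = {"A": 0.8}
o1_meta = back_propagate(o_meta,  fading=0.7)   # {'A': ~0.56}   = 0.8 * f
o2_meta = back_propagate(o1_meta, fading=0.7)   # {'A': ~0.392}  = 0.8 * f * f
```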
3.1. Combining information
We have seen how it is possible to back-propagate metadata information along the Web structure. There is still another issue to consider, that is to say how to combine different metadata values that pertain to the same Web object. This can happen because a Web object can have several hyperlinks pointing to many other Web objects, so that several pieces of metadata information may be back-propagated to it. It can also happen because the considered Web object already has some metadata information of its own.
The solution to these issues is not difficult. One case is easy: if the attributes are different, then we simply merge the information. For instance, if we calculate via back-propagation that a Web object has the metadatum A:v and also the metadatum B:w, then we can infer that the Web object has both metadata, i.e. it can be classified as "A:v, B:w". The other case is when the attributes are the same. For instance, we may calculate via back-propagation that a Web object has the attribute A:v (this metadatum back-propagated from a certain Web object) and A:w (this metadatum back-propagated from another Web object). The employed solution is to use fuzzy arithmetic and take the so-called fuzzy or of the values v and w, that is to say the maximum of the two. So, the object would be classified as A:max(v,w) (here, max denotes as usual the maximum operator). This operation stems directly from classic fuzzy logic, and its intuition is that if we have several different choices for the same value (in this case, v and w), then we keep the most informative choice (that is, we take the maximum value), and discard the other, less informative, values. Above we only discussed the case of two metadata, but it is straightforward to extend the discussion to the case of several metadata, by repeatedly combining them using the above operations until the final metadatum is obtained.
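The combination rule can be sketched as follows (again an illustrative sketch with hypothetical names): any number of attribute-to-value dictionaries are merged by taking, for each attribute, the fuzzy or, i.e. the maximum, of the available values.

```python
def fuzzy_or_merge(*records: dict) -> dict:
    """Combine several fuzzy metadata records into one.

    Attributes appearing in only one record are simply kept; attributes
    appearing in several records get the maximum (fuzzy or) of their values.
    """
    merged = {}
    for record in records:
        for attribute, value in record.items():
            merged[attribute] = max(merged.get(attribute, 0.0), value)
    return merged

# Two back-propagated records plus metadata already present in the object:
print(fuzzy_or_merge({"A": 0.4}, {"A": 0.7, "B": 0.3}, {"C": 1.0}))
# -> {'A': 0.7, 'B': 0.3, 'C': 1.0}
```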
As a technical aside, we remark that the above method can be seen as a "first-order" approximation of the back-propagation of the so-called hyper-information (see [8]); along the same lines, more refined "higher-order" back-propagation techniques can be built, at the expense of computation speed.
4. Testing
In order to measure the effectiveness of the approach, we first need to "restrict" the Web, in some sense, to a more manageable size, that is to say to perform our studies on a reasonably sized region of it. In the following we explain how we coped with this problem.
Suppose we have a region of Web space (i.e., a collection of Web objects), say S; the following process augments it by considering its most proximal neighbourhoods:
Consider all the hyperlinks present in the set of Web objects S: retrieve all the corresponding Web objects and add them to the set S.
We can repeat this process as many times as we want, until we reach a decently sized region of the Web S: we can impose as a stop condition that the number of Web objects in S must reach a certain number n.
As an aside, note that for the sake of simplicity we are overlooking some implementation issues that we had to face, for instance: checking that no duplications arise (that is to say, that the same Web object is not added to S more than once); checking that we have not reached a stale point (in some particular cases, the Web objects in S may only have hyperlinks to themselves, so that the augmentation process does not increase the size of S); and cutting the size of S exactly to the bound n (the augmentation process may in one step significantly exceed n).
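A toy sketch of this augmentation process might look as follows; the Web connectivity is passed in as a plain dictionary standing in for actual page fetching and link extraction, and the duplicate check, stale-point check and cut to size n appear in simplified form (all names are ours):

```python
import random

def grow_region(seed: str, links: dict[str, set[str]], n: int) -> set[str]:
    """Grow a Web region S around a seed object until it holds about n objects.

    `links` maps each URL to the set of URLs it points to (in a real run this
    information would come from fetching and parsing the pages).
    """
    region = {seed}
    frontier = {seed}
    while len(region) < n and frontier:
        # Follow all hyperlinks of the current frontier, skipping duplicates.
        new_objects = set().union(*(links.get(u, set()) for u in frontier)) - region
        if not new_objects:          # stale point: nothing new is reachable
            break
        region |= new_objects
        frontier = new_objects
    # The last step may overshoot n, so cut the region back to at most n objects.
    return set(random.sample(sorted(region), min(n, len(region))))
```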
So, in order to extract a region of the Web to work with, the following process is used:
1. A random Web object is chosen (to this aim, we have employed URouLette).
2. It is augmented repeatedly using the aforementioned technique, until a Web region of the desired size n is reached.
Once a region S of the Web has been selected this way, we randomly select a certain percentage p of its Web objects and manually classify them with metadata; then, we propagate the metadata using the back-propagation method.
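Putting the pieces together, the propagation over a selected region could be sketched as below. This is a self-contained simplification under our own assumptions: the fading-plus-merge step is iterated a fixed number of times over the link graph, a schedule that is not prescribed above, and all names are ours.

```python
def propagate_region(links: dict[str, set[str]],
                     seeds: dict[str, dict[str, float]],
                     fading: float,
                     iterations: int = 10) -> dict[str, dict[str, float]]:
    """Back-propagate fuzzy metadata over a whole Web region.

    `links[u]` is the set of objects u points to; `seeds` holds the manually
    classified objects.  At each iteration, every object combines (fuzzy or)
    its current metadata with the faded metadata of the objects it links to.
    """
    nodes = set(links) | set(seeds) | {t for ts in links.values() for t in ts}
    meta = {u: dict(seeds.get(u, {})) for u in nodes}
    for _ in range(iterations):
        new_meta = {}
        for u in nodes:
            combined = dict(meta[u])
            for target in links.get(u, set()):
                for attribute, value in meta[target].items():
                    faded = value * fading                       # fade by one hop
                    combined[attribute] = max(combined.get(attribute, 0.0), faded)
            new_meta[u] = combined
        meta = new_meta
    return meta

# Tiny example: O'' -> O' -> O, with only O manually classified as A:0.8.
links = {"O''": {"O'"}, "O'": {"O"}, "O": set()}
result = propagate_region(links, seeds={"O": {"A": 0.8}}, fading=0.7)
# result["O'"]["A"]  is ~0.56  (= 0.8 * f)
# result["O''"]["A"] is ~0.392 (= 0.8 * f * f)
```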
4.1. Experimentation
For our general experimentation we have employed as metadata classification the well-known Excite Ontology, also known as the "Channels Classification": it is a tree-like set of attributes (also known as categories), and is one of the best existing metadata sets for the general classification of World Wide Web objects.
We ran several tests to verify the effectiveness of the back-propagation method, and also to decide what was the best choice for the fading factor f. Our outcome was that good performance can be obtained by using the following rule of thumb:
f = 1 – p
Intuitively, this means that when the percentage of already classified Web objects is low (p very low), we need a correspondingly high value of f (f near one), because we need to back-propagate farther and must ensure that the propagated values do not fade too quickly. On the other hand, when the percentage of already classified Web objects becomes higher, we need to back-propagate less, and so the fading factor can be kept smaller (in the limit case p=1, the rule indeed gives f=0, that is to say we do not need to back-propagate at all).
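A quick numeric illustration of this trade-off (the function name is ours): with p=5% the rule gives f=0.95, so a metadatum still retains about 60% of its value ten links away, whereas with p=50% it fades to roughly 0.1% over the same distance.

```python
def fading_factor(p: float) -> float:
    """Rule-of-thumb fading factor: f = 1 - p, where p is the fraction of
    Web objects already carrying a manual metadata classification."""
    return 1.0 - p

# Survival of a metadatum of value 1.0 after ten hyperlink hops:
print(fading_factor(0.05) ** 10)  # ~0.599 with 5% of objects classified
print(fading_factor(0.50) ** 10)  # ~0.001 with 50% of objects classified
```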
Even if our own tests were quite positive, our evaluations could be somewhat dependent on our personal way of judging metadata. Moreover, besides confirmation of the effectiveness of the approach, we also needed larger-scale results analyzing in detail the situation with respect to the percentage p of already classified Web objects. Therefore, we performed a more challenging series of tests, this time involving an external population of 34 users.
The initial part of the tests was exactly as before: first, a random Web object is selected, and then the region is augmented until a Web region of size n is reached. Then the testing phase occurs: we randomly choose a certain number m of Web objects that have not been manually classified, and let the users judge how well the calculated classifications describe those Web objects. The "judgment" of a certain set of metadata with respect to a certain Web object consists of a number ranging from 0 to 100, with the intended meaning that 0 stands for completely wrong/useless metadata, while 100 stands for a careful and precise description of the considered Web object.
It is immediate to see that in the original situation, without back-propagation, the expected average judgment when p = n% is just n (for example, if p = 7%, then the expected average judgment is 7). So, we have to show how much back-propagation can improve on this original situation.
As said, for the tests we employed a population of 34 persons: they were only aware of the goals of the experiment (to judge different methods of generating metadata), but they were not aware of our particular method. Moreover, we tried to go further and limit any possible "psychological pressure", clearly stating that we wanted to measure the qualitative differences between different metadata sets, so that the absolute judgments were not a concern, while the major point was the differences between judgments.

The first round of tests was done by setting the size n of the Web region to 200, and the number m of Web objects that the users had to judge to 20 (which corresponds to 10% of the total considered Web objects). Various tests were performed, in order to study the effectiveness of the method as the parameter p varies. We ran the judgment test for each of the following percentages p of classified Web objects: p=2.5%, p=5%, p=7.5%, p=10%, p=15%, p=20%, p=30%, p=40%, p=50%, p=60%, p=70%, p=80% and p=90%. The final average results are shown in the following table:
Average judgments in the n=200 case.
p:         2.5%  5%    7.5%  10%   15%   20%   30%   40%   50%   60%   70%   80%   90%
judgment:  17.1  24.2  31.8  39.7  44.9  56.4  62.9  72.1  78.3  83.8  92.2  95.8  98.1
The data are graphically summarized in the following chart, where an interpolated spline is used to better
emphasize the dependency between the percentage p of already classified Web objects and the obtained judgment
j (here, the diagonal magenta dotted line indicates the average judgment in the original case without
back-propagation):
As we can see, there is an almost linear dependence roughly from p=20% up to p=70%, where the judgment starts its asymptotic deviation towards the upper bound. The behaviour is also rather clear in the region from p=5% to p=20%; however, since in practice the lower percentages of p are also the most important to study, we ran a second round of tests specifically devoted to this zone of p values. This time, the size of the Web region was five times bigger, setting n=1000, and the value of m was set to 20 (that is to say, 2% of the Web objects). Again, various tests were performed, each one with a different value of p. This time, the granularity was increased: while in the previous phase we had the values p=2.5%, p=5%, p=7.5%, p=10%, p=15% and p=20%, in this new phase we also ran the tests for the extra cases p=12.5% and p=17.5%, so as to perform an even more precise analysis of this important zone. The final average results are shown in the next table:

References
Fuzzy models—What are they, and why? [Editorial].
The quest for correct information on the Web: hyper search engines.
Miller, P. Metadata for the Masses.
Rating the Net.