
The limits of Web metadata, and beyond
Massimo Marchiori
The World Wide Web Consortium (W3C), MIT Laboratory for Computer Science,
545 Technology Square, Cambridge, MA 02139, U.S.A.
max@lcs.mit.edu
Abstract
The World Wide Web currently has a huge amount of data, with practically no classification information,
and this makes it extremely difficult to handle effectively. It has been realized recently that the only feasible
way to radically improve the situation is to add to Web objects a metadata classification, to help search
engines and Web-based digital libraries to properly classify and structure the information present in the
WWW. However, having a few standard metadata sets is insufficient in order to have a fully classified
World Wide Web. The first major problem is that it will take some time before a reasonable number of
people start using metadata to provide a better Web classification. The second major problem is that no one
can guarantee that a majority of the Web objects will be ever properly classified via metadata. In this paper,
we address the problem of how to cope with such intrinsic limits of Web metadata, proposing a method that
is able to partially solve the above two problems, and showing concrete evidence of its effectiveness. In
addition, we examine the important problem of what is the required "critical mass" in the World Wide Web
for metadata in order for it to be really useful.
Keywords
Metadata; Automatic classification; Information retrieval; Search engines; Filtering
1. Introduction
The World Wide Web currently has a huge amount of data, with practically no classification information. It is well
known that this makes it extremely difficult to effectively handle the enormous amount of information present in
the WWW, as witnessed by everybody's personal experience, and by every recent study on the difficulties of
information retrieval on the Web. It has been realized recently that the only feasible way to radically improve the
situation is to add to Web objects a metadata classification, that is to say partially passing the task of classifying
the content of Web objects from search engines and repositories to the users who are building and maintaining
such objects. Currently, there are lots of preliminary proposals on how to build suitable metadata sets that can be
effectively incorporated into HTML and that can help search engines and Web-based digital libraries to properly
classify and structure the information present in the WWW (see e.g. [2,3,4,5,6,7,9,11]).
However, metadata usage presents a big problem: having a few standard metadata sets does not suffice to obtain a fully classified World Wide Web. The first major problem is that a long time will pass before a reasonable number of people start using metadata to provide a better Web classification. The second major problem is that no one can guarantee that a majority of the Web objects will ever be properly classified via metadata, since by their nature metadata:
are an optional feature;
make the authoring of Web objects more burdensome;
cannot be imposed on authors (this is necessary to ensure HTML backward compatibility).
Thus, even in the most optimistic view, there will be a long transition period during which very few people will employ metadata to improve the informative content of their Web objects, and after such a period there will still be a significant part of the WWW that does not make proper use of the metadata facilities.
In this paper, we address the problem of how to cope with such intrinsic limits of Web metadata. The main idea is to start from those Web objects that use some form of metadata classification, and to extend this information throughout the whole Web, via a suitable propagation of the metadata information. The proposed propagation technique relies only on the connectivity structure of the Web, and is fully automatic. This means that:
1. it acts on top of any specific metadata format, and
2. it does not require any form of ad hoc semantic analysis (which can nevertheless be smoothly added on top of the propagation mechanisms), and is therefore completely language independent.
Another important feature is that the propagation is performed by fuzzifying the metadata attributes. This means that the obtained metadata information can go beyond the capabilities of the usual crisp metadata sets, providing more information about the strength of the various metadata attributes, and also resolving possible expressive deficiencies of the original metadata sets.
The method is applied to a variety of tests. Although, of course, they cannot provide a complete and thorough
evaluation of the method (for which a greater number of tests would be needed), we think they give a meaningful
indication of its potential, providing a reasonably good first study of the subject.
In the first general test, we show how the effectiveness of a metadata classification can be enhanced significantly.
We also study the relationship between the use of metadata on the Web (how many Web objects have been classified) and the global usefulness of the corresponding classification. This also enables us to answer questions such as what the "critical mass" for general WWW metadata is (that is to say, how much metadata is needed in an area of the Web in order for it to be really useful).
Then, we present another practical application of the method, this time not in conjunction with a static metadata
description, but with a dynamic metadata generator. We show how even a poorly performing automatic classifier
can be enhanced significantly using the propagation method.
Finally, we apply the method to an important practical case study, namely the case of PICS-compliant metadata, which allow parents to filter out Web pages that can be offensive or dangerous for kids. The method provides striking performance in this case, showing that it is already directly applicable, with success, in real-life situations.
2. Fuzzy metadata
As stated in the introduction, existing WWW metadata sets are crisp, in the sense that they only have attributes, without any capability of expressing "fuzziness": either they assign an attribute to an object or they do not. Instead, we fuzzify attributes, associating to each attribute a fuzzy measure of its relevance for the Web object, namely a number ranging from 0 to 1 (cf. [1]). An associated value of 0 means that the attribute is not pertinent at all to the Web object, while on the other hand a value of 1 means that the attribute is fully pertinent. All the intermediate values in the interval 0–1 can be used to give a measure of how much the attribute is relevant to the Web object. For instance, an attribute may be only roughly 20% relevant (in which case the value would be 0.2), or 50% relevant (value 0.5), and so on. This allows greater flexibility, since a metadata categorization is by its very nature an approximation of a large and complex variety of cases into a summarization and simplification of concepts. This also has the beneficial effect that one can keep the complexity of the basic categorization quite small, and need not artificially enlarge it in order to cope with the various intermediate categorizations (as is well known, a crucial factor in the success of categorizations, especially of those that aim at mass usage, is simplicity). Indeed, the number of basic concepts can be kept reasonably small, and fuzzification can then be employed to obtain a more complete and detailed variety of descriptions. This also allows us to cope with inherent deficiencies of already existing categorizations, in case they are not flexible enough.
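To make this concrete, here is a minimal sketch (in Python, with names of our own choosing, not part of any metadata standard) of how such fuzzified attributes could be represented and validated:

```python
# A fuzzified metadata record maps each attribute (category) to a relevance
# value in [0, 1]: 0 = not pertinent at all, 1 = fully pertinent,
# intermediate values express partial relevance.

def make_fuzzy_metadata(**attributes: float) -> dict[str, float]:
    """Build a fuzzy metadata record, checking that every value lies in [0, 1]."""
    for name, value in attributes.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"relevance of {name!r} must be in [0, 1], got {value}")
    return dict(attributes)

# A page that is roughly 20% about sports and 50% about travel:
page_metadata = make_fuzzy_metadata(Sports=0.2, Travel=0.5)  # {'Sports': 0.2, 'Travel': 0.5}
```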
3. Back-propagation

Suppose a certain Web object O has the associated metadatum A:v, indicating that the attribute A has fuzzy value v. If there is another Web object O' with a hyperlink to O, then we can "back-propagate" this metadata information from O to O'. The intuition is that the information contained in O (classified as A:v) is also reachable from O', since we can just activate the hyperlink. However, the situation is not the same as being already in O, since in order to reach the information we have to activate the hyperlink and then wait for O to be fetched. So, the relevance of O' with respect to the attribute A is not the same as that of O (namely v), but is in a sense faded, since the information in O is only potentially reachable from O', and not directly contained therein. The solution to this problem is to fade the value v of the attribute by multiplying it by a "fading factor" f (with 0 < f < 1). So, in the above example O' could be classified as A:v·f. The same reasoning is then applied recursively. So, if we have another Web object O'' with a hyperlink to O', we can back-propagate the obtained metadatum A:v·f in exactly the same way, obtaining that O'' has the corresponding metadatum A:v·f·f.
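A minimal sketch of this fading step, assuming the dictionary representation of fuzzy metadata used in the previous sketch (the function name and the fading value 0.7 are illustrative assumptions, not prescribed here):

```python
def back_propagate(metadata: dict, fading: float) -> dict:
    """Fade a fuzzy metadata record by one hyperlink hop: every value v
    becomes v * fading (with 0 < fading < 1)."""
    return {attribute: value * fading for attribute, value in metadata.items()}

# O is classified as A:0.8; O' links to O, and O'' links to O'.
o_meta  = {"A": 0.8}
o1_meta = back_propagate(o_meta,  fading=0.7)   # {'A': ~0.56}   = 0.8 * f
o2_meta = back_propagate(o1_meta, fading=0.7)   # {'A': ~0.392}  = 0.8 * f * f
```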
3.1. Combining information
We have seen how it is possible to back-propagate metadata information along the Web structure. There is still another issue to consider, that is to say how to combine different metadata values that pertain to the same Web object. This can happen because a Web object can have several hyperlinks pointing to many other Web objects, so that several pieces of metadata information may be back-propagated to it. It can also happen because the considered Web object already has some metadata information of its own.
The solution to these issues is not difficult. One case is easy: if the attributes are different, then we simply merge the information. For instance, if we calculate via back-propagation that a Web object has the metadatum A:v and also the metadatum B:w, then we can infer that the Web object has both metadata, i.e. it can be classified as "A:v, B:w". The other case is when the attributes are the same. For instance, we may calculate via back-propagation that a Web object has the attribute A:v (this metadatum back-propagated from a certain Web object) and A:w (this metadatum back-propagated from another Web object). The employed solution is to use fuzzy arithmetic and take the so-called fuzzy or of the values v and w, that is to say the maximum of the two. So, the object would be classified as A:max(v,w) (here, max denotes as usual the maximum operator). This operation stems directly from classic fuzzy logic, and its intuition is that if we have several different choices for the same value (in this case, v and w), then we keep the most informative choice (that is, we take the maximum value), and discard the other, less informative, values. Above we only discussed the case of two metadata, but it is straightforward to extend the discussion to the case of several metadata, by repeatedly combining them using the above operations until the final metadatum is obtained.
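The combination rule can be sketched as follows (again an illustrative sketch with hypothetical names): any number of attribute-to-value dictionaries are merged by taking, for each attribute, the fuzzy or, i.e. the maximum, of the available values.

```python
def fuzzy_or_merge(*records: dict) -> dict:
    """Combine several fuzzy metadata records into one.

    Attributes appearing in only one record are simply kept; attributes
    appearing in several records get the maximum (fuzzy or) of their values.
    """
    merged = {}
    for record in records:
        for attribute, value in record.items():
            merged[attribute] = max(merged.get(attribute, 0.0), value)
    return merged

# Two back-propagated records plus metadata already present in the object:
print(fuzzy_or_merge({"A": 0.4}, {"A": 0.7, "B": 0.3}, {"C": 1.0}))
# -> {'A': 0.7, 'B': 0.3, 'C': 1.0}
```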
As a technical aside, we remark that the above method can be seen as a "first-order" approximation of the back-propagation of the so-called hyper-information (see [8]); along the same lines, more refined "higher-order" back-propagation techniques can be built, at the expense of computation speed.
4. Testing
In order to measure the effectiveness of the approach, we first need to "restrict" the Web, in some sense, to a more manageable size, that is to say to perform our studies on a reasonably sized region of it. In the following we explain how we coped with this problem.
Suppose we have a region of Web space (i.e., a collection of Web objects), say S; the following process augments it by considering its most proximal neighbourhoods:
Consider all the hyperlinks present in the set of Web objects S: retrieve all the corresponding Web objects and add them to the set S.
We can repeat this process as many times as we want, until we reach a decently sized region of the Web S: we can impose as a stop condition that the number of Web objects in S must reach a certain number n.
As an aside, note that for the sake of simplicity we are overlooking some implementation issues that we had to face, for instance: checking that no duplications arise (that is to say, that the same Web object is not added to S more than once); checking that we have not reached a stale point (in some particular cases, the Web objects in S may only have hyperlinks to themselves, so that the augmentation process does not increase the size of S); and cutting the size of S exactly to the bound n (the augmentation process may in one step significantly exceed n).
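A toy sketch of this augmentation process might look as follows; the Web connectivity is passed in as a plain dictionary standing in for actual page fetching and link extraction, and the duplicate check, stale-point check and cut to size n appear in simplified form (all names are ours):

```python
import random

def grow_region(seed: str, links: dict[str, set[str]], n: int) -> set[str]:
    """Grow a Web region S around a seed object until it holds about n objects.

    `links` maps each URL to the set of URLs it points to (in a real run this
    information would come from fetching and parsing the pages).
    """
    region = {seed}
    frontier = {seed}
    while len(region) < n and frontier:
        # Follow all hyperlinks of the current frontier, skipping duplicates.
        new_objects = set().union(*(links.get(u, set()) for u in frontier)) - region
        if not new_objects:          # stale point: nothing new is reachable
            break
        region |= new_objects
        frontier = new_objects
    # The last step may overshoot n, so cut the region back to at most n objects.
    return set(random.sample(sorted(region), min(n, len(region))))
```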
So, in order to extract a region of the Web to work with, the following process is used:
1. A random Web object is chosen (to this aim, we have employed URouLette).
2. It is augmented repeatedly using the aforementioned technique, until a Web region of the desired size n is reached.
Once a region S of the Web has been selected this way, we randomly select a certain percentage p of its Web objects and manually classify them with metadata; then, we propagate the metadata using the back-propagation method.
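Putting the pieces together, the propagation over a selected region could be sketched as below. This is a self-contained simplification under our own assumptions: the fading-plus-merge step is iterated a fixed number of times over the link graph, a schedule that is not prescribed above, and all names are ours.

```python
def propagate_region(links: dict[str, set[str]],
                     seeds: dict[str, dict[str, float]],
                     fading: float,
                     iterations: int = 10) -> dict[str, dict[str, float]]:
    """Back-propagate fuzzy metadata over a whole Web region.

    `links[u]` is the set of objects u points to; `seeds` holds the manually
    classified objects.  At each iteration, every object combines (fuzzy or)
    its current metadata with the faded metadata of the objects it links to.
    """
    nodes = set(links) | set(seeds) | {t for ts in links.values() for t in ts}
    meta = {u: dict(seeds.get(u, {})) for u in nodes}
    for _ in range(iterations):
        new_meta = {}
        for u in nodes:
            combined = dict(meta[u])
            for target in links.get(u, set()):
                for attribute, value in meta[target].items():
                    faded = value * fading                       # fade by one hop
                    combined[attribute] = max(combined.get(attribute, 0.0), faded)
            new_meta[u] = combined
        meta = new_meta
    return meta

# Tiny example: O'' -> O' -> O, with only O manually classified as A:0.8.
links = {"O''": {"O'"}, "O'": {"O"}, "O": set()}
result = propagate_region(links, seeds={"O": {"A": 0.8}}, fading=0.7)
# result["O'"]["A"]  is ~0.56  (= 0.8 * f)
# result["O''"]["A"] is ~0.392 (= 0.8 * f * f)
```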
4.1. Experimentation
For our general experimentation we have employed as metadata classification the well-known Excite Ontology, also known as the "Channels Classification": it is a tree-like set of attributes (also known as categories), and is one of the best existing metadata sets for the general classification of World Wide Web objects.
We ran several tests to verify the effectiveness of the back-propagation method, and also to decide what was the best choice for the fading factor f. Our outcome was that good performance can be obtained by using the following rule of thumb:
f = 1 – p
Intuitively, this means that when the percentage of already classified Web objects is low (p very low), we need a correspondingly high value of f (f near one), because we need to back-propagate farther and must ensure that the propagated values do not fade too quickly. On the other hand, when the percentage of already classified Web objects becomes higher, we need to back-propagate less, and so the fading factor can be kept smaller (in the limit case p=1, the rule indeed gives f=0, that is to say we do not need to back-propagate at all).
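A quick numeric illustration of this trade-off (the function name is ours): with p=5% the rule gives f=0.95, so a metadatum still retains about 60% of its value ten links away, whereas with p=50% it fades to roughly 0.1% over the same distance.

```python
def fading_factor(p: float) -> float:
    """Rule-of-thumb fading factor: f = 1 - p, where p is the fraction of
    Web objects already carrying a manual metadata classification."""
    return 1.0 - p

# Survival of a metadatum of value 1.0 after ten hyperlink hops:
print(fading_factor(0.05) ** 10)  # ~0.599 with 5% of objects classified
print(fading_factor(0.50) ** 10)  # ~0.001 with 50% of objects classified
```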
Even if our own tests were quite positive, our evaluations could be somewhat dependent on our personal way of judging metadata. Moreover, besides confirmation of the effectiveness of the approach, we also needed larger-scale results analyzing in detail the situation with respect to the percentage p of already classified Web objects. Therefore, we performed a more challenging series of tests, this time involving an external population of 34 users.
The initial part of the tests was exactly as before: first, a random Web object is selected, and then the region is augmented until a Web region of size n is reached. Then the testing phase occurs: we randomly choose a certain number m of Web objects that have not been manually classified, and let the users judge how well the calculated classifications describe those Web objects. The "judgment" of a certain set of metadata with respect to a certain Web object consists of a number ranging from 0 to 100, with the intended meaning that 0 stands for completely wrong/useless metadata, while 100 stands for a careful and precise description of the considered Web object.
It is immediate to see that in the original situation, without back-propagation, the expected average judgment when p = n% is just n (for example, if p = 7%, then the expected average judgment is 7). So, we have to show how much back-propagation can improve on this original situation.
As said, for the tests we employed a population of 34 persons: they were only aware of the goals of the experiment (to judge different methods of generating metadata), but they were not aware of our particular method. Moreover, we tried to go further and limit any possible "psychological pressure", clearly stating that we wanted to measure the qualitative differences between different metadata sets, so that the absolute judgments were not a concern, while the major point was the differences between judgments.

The first round of tests was done by setting the size n of the Web region to 200, and the number m of Web objects that the users had to judge to 20 (which corresponds to 10% of the total considered Web objects). Various tests were performed, in order to study the effectiveness of the method as the parameter p varies. We ran the judgment test for each of the following percentages p of classified Web objects: p=2.5%, p=5%, p=7.5%, p=10%, p=15%, p=20%, p=30%, p=40%, p=50%, p=60%, p=70%, p=80% and p=90%. The final average results are shown in the following table:
Average judgments in the n=200 case.
p:         2.5%  5%    7.5%  10%   15%   20%   30%   40%   50%   60%   70%   80%   90%
judgment:  17.1  24.2  31.8  39.7  44.9  56.4  62.9  72.1  78.3  83.8  92.2  95.8  98.1
The data are graphically summarized in the following chart, where an interpolated spline is used to better
emphasize the dependency between the percentage p of already classified Web objects and the obtained judgment
j (here, the diagonal magenta dotted line indicates the average judgment in the original case without
back-propagation):
As we can see, there is an almost linear dependence roughly from p=20% up to p=70%, where the judgment starts its asymptotic deviation towards the upper bound. The behaviour is also rather clear in the region from p=5% to p=20%; however, since in practice the lower percentages of p are also the most important to study, we ran a second round of tests specifically devoted to this zone of p values. This time, the size of the Web region was five times bigger, setting n=1000, and the value of m was set to 20 (that is to say, 2% of the Web objects). Again, various tests were performed, each one with a different value of p. This time, the granularity was increased: while in the previous phase we had the values p=2.5%, p=5%, p=7.5%, p=10%, p=15% and p=20%, in this new phase we also ran the tests for the extra cases p=12.5% and p=17.5%, so as to perform an even more precise analysis of this important zone. The final average results are shown in the next table:

References
Fuzzy models—What are they, and why? [Editorial].
The quest for correct information on the Web: hyper search engines.
Miller, P. Metadata for the Masses.
Rating the Net.