
A Gray code based ordering for documents on shelves: Classification for browsing and retrieval

- 01 May 1992 -

Journal of the Association for Informati...

- Vol. 43, Iss: 4, pp 312-322

TLDR

A classification system is proposed that can classify documents without human intervention and can incorporate both classification by subject and by other forms of bibliographic information, allowing for the generalization of browsing to include all features of an information carrying unit.

Abstract:

A document classifier places documents together in a linear arrangement for browsing or high-speed access by human or computerized information retrieval systems. Requirements for document classification and browsing systems are developed from similarity measures, distance measures, and the notion of subject aboutness. A requirement that documents be arranged in decreasing order of similarity as the distance from a given document increases can often not be met. Based on these requirements, information-theoretic considerations, and the Gray code, a classification system is proposed that can classify documents without human intervention. It provides a theoretical justification for individual classification numbers going from broad to narrow topics when moving from left to right in the classification number. A general measure of classifier performance is developed, and used to evaluate experimental results comparing the distance between subject headings assigned to documents given classifications from the proposed system and the Library of Congress Classification (LCC) system. Browsing in libraries, hypertext, and databases is usually considered to be the domain of subject searches. The proposed system can incorporate both classification by subject and by other forms of bibliographic information, allowing for the generalization of browsing to include all features of an information carrying unit. © 1992 John Wiley & Sons, Inc.


A Gray Code Based Ordering for Documents on Shelves: Classification for Browsing and Retrieval

Journal of the American Society for Information Science 43(4) 1992, 312–322.

Robert M. Losee

University of North Carolina

Chapel Hill, NC 27599-3360 U.S.A.

losee@ils.unc.edu

June 28, 1998


1 Introduction

Mentioning document classification may bring to a layperson’s mind the Dewey Decimal system, while to the library professional, it usually provokes thoughts of the problems associated with bringing together similar materials, the costs of cataloging, or the difficulty of justifying one classification system over another. Svenonius (1981) has suggested that “the main use for classification, at least in the United States, has been to facilitate browsing.” While classification has other roles in the library, from the mundane to playing “a direct role in the creation of original knowledge” [6], classification as a tool to support browsing activities in both libraries and database systems will be the focus of this research and discussion. Although numerous library classification systems have been developed, most have been developed from philosophical and taxonomical considerations [12, 28]; few have been based on more scientific criteria. A set of precise requirements for a document classification system is provided here, as well as a classification system consistent with these requirements. This classification system and an associated measure of classification performance have been developed based on concepts used by information professionals who use document classification systems.

Many different quantitative methods are available for determining similarities between documents and for optimally ordering documents. Methods similar to the work here include studies of coding techniques that place similar documents near each other, given a query [10], and hashing techniques that maintain order [13]. An information theoretic method is developed here which we believe is easier to interpret and consistent with the needs of document classification systems. Other similarity techniques and measures could be used and need to be explored, e.g., the expected mutual information measure. The method developed here for evaluating a classification system is based on information naturalness concerns. Performance measures using statistical correlation could also be used satisfactorily, measuring, for example, the correlation between the ordering provided by the classification system and the ordering provided by a perfectly ordered collection of documents.

Aristotle suggested that a science has as its base a series of predicates that, in effect, define the science. For example, the science of physics uses fundamental predicates such as position, velocity, and inertia on which it builds its theories. If a science of document classification is to be developed, it will probably have at its base predicates like the shelf-distance between documents and the subject-similarity of documents.

Developing a document classification system presents a distinctive set of problems. The primary function of a document or library classification system is to assign a value to each information carrying unit so that when the items are sorted by this value, like information carrying units are grouped together. Co-location of documents allows users to benefit from browsing through nearby similar materials once they have located a single potentially relevant item, increasing the precision of the browsing. To provide for browsing capabilities, a library classification system should meet several requirements. Broad requirements for a classification system have been suggested by Wynar and Taylor [28]. We believe that a classification system supporting browsing should

assign classification values objectively (objectivity requirement),

provide a single classification system capable of classifying all possible documents (inclusion requirement),

provide a linear structure (linearity requirement), and

assign values to documents so that when one moves away from any document in either direction on a conceptual shelf or in a database, the documents become increasingly dissimilar (increasing distance-dissimilarity requirement).

A classification system meeting these requirements will fulfill the needs of a librarian or database manager wishing to place like items together for browsing. In a library, for example, similar documents may be retrieved consecutively by taking books in order as one moves down a shelf.
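The increasing distance-dissimilarity requirement can be checked mechanically for a small shelf. The following is a minimal sketch rather than the paper's procedure: it assumes that documents are tuples of binary features and uses Hamming distance as a stand-in dissimilarity measure. The names `hamming` and `meets_requirement` are hypothetical.

```python
from itertools import permutations
from typing import Sequence, Tuple

Doc = Tuple[int, ...]  # assumption: a document is a tuple of 0/1 feature values


def hamming(a: Doc, b: Doc) -> int:
    """Number of binary features on which two documents differ."""
    return sum(x != y for x, y in zip(a, b))


def meets_requirement(shelf: Sequence[Doc]) -> bool:
    """True if, starting from every document, dissimilarity never decreases
    when moving away from it to the left or to the right."""
    for i, doc in enumerate(shelf):
        right = [hamming(doc, other) for other in shelf[i + 1:]]
        left = [hamming(doc, other) for other in reversed(shelf[:i])]
        for side in (left, right):
            if any(later < earlier for earlier, later in zip(side, side[1:])):
                return False
    return True


# Three documents that differ by one feature at a time satisfy the requirement.
nested = [(0, 0), (0, 1), (1, 1)]
nested_ok = meets_requirement(nested)

# All four combinations of two features: try every possible shelf order.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
square_ok = any(meets_requirement(order) for order in permutations(square))
```

Under these assumptions `nested_ok` is true, while `square_ok` is false: no ordering of the four documents in `square` satisfies the requirement, which illustrates why it often cannot be met exactly.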

Other characteristics of a classification system, while desirable, are not mandatory. A classification system might

be easily (quickly) searched,

be easy for librarians to use when classifying documents,

allow for classiﬁcation by computer,

be consistent with an existing, popular system,

provide explanatory power, or

be readily adaptable to incorporate changes in the materials classiﬁed.

These requirements and desirable characteristics, when combined with the predicates of a science of classification, will be used as the basis for the proposed classification system. A guide to the automatic classification literature is provided by [18].

A document or book has a number of features which may characterize the topic of the document. Subject indicating features may include library subject headings, while other features may indirectly indicate the document’s subject, such as words that occur in a document’s title, the language, date and place of publication, and characteristics of the author that might provide information about the subject of the author’s work.

Features are assumed here to be binary. A feature is thus present or absent, 1 or 0, depending on the degree to which the document is about the feature. Below, the expression “the probability of a feature” refers to the probability that the feature has the value 1. Features are assumed to be treatable as statistically independent.

The primary purpose of a classification system is to enable information searchers to browse through documents. Once an initial document has been located on a shelf in a library or in a window in a hypertext system, searchers often choose to examine related materials [2, 5, 8, 22, 23]. Classification provides this clustering of similar materials [9, 17]. Classification and browsing systems typically group items by subject similarity, but clustering procedures may also take into consideration bibliographic features not normally thought of as subject related, such as type of binding. Although the classification system discussed below is capable of incorporating bibliographic features used in known item searches and is not limited to conventional subject clustering, further research will be necessary to determine how useful classification by features other than subject would be to library patrons and database searchers.

Below, a Gray code based classification system is used to group a set of documents together. The classification procedure groups documents based on the documents’ subject-bearing features. A measure is proposed which can be used to evaluate the performance of a classification system. Experiments suggest that the Gray code based classification method places similar documents closer together than does the Library of Congress Classification system.
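The idea of ordering documents by a Gray code can be illustrated with a minimal sketch. It assumes the standard binary-reflected Gray code and Hamming distance between binary feature tuples; the code variant, the feature order, and the helper names (`binary_rank`, `gray_rank`, `adjacent_distances`) are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product
from typing import List, Sequence, Tuple

Doc = Tuple[int, ...]  # assumption: a document is a tuple of 0/1 feature values


def hamming(a: Doc, b: Doc) -> int:
    """Number of binary features on which two documents differ."""
    return sum(x != y for x, y in zip(a, b))


def binary_rank(bits: Doc) -> int:
    """Shelf position when the feature bits are read as an ordinary binary number."""
    rank = 0
    for bit in bits:
        rank = (rank << 1) | bit
    return rank


def gray_rank(bits: Doc) -> int:
    """Shelf position when the feature bits are read as a binary-reflected
    Gray codeword (Gray-to-binary: each output bit is the XOR of the Gray
    bits seen so far, scanning left to right)."""
    rank, running = 0, 0
    for bit in bits:
        running ^= bit
        rank = (rank << 1) | running
    return rank


def adjacent_distances(shelf: Sequence[Doc]) -> List[int]:
    """Hamming distance between each document and its right-hand neighbour."""
    return [hamming(a, b) for a, b in zip(shelf, shelf[1:])]


# A toy collection containing every combination of three binary features.
docs = list(product((0, 1), repeat=3))
binary_shelf = sorted(docs, key=binary_rank)
gray_shelf = sorted(docs, key=gray_rank)

binary_gaps = adjacent_distances(binary_shelf)
gray_gaps = adjacent_distances(gray_shelf)
```

For these eight documents the binary ordering gives adjacent distances 1, 2, 1, 3, 1, 2, 1, while the Gray ordering gives a distance of 1 between every adjacent pair. In a sparse collection, where not every feature combination occurs, adjacent distances under the Gray ordering can exceed 1.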

2 Shelf-Distance

For notational simplification, a classification system is understood as an ordering of a series of documents or text fragments on a single conceptual shelf holding documents. A document or book is denoted as $d_i$, with the subscript indicating the position of the document on the shelf. The documents are ordered

$$d_0, d_1, \ldots, d_i, \ldots$$

where the subscript indicates the position of the document relative to the leftmost document on the shelf, with position $i$ representing an arbitrarily chosen position under consideration. $d_0$ is a hypothetical document with no subject content of any sort at the left end of the shelf.

Distance measures in most common geometric spaces, such as a conceptual document shelf, must meet several criteria, including:

$$D(d_i, d_j) \ge 0, \qquad D(d_i, d_i) = 0, \qquad D(d_i, d_j) = D(d_j, d_i),$$

where the distance between $d_i$ and $d_j$ is denoted as $D(d_i, d_j)$ for all $i$ and $j$. It is always the case that $D(d_i, d_j) \le D(d_i, d_k)$ for all $i$, $j$, and $k$ when $i < j < k$ or when $i > j > k$. Unless indicated otherwise, it is assumed that the distance function, as well as other functions, are only defined over the set of documents to the right of $d_i$ and over the set of documents to the left of $d_i$ but not over the set of all documents taken together. Therefore, the distance between $d_j$ and $d_k$ is not defined if $d_i$ is between $d_j$ and $d_k$.

3 Distance and Dissimilarity

The dissimilarity function, or distance in a conceptual subject space between two documents $d_i$ and $d_j$, is denoted as $\Delta(d_i, d_j)$. The subject of a document, what it is about, is considered to be determined by all subject related aspects or features of the document. The subject-dissimilarity may be computed as a function of the degree of difference in feature values between two documents.

Because the dissimilarity function may be understood as a distance measure in a conceptual space, the following holds:

$$\Delta(d_i, d_j) \ge 0, \qquad \Delta(d_i, d_i) = 0, \qquad \Delta(d_i, d_j) = \Delta(d_j, d_i).$$

A classification system may attempt to maximize or minimize factors combining the distance between documents in physical space (shelf-distance) with the subject similarity or dissimilarity between documents. Numerous techniques for combining distance and similarity measures are used in mathematical clustering procedures [1].

A classification function consistent with the requirements for a document classification system, and in particular, the increasing distance-dissimilarity requirement, mandates that items be placed in weakly ascending order by the value of a subject-dissimilarity measure as one moves out in either direction from any given document. Weakly ascending order implies that the value for a selected feature


Figures

Figure 2: Four documents arranged in order by the value of their position in space.

Table 6: Order of features as suggested by theory, in the reverse order of that suggested by the theory, and in alphabetical order. Q measure based on Hamming distance and the information dissimilarity measures.

Figure 1: Four documents in a conceptual subject space.

Table 2: Document ordering with binary and Gray codes. Hamming distances represent the distance from a document (with the distance indicated) to the adjacent document immediately above it, understood as the document immediately to its left on a shelf. The average Hamming distance between adjacent documents using the

Table 4: Comparison of inter-document distances in LC and Gray code based classification systems.


Frequently Asked Questions (12)

Q1. What are the contributions in "A Gray code based ordering for documents on shelves: Classification for browsing and retrieval"?

In this paper, a classification system based on the Gray code is proposed to classify documents without human intervention, and a general measure of classifier performance is developed and used to evaluate experimental results comparing the distance between subject headings assigned to documents given classifications.

Q2. What is the difficulty of the method of analysis?

One difficulty associated with this method of analysis is that distances between documents are computed based solely on their subject headings.

Q3. What is the purpose of the proposed classification system?

The proposed classification system, consistent with a set of requirements for a classification system, can be used effectively to classify documents in library and database environments.

Q4. What are the five databases used for the proposed classification system?

The five databases used consist of bibliographic records retrieved from searches of the UNC-Chapel Hill Library’s on-line catalog.

Q5. What is the way to reduce the expected dissimilarity between features?

As the rightmost bits “cycle” more frequently than bits to the left, this arrangement will result in a lower expected dissimilarity between adjacent documents than would be the case if the least probable features with the greatest expected dissimilarity were on the right side and cycled most frequently.

Q6. What is the advantage of using a theoretically based model for a classification system?

One of the advantages of using a theoretically based model for a classification system is that classification performance can be analyzed.

Q7. How many binary features can be extracted from an existing classification system?

If the first character of an existing classification code has one of twenty six possible characters, the first twenty six binary feature values of the proposed system could be extracted and treated as a single character for comparison.

Q8. Why is the greater increase in performance due to the combined database?

The greater increase in performance for the combined database is due in part to the increase in , but is also due to the increased effectiveness of both classification systems when database size increases.

Q9. What is the effect of the dissimilarity function on a conceptual shelf?

This document arrangement has the effect of ordering documents so that at any one point on the conceptual shelf, the documents to both the right and the left of the point are arranged in order of increasing dissimilarity to the document at the chosen point.

Q10. What is the way to extend the search beyond subject searching?

The proposed system can incorporate both classification by subject and by incorporation of non-subject related bibliographic features, allowing for the extension of browsing beyond subject searching.

Q11. How can the expected dissimilarity between features be reduced?

The expected dissimilarity between documents, as represented by the sum of the expected dissimilarity between features, can be decreased by placing those features with the least expected dissimilarity furthest to the right in the code, while the features with the greatest expected dissimilarity are placed to the left.

Q12. What are the methods used to determine similarities between documents?

Methods similar to the work here include studies of coding techniques that place similar documents near each other, given a query [10], and hashing techniques that maintain order [13].
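The remark in Q5 that the rightmost bits "cycle" more frequently than bits to the left can be made concrete. The sketch below assumes the standard binary-reflected Gray code, which may differ in detail from the code used in the paper, and counts how often each bit position changes across the full code sequence. The function names `gray` and `flips_per_position` are hypothetical.

```python
from typing import List


def gray(n: int) -> int:
    """n-th codeword of the binary-reflected Gray code."""
    return n ^ (n >> 1)


def flips_per_position(width: int) -> List[int]:
    """Count, for each bit position (index 0 = leftmost), how many times it
    changes while stepping through all 2**width codewords in order."""
    counts = [0] * width
    for n in range((1 << width) - 1):
        changed = gray(n) ^ gray(n + 1)  # exactly one bit differs between neighbours
        counts[width - changed.bit_length()] += 1
    return counts


flips = flips_per_position(4)
```

With four bits the counts, read left to right, are 1, 2, 4, and 8, so a feature assigned to the rightmost position changes between adjacent codewords far more often than one assigned to the left.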
