
Showing papers on "Document layout analysis published in 2012"


Patent
27 Aug 2012
TL;DR: In this article, a document processing system for accurately and efficiently analyzing documents, and methods for making and using the same, are presented. Each incoming document includes at least one section of textual content and is provided in electronic form or as a paper-based document that is converted into an electronic form.
Abstract: A document processing system for accurately and efficiently analyzing documents and methods for making and using same. Each incoming document includes at least one section of textual content and is provided in an electronic form or as a paper-based document that is converted into an electronic form. Since many categories of documents, such as legal and accounting documents, often include one or more common text sections with similar textual content, the document processing system compares the documents to identify and classify the common text sections. The document comparison can be further enhanced by dividing the document into document segments and comparing the document segments; whereas, the conversion of paper-based documents likewise can be improved by comparing the resultant electronic document with a library of standard phrases, sentences, and paragraphs. The document processing system thereby enables an image of the document to be manipulated, as desired, to facilitate its review.
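The comparison against a library of standard phrases described in the abstract can be sketched as follows. This is a minimal illustration using `difflib` ratio as a stand-in similarity measure; the sample phrases and the 0.8 threshold are hypothetical, not from the patent.

```python
import difflib

STANDARD_LIBRARY = [
    "This agreement shall be governed by the laws of the state.",
    "The parties agree to resolve disputes through arbitration.",
]  # hypothetical sample phrases; a real system would load a curated library

def match_standard_text(segment, threshold=0.8):
    """Compare a converted document segment against a library of standard
    phrases and return the best match when its similarity clears the
    threshold, mirroring the patent's idea of improving paper-to-electronic
    conversion with known boilerplate."""
    best = max(STANDARD_LIBRARY,
               key=lambda p: difflib.SequenceMatcher(None, segment, p).ratio())
    score = difflib.SequenceMatcher(None, segment, best).ratio()
    return best if score >= threshold else None
```

A noisy OCR segment such as "This agreem3nt shall be governed by ..." would still match the first library phrase, while unrelated text falls below the threshold.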

103 citations


Proceedings ArticleDOI
18 Sep 2012
TL;DR: This work introduces an approach that segments text appearing in page margins (side-notes) from manuscripts with complex layouts, independent of both block segmentation and pixel-level analysis.
Abstract: Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-notes text) from manuscripts with complex layout formats. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multilayer perceptron classifier is exploited to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation, as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-notes layout formats, achieving a segmentation accuracy of about 95%.
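Connected-component-level feature extraction of the kind described above can be sketched with `scipy.ndimage`; the exact feature set below (area, box size, aspect ratio, centroid) is an illustrative guess, not the paper's feature list.

```python
import numpy as np
from scipy import ndimage

def component_features(binary_img):
    """Per connected component: area, height, width, aspect ratio and
    bounding-box centre, one row per component. These are the kind of
    simple, discriminative connected-component-level features a layout
    classifier (e.g. a multilayer perceptron) could consume."""
    labels, n = ndimage.label(binary_img)
    feats = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        area = int((labels[sl] == i).sum())  # count only this component's pixels
        cy = (sl[0].start + sl[0].stop) / 2.0
        cx = (sl[1].start + sl[1].stop) / 2.0
        feats.append([area, h, w, w / h, cx, cy])
    return np.array(feats)
```

Each row of the returned matrix is one feature vector, ready to be fed to a classifier.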

65 citations


Patent
29 Nov 2012
TL;DR: In this article, an ontology is extracted from external information sources that contain descriptions of particular domain objects, and the text information in the document is linked to ontology concepts and a document semantic model is built.
Abstract: The invention relates to processing of data during semantic analysis of text data and building of document semantic models. The method consists of two main steps. In step one, an ontology is extracted from external information sources that contain descriptions of particular domain objects. In step two, the text information in the document is linked to ontology concepts and a document semantic model is built. Electronic resources that may or may not be connected through a hyperlink structure are used as information sources. The technical result is achieved, in particular, by identifying all terms in the document and linking them to ontology concepts, so that each term correlates to one concept (its meaning), and then ranking term meanings by importance to the document.

42 citations


Patent
15 Mar 2012
TL;DR: In this paper, a process and system for facilitating mass translation of documents by translators via an online system that provides an interface for such translation is presented; each original document includes both text and non-text (e.g., images, graphs, charts) components.
Abstract: In accordance with some embodiments, processes and systems are provided for facilitating mass translation of documents by translators (e.g., freelance translators) via an online system which provides an interface for facilitating such translation. In accordance with one embodiment, the processes and systems provide for receiving an original document which includes both text and non-text (e.g., images, graphs, charts) components, extracting the text from the document, presenting the extracted text to a translator, receiving a translation of the text, and creating a translated version of the original document based on the received translated text and the layout of the original document, such that the aesthetic characteristics of the original document are generally preserved without the translator's attention being diverted to such aesthetic characteristics and layout.

36 citations


Patent
06 Jul 2012
TL;DR: In this article, a method, a storage medium and a system for document content reconstruction are provided in a digital content delivery and online education services platform to enable delivery of textbooks and other copyrighted material to multi-platform web browser applications.
Abstract: A method, a storage medium and a system for document content reconstruction are provided in a digital content delivery and online education services platform to enable delivery of textbooks and other copyrighted material to multi-platform web browser applications. The method comprises ingesting a document page in an unstructured document format. The method further comprises extracting, from the document page, one or more images together with their associated metadata, as well as text and its associated fonts. In addition, the method comprises coalescing the text into paragraphs and creating a structured document page in a markup language format using the extracted images, text, and fonts, rendered with layout fidelity to the originally ingested document page.

32 citations


Proceedings Article
11 Nov 2012
TL;DR: A local noise model for grayscale images that simulates degradations due to the age of the document itself and printing/writing process such as ink splotches, white specks or streaks is proposed.
Abstract: The Kanungo noise model is widely used to test the robustness of binary document image analysis methods against noise. This model works only with binary images, while most document images are in grayscale. Because binarizing a document image might degrade its contents and lead to a loss of information, more and more researchers are currently focusing on segmentation-free methods (Angelika et al. [2]). Thus, we propose a local noise model for grayscale images. Its main principle is to locally degrade the image in the neighbourhoods of "seed-points" selected close to the character boundary. These points define the centers of "noise regions". The pixel values inside a noise region are modified by a Gaussian random distribution to make the final result more realistic. While the Kanungo model simulates scanning artifacts, our model simulates degradations due to the age of the document itself and to the printing/writing process, such as ink splotches, white specks, or streaks. It is very easy for users to parameterize the model and create a set of benchmark databases with an increasing level of noise. These databases can further be used to test the robustness of different grayscale document image analysis methods (e.g., text line segmentation, OCR, handwriting recognition).
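The seed-point idea can be sketched as follows. This is a simplified interpretation of the local noise model, not the authors' exact formulation: seeds are drawn from a crude gradient-based edge map, and each seed's square neighbourhood is perturbed with Gaussian noise (the gradient threshold, radius, and sigma are illustrative assumptions).

```python
import numpy as np

def add_local_noise(gray, n_seeds=40, radius=3, sigma=25.0, rng=None):
    """Degrade a grayscale page near character boundaries: pick seed
    points on strong-gradient pixels, then perturb each seed's small
    "noise region" with Gaussian noise, simulating ink splotches and
    white specks rather than scanner artifacts."""
    rng = np.random.default_rng(rng)
    img = gray.astype(float).copy()
    gy, gx = np.gradient(img)
    edges = np.argwhere(np.hypot(gx, gy) > 40)  # crude character-boundary map
    if len(edges) == 0:
        return img.clip(0, 255).astype(np.uint8)
    picks = rng.choice(len(edges), size=min(n_seeds, len(edges)), replace=False)
    h, w = img.shape
    for y, x in edges[picks]:
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        img[y0:y1, x0:x1] += rng.normal(0.0, sigma, (y1 - y0, x1 - x0))
    return img.clip(0, 255).astype(np.uint8)
```

Increasing `n_seeds` or `sigma` yields progressively noisier benchmark images, matching the paper's idea of databases with an increasing level of noise.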

27 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: A new text line extraction technique based on Spiral Run Length Smearing Algorithm (SRLSA) is reported, where digitized document image is partitioned into a number of vertical fragments of equal width and text line segments present in these fragments are identified by applying SRLSA.
Abstract: Extraction of text lines from document images is one of the important steps in an Optical Character Recognition (OCR) system. In the case of handwritten document images, the presence of skewed, touching, or overlapping text lines makes this process a real challenge for the researcher. In the present work, a new text line extraction technique based on the Spiral Run Length Smearing Algorithm (SRLSA) is reported. First, the digitized document image is partitioned into a number of vertical fragments of equal width. Then all the text line segments present in these fragments are identified by applying SRLSA. Finally, the neighboring text line segments are analyzed and merged (if necessary) to place them inside the same text line boundary to which they actually belong. For experimental purposes, the technique is tested on the CMATERdb1.1.1 and CMATERdb1.2.1 databases. The present technique successfully extracts 87.09% and 89.35% of text lines from the said databases, respectively.
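The smearing step underlying SRLSA can be sketched as classic run-length smearing on a single binary row; the spiral scan order that distinguishes SRLSA from plain RLSA is not reproduced in this minimal sketch.

```python
def rlsa_horizontal(row, max_gap):
    """Run-Length Smearing on one binary row (1 = ink, 0 = background):
    background runs of length <= max_gap that lie between two ink pixels
    are filled with ink, merging nearby characters into line segments."""
    out = list(row)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            # fill only interior gaps (ink on both sides) short enough
            if 0 < i and j < n and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

Leading and trailing background runs are left untouched, since they have ink on only one side.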

25 citations


Proceedings Article
01 Nov 2012
TL;DR: A learning framework that makes use of the Markov Random Field to improve the performance of the existing document image binarization methods for those degraded document images.
Abstract: Document image binarization is an important preprocessing technique for document image analysis that segments the text from the document image background. Many techniques have been proposed and successfully applied in different applications, such as document image retrieval. However, these techniques may perform poorly on degraded document images. In this paper, we propose a learning framework that makes use of a Markov Random Field to improve the performance of existing document image binarization methods on such degraded document images. Extensive experiments on the recent Document Image Binarization Contest datasets demonstrate significant improvements to the existing binarization methods when our proposed framework is applied.

22 citations


Proceedings Article
01 Nov 2012
TL;DR: It is proved that using Relative Location Features improves the final segmentation on documents with a strong structure, while their application to unstructured documents does not show significant improvement.
Abstract: In this paper we evaluate the use of Relative Location Features (RLF) on a historical document segmentation task, and compare the quality of the results obtained on structured and unstructured documents with and without RLF. We show that using these features improves the final segmentation on documents with a strong structure, while their application to unstructured documents does not show significant improvement. Although this paper is not focused on segmenting unstructured documents, the results obtained on a benchmark dataset equal or even surpass previous results from similar works.

19 citations


Journal ArticleDOI
TL;DR: Experimental and comparative results prove the effectiveness of the proposed knowledge-based system and its advantages in extracting text-lines with a large variety of illumination levels, sizes, and font styles from various types of mixed and overlapping text/graphics complex compound document images.
Abstract: This paper presents a new knowledge-based system for extracting and identifying text-lines from various real-life mixed text/graphics compound document images. The proposed system first decomposes the document image into distinct object planes to separate homogeneous objects, including textual regions of interest, non-text objects such as graphics and pictures, and background textures. A knowledge-based text extraction and identification method obtains the text-lines with different characteristics in each plane. The proposed system offers high flexibility and expandability by merely updating new rules to cope with various types of real-life complex document images. Experimental and comparative results prove the effectiveness of the proposed knowledge-based system and its advantages in extracting text-lines with a large variety of illumination levels, sizes, and font styles from various types of mixed and overlapping text/graphics complex compound document images.

17 citations


Patent
26 Mar 2012
TL;DR: In this paper, an energy model of the layout of the user-content components in the user document is generated based on the original positions and sizes of the user-content components in the template document.
Abstract: Methods and systems for optimizing the layout of a document constructed from a template document, where the template document comprises a plurality of individually-specified components including one or more individually-specified user-content components configured to receive user content from a user of the template document. An energy model of the layout of the user-content components in the user document is generated based on the original positions and sizes of the user-content components in the template document. Positions of corresponding components in the user document are automatically adjusted to minimize the energy of the user-content component layout in the user document.

Proceedings ArticleDOI
18 Sep 2012
TL;DR: The results from this research suggest that the proposed approach performs better on practical palm-leaf manuscript data in solving the line segmentation problem.
Abstract: Text line extraction is one of the critical steps in document analysis and optical character recognition (OCR) systems. The purpose of this study is to address the problem of text line extraction from ancient Thai manuscripts written on palm leaves, using an Adaptive Partial Projection (APP) technique that integrates a modified partial projection and smoothed histograms with recursion. The proposed approach was compared with a Modified Partial Projection (MPP) approach, looking at vowel analysis and touching components of two consecutive lines. The results from this research suggest that the proposed approach performs better on practical palm-leaf manuscript data in solving the line segmentation problem.

Patent
30 Mar 2012
TL;DR: In this article, the authors present a form layout tool that provides a flexible way to lay out forms on a web page by configuring a web configuration file with the location of form layout styles.
Abstract: A form layout system includes a form layout tool that provides a flexible way to lay out forms on a web page. The form layout tool configures a web configuration file with the location of form layout styles, and uses those styles, together with a number of columns, a number of fields, and a "size" for each field to include in the page layout component, to create a page layout for a target application. The form layout tool generates a revised application page with the created page layout by applying the form layout style to it.

Book ChapterDOI
01 Jan 2012
TL;DR: The presented system is based on a suitable combination of different well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.
Abstract: Layout analysis—extraction of text lines from a document image and identification of their reading order—is an important step in converting the document into a searchable electronic representation. Projection methods are typically employed for extraction of text lines in Arabic script documents. Although projection methods achieve good accuracy on clean, skew-free documents, their performance drops under challenging situations (border noise, skew, complex layouts, etc.). This chapter presents a layout analysis system for extracting text lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian, etc.) and different styles (Naskh, Nastaliq, etc.). The presented system is based on a suitable combination of different well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.
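The projection methods the chapter discusses can be sketched as follows. This is the minimal horizontal-projection baseline, which works on clean, skew-free pages, precisely the limitation noted above; it is not the chapter's full system.

```python
import numpy as np

def text_lines_by_projection(binary_img, min_ink=1):
    """Horizontal projection-profile line extraction: sum ink per row,
    then report (start, end) row intervals whose profile reaches
    min_ink. Each interval corresponds to one candidate text line."""
    profile = binary_img.sum(axis=1)
    lines, start = [], None
    for y, v in enumerate(profile):
        if v >= min_ink and start is None:
            start = y
        elif v < min_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

On a skewed page the per-row profile flattens out and adjacent lines merge, which is why the chapter combines several more robust techniques.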

Proceedings ArticleDOI
21 Mar 2012
TL;DR: English text extraction from blobs in comic images using various methods, preserving the text and its formatting during the conversion process and providing high-quality text from the printed document.
Abstract: Text extraction from images is one of the complicated areas of digital image processing. Detecting and recognizing text in comic images is a complex process due to their varied sizes, gray-scale values, complex backgrounds, and font styles. Extracting text from comic images helps to preserve the text and its formatting during the conversion process and provides high-quality text from the printed document. This paper describes English text extraction from blobs in comic images using various methods.

Proceedings ArticleDOI
27 Mar 2012
TL;DR: Results show that the proposed skew estimation is comparable with state-of-the-art methods and outperforms them on a real dataset consisting of 658 snippets.
Abstract: Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document for further processing. A pre-processing step of document analysis methods is a skew estimation of scanned or photographed documents. Current skew estimation methods require the existence of large text areas, are dependent on the text type and can be limited on a specific angle range. The proposed method is gradient based in combination with a Focused Nearest Neighbor Clustering of interest points and has no limitations regarding the detectable angle range. The upside/down decision is based on statistical analysis of ascenders and descenders. It can be applied to entire documents as well as to document fragments containing only a few words. Results show that the proposed skew estimation is comparable with state-of-the-art methods and outperforms them on a real dataset consisting of 658 snippets.
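For contrast with the gradient/nearest-neighbour method above, the classical projection-based skew estimator it is compared against can be sketched as a brute-force search; this is the baseline technique, NOT the paper's method, and the angle grid is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def estimate_skew(binary_img, angles=np.arange(-5, 5.25, 0.25)):
    """Projection-profile skew estimation: rotate the page over a grid
    of candidate angles and keep the one maximising the variance of the
    horizontal projection, which peaks when text lines are horizontal.
    Note the limited angle range, a restriction the paper removes."""
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rot = ndimage.rotate(binary_img.astype(float), a,
                             reshape=False, order=0)
        score = rot.sum(axis=1).var()
        if score > best_score:
            best_angle, best_score = float(a), score
    return best_angle
```

The returned angle is the rotation that de-skews the page; applying it with `ndimage.rotate` restores roughly horizontal text lines.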

Patent
23 Jan 2012
TL;DR: In this paper, a fixed format document conversion engine and associated method for converting a fixed-format document into a flow format document is presented. Butler et al. use a sequence of layout analysis engines and semantic analysis engines to analyze the base physical layout information obtained from the fixed-formatted document to enrich, modify, and classify the layout information into progressively more advanced physical layout and semantic layout information.
Abstract: A fixed format document conversion engine and associated method for converting a fixed format document into a flow format document. The fixed format document conversion engine includes a sequence of layout analysis engines and semantic analysis engines to analyze the base physical layout information obtained from the fixed format document and to enrich, modify, and classify the physical layout information into progressively more advanced physical layout information and, ultimately, semantic layout information. The semantic layout information is mapped and serialized into a selected flow format document with a high level of flowability.

Proceedings ArticleDOI
05 Apr 2012
TL;DR: A mining model is proposed that consists of sentence-based, document-based, and corpus-based concept analysis, analyzing the terms that contribute to the sentence semantics at the sentence, document, and corpus levels rather than through the traditional analysis of the document only.
Abstract: Classification plays a vital role in many information management and retrieval tasks. This paper studies the classification of text documents. Text classification is a supervised technique that uses labeled training data to learn a classification system and then automatically classifies the remaining text using the learned system. In this paper, we propose a mining model that consists of sentence-based, document-based, and corpus-based concept analysis. We then analyze the terms that contribute to the sentence semantics at the sentence, document, and corpus levels, rather than through the traditional analysis of the document only. After extracting a feature vector for each new document, feature selection is performed, followed by k-Nearest-Neighbour classification. The approach enhances text classification accuracy.
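The final k-Nearest-Neighbour step can be sketched as follows. Plain bag-of-words term frequencies stand in for the paper's concept-level weighting (an assumption); the training snippets below are invented toy data.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def knn_classify(doc, labelled_docs, k=3):
    """k-Nearest-Neighbour text classification: rank labelled documents
    by cosine similarity to the new document and vote among the top k."""
    vec = Counter(doc.lower().split())
    ranked = sorted(labelled_docs,
                    key=lambda d: cosine(vec, Counter(d[0].lower().split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In the full model, each term's weight would come from the sentence/document/corpus concept analysis rather than from raw counts.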

Proceedings ArticleDOI
16 Dec 2012
TL;DR: This paper performs layout analysis to detect words, lines, and paragraphs in the document image and seeks the geometric properties of the text blocks to detect and remove the margin noise.
Abstract: In this paper, we propose a technique for removing margin noise (both textual and non-textual noise) from scanned document images. We perform layout analysis to detect words, lines, and paragraphs in the document image. These detected elements are classified into text and non-text components on the basis of their characteristics (size, position, etc.). The geometric properties of the text blocks are sought to detect and remove the margin noise. We evaluate our algorithm on several scanned pages of Bengali literature books.

Journal ArticleDOI
TL;DR: Evaluation and experimental results demonstrate that the proposed text extraction method is independent of document size, text size, font, and shape, and is robust for Arabic document segmentation and text line extraction.
Abstract: Text/non-text segmentation and text line extraction from document images are among the most challenging problems in indexing Arabic document images, such as books, technical articles, business letters, and faxes, for processing in systems such as OCR. Research on Arabic document digitization has focused on word and handwriting recognition; few approaches have been proposed for layout analysis of scanned or camera-captured Arabic documents. In this paper we present a page segmentation method that deals with the complexity of Arabic language characteristics and fonts by combining two algorithms: the first is run-length smoothing, and the second is connected-component labeling with SVM-based text/non-text classification. The outputs of the two methods are combined through AND and OR operations applied under certain conditions. Text lines are then extracted by a dynamic horizontal projection whose threshold is dynamically updated to accommodate the noise present in different documents and between text lines. The performance evaluation uses manually generated ground truth from a dataset of Arabic document images captured with cameras and with hardware built for this purpose. Evaluation and experimental results demonstrate that the proposed text extraction method is independent of document size, text size, font, and shape, and is robust for Arabic document segmentation and text line extraction. General Terms: Image Processing, Pattern Recognition.

Journal ArticleDOI
TL;DR: The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents and compared against state-of-the-art page-segmentation techniques with benchmark performance and results indicate that the methodology achieves an average of ∼89% classification accuracy in text, photo, and background regions.
Abstract: We propose a page layout analysis algorithm to classify a scanned document into different regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several image preprocessing techniques such as image scaling, filtering, color space conversion, and gamma correction to enhance the scanned image quality and reduce the computation time in later stages. Text detection is applied in the second module wherein wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, methods for edge detection, edge linking, line-segment fitting, and Hough transform are utilized to detect strong edges and lines. In the last module, the resultant text, photo, and edge maps are combined to generate a page layout map using K-Means clustering. The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average of ∼ 89% classification accuracy in text, photo, and background regions.

Patent
08 Mar 2012
TL;DR: A panoptic visualization document layout system includes a search engine and a layout engine coupled thereto. The layout engine is configured to select a layout model according to the associated metadata for the identified document component, which metadata further includes information identifying link(s) between the document component and one or more other document components.
Abstract: A panoptic visualization document layout system includes a search engine and a layout engine coupled thereto. The search engine is configured to identify a document component including requested media content from a panoptic visualization document collection having a plurality of document components each of which has associated metadata providing information about the respective document component. The layout engine is configured to select a layout model according to the associated metadata for the identified document component, which metadata further includes information identifying link(s) between the identified document component and one or more other document components. The layout engine is configured to retrieve document components including the identified document component and other document component(s), and generate a layout of panoptically-arranged visual representations of the retrieved document components according to the selected layout model, and the retrieved document components and associated metadata. And the layout engine is configured to communicate the layout.

Proceedings ArticleDOI
25 Oct 2012
TL;DR: The system for text extraction on images taken by grabbing the content of the TV screen is presented; it can be used in a functional verification system to test and functionally verify TV set operation.
Abstract: Optical character recognition (OCR) is a very active area of research and has become very successful in pattern recognition. It is based on algorithms for machine vision and artificial intelligence and used in developing algorithms for reading text on images, e.g. reading registration plates, scanned books and documents, etc. This paper presents the system for text extraction on the image taken by grabbing the content of the TV screen. The preparation steps for OCR are developed which detect the text regions in an image. An open-source algorithm for OCR is then run to read the text regions. The system is used as a part of Black Box Testing system in order to test and functionally verify TV set operation. After reading the text regions, comparison with the expected text is performed to make a final pass/fail decision for the test case. The system successfully read the text from the TV screen and can be used in a functional verification system.

Proceedings Article
24 Apr 2012
TL;DR: It is shown that text line detection can be accurately solved using a formal methodology, as opposed to most of the proposed heuristic approaches found in the literature.
Abstract: Document layout analysis is an important task needed for handwritten text recognition among other applications. Text layout commonly found in handwritten legacy documents is in the form of one or more paragraphs composed of parallel text lines. An approach for handwritten text line detection is presented which uses machine-learning techniques and methods widely used in natural language processing. It is shown that text line detection can be accurately solved using a formal methodology, as opposed to most of the proposed heuristic approaches found in the literature. Experimental results show the impact of using increasingly constrained "vertical layout language models" in text line detection accuracy.

Patent
Thomas Robert Park1
16 Jan 2012
TL;DR: In this article, a structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content.
Abstract: Technologies are described herein for classifying structured documents based on the structure of the document. A structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of its semantic textual content. The text string may be broken into N-grams utilizing a sliding window, and a classifier trained from similar structured documents labeled as belonging to one of a number of document classes is utilized to determine a probability that the document belongs to each of the document classes based on the N-grams.
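The structure-string and sliding-window steps can be sketched as follows. Encoding the structure as open/close tag tokens is an assumption for illustration; the patent leaves the exact structural encoding open.

```python
import re

def structure_ngrams(doc, n=3):
    """Build a structure representation of a markup document by keeping
    only the element tokens (open/close tag names) and discarding the
    textual content, then slide an n-gram window over the token stream.
    The resulting n-grams would feed a conventional classifier."""
    tokens = re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*", doc)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Two documents with identical wording but different markup structure produce different n-gram sets, which is the point of structure-based classification.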

Journal ArticleDOI
Jorge Moraleda1
TL;DR: A method is presented that addresses image matching from partial blurry images by casting it as a problem of text retrieval, which allows it to leverage existing text document retrieval techniques and achieve efficiency and scalability similar to text search applications.

Patent
28 Dec 2012
TL;DR: In this paper, a method is presented for authenticating a printed document which carries barcodes that encode authentication data, including word bounding boxes for each word in the original document image and data for reconstructing the original image.
Abstract: A method for authenticating a printed document which carries barcode that encode authentication data, including word bounding boxes for each word in the original document image and data for reconstructing the original image. The printed document is scanned to generate a target document image, which is then segmented into text words. The word bounding boxes of the original and target document images are used to align the target document image. Then, each word in the original document image is compared to corresponding words in the target document image using word difference map and Hausdorff distance between them. Symbols of the original document image are further compared to corresponding symbols in the target document image using feature comparison, symbol difference map and Hausdorff distance comparison, and point matching. These various comparison results can identify alterations in the target document with respect to the original document, which can be visualized.
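The Hausdorff distance used for the word and symbol comparisons above can be sketched directly; here the two point sets would be the ink-pixel coordinates of corresponding words in the original and target images.

```python
import numpy as np

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two 2-D point sets: the
    largest distance from any point in one set to its nearest
    neighbour in the other. Identical shapes give 0; extra or shifted
    strokes in an altered word push the distance up."""
    a = np.asarray(points_a, float)
    b = np.asarray(points_b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

This all-pairs formulation is O(|A|·|B|), which is acceptable for single-word point sets; distance-transform tricks would speed up larger comparisons.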

Dissertation
06 Dec 2012
TL;DR: A number of improvements are demonstrated: on separating text columns when one is situated very close to another; on preventing the contents of a table cell from being merged with those of adjacent cells; and on preventing regions inside a frame from being merged with surrounding text regions, especially side notes, even when the latter are written in a font similar to that of the text body.
Abstract: Document page segmentation is one of the most crucial steps in document image analysis. It ideally aims to explain the full structure of any document page, distinguishing text zones, graphics, photographs, halftones, figures, tables, etc. Although several attempts at achieving correct page segmentation results have been made to date, many difficulties remain. The leader of the project in the framework of which this PhD work has been funded (*) uses a complete processing chain in which page segmentation mistakes are manually corrected by human operators. Aside from the costs this represents, it demands the tuning of a large number of parameters; moreover, some segmentation mistakes occasionally escape the vigilance of the operators. Current automated page segmentation methods are well accepted for clean printed documents, but they often fail to separate regions in handwritten documents when the document layout structure is loosely defined or when side notes are present inside the page. Moreover, tables and advertisements bring additional challenges for region segmentation algorithms. Our method addresses these problems. The method is divided into four parts: 1) unlike most popular page segmentation methods, we first separate the text and graphics components of the page using a boosted decision tree classifier; 2) the separated text and graphics components are used, among other features, to separate columns of text in a two-dimensional conditional random field framework; 3) a text line detection method based on piecewise projection profiles is then applied to detect text lines with respect to text region boundaries; 4) finally, a new paragraph detection method, trained on common models of paragraphs, is applied to the text lines to find paragraphs based on the geometric appearance of the text lines and their indentations. Our contribution over existing work lies, in essence, in the use or adaptation of algorithms borrowed from the machine learning literature to solve difficult cases. Indeed, we demonstrate a number of improvements: on separating text columns when one is situated very close to the other; on preventing the contents of a cell in a table from being merged with the contents of other adjacent cells; and on preventing regions inside a frame from being merged with other surrounding text regions, especially side notes, even when the latter are written using a font similar to that of the text body. A quantitative assessment, and a comparison of the performance of our method with competitive algorithms using widely acknowledged metrics and evaluation methodologies, is also provided to a large extent. (*) This PhD thesis was funded by the Conseil General de Seine-Saint-Denis, through the FUI6 project Demat-Factory, led by Safig SA.
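The piecewise projection-profile idea in part 3 can be sketched as follows. This is a simplified illustration, not the thesis's implementation: the page is split into vertical strips, a horizontal ink profile is computed per strip, and runs of inked rows become text-line candidates. The strip count and gap-merging threshold are assumed parameters.

```python
import numpy as np

def detect_text_lines(binary_img, n_strips=4, min_gap=2):
    """Detect text-line segments via piecewise horizontal projection profiles.

    binary_img: 2D array, 1 = ink, 0 = background.
    Each vertical strip is analysed independently, which tolerates the
    skewed or curved lines common in handwritten pages.
    Returns, per strip, a list of (start_row, end_row) line segments.
    """
    h, w = binary_img.shape
    strip_w = w // n_strips
    lines_per_strip = []
    for s in range(n_strips):
        strip = binary_img[:, s * strip_w:(s + 1) * strip_w]
        profile = strip.sum(axis=1)          # ink pixels per row
        ink_rows = profile > 0
        segments, start = [], None
        for r in range(h):
            if ink_rows[r] and start is None:
                start = r                    # a line segment begins
            elif not ink_rows[r] and start is not None:
                segments.append((start, r - 1))
                start = None
        if start is not None:
            segments.append((start, h - 1))
        # merge segments separated by a blank gap smaller than min_gap
        merged = []
        for seg in segments:
            if merged and seg[0] - merged[-1][1] - 1 < min_gap:
                merged[-1] = (merged[-1][0], seg[1])
            else:
                merged.append(seg)
        lines_per_strip.append(merged)
    return lines_per_strip
```

Aligning corresponding segments across neighbouring strips (not shown) would then yield full text lines that respect the detected region boundaries.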

Proceedings ArticleDOI
18 Sep 2012
TL;DR: The method was evaluated on the benchmarking dataset of the International Document Image Binarization Contest (DIBCO 2011) and shows promising results.
Abstract: Document image binarization is an initial though critical stage towards the recognition of the text components of a document. This paper describes an efficient method based on mathematical morphology for extracting text regions from degraded handwritten document images. The basic stages of our approach are: a) top-hat-by-reconstruction to produce a filtered image with a reasonably even background, b) region growing starting from a set of seed points and attaching to each seed similar-intensity neighboring pixels, and c) conditional extension of the initially detected text regions based on the values of the second derivative of the filtered image. The method was evaluated on the benchmarking dataset of the International Document Image Binarization Contest (DIBCO 2011) and shows promising results.
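Stage b), region growing from seed points, can be illustrated with a minimal sketch. This is not the paper's implementation: the morphological filtering and seed selection stages are omitted, and the intensity tolerance `tol` is an assumed parameter.

```python
from collections import deque
import numpy as np

def region_grow(img, seeds, tol=10):
    """Grow regions from seed pixels over a grayscale image.

    A 4-connected neighbour is attached to a region when its intensity
    differs from the seed's intensity by at most `tol`.
    Returns a boolean mask of the grown regions.
    """
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    q = deque()
    for (r, c) in seeds:
        mask[r, c] = True
        q.append((r, c, int(img[r, c])))     # carry the seed intensity along
    while q:
        r, c, ref = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and not mask[nr, nc]
                    and abs(int(img[nr, nc]) - ref) <= tol):
                mask[nr, nc] = True
                q.append((nr, nc, ref))
    return mask
```

In the paper's pipeline the seeds would come from the top-hat-filtered image, and the resulting mask would then be conditionally extended using the second derivative of that image.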

Book ChapterDOI
13 Jul 2012
TL;DR: This paper investigates whether the clustering of text documents can be improved if the text segments of two documents are utilized while calculating the similarity between them, and examines the effects of combining the suggested inter-document similarities with traditional inter-document similarities following a simple approach.
Abstract: Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Such single-term analysis treats the whole document as a single semantic unit and thus ignores other semantic units like sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether the clustering of text documents can be improved if the text segments of two documents are utilized while calculating the similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
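One way to blend passage-level and document-level similarities can be sketched as follows. This is an assumed formulation, not the paper's exact combination scheme: the weight `alpha` and the best-match aggregation over passages are illustrative choices.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_similarity(doc1_passages, doc2_passages, alpha=0.5):
    """Blend whole-document BOW similarity with a passage-level score.

    The passage-level score is the mean, over doc1's passages, of the
    best cosine match against any passage of doc2; `alpha` weights the
    traditional document-level similarity.
    """
    bow1 = Counter(w for p in doc1_passages for w in p.split())
    bow2 = Counter(w for p in doc2_passages for w in p.split())
    doc_sim = cosine(bow1, bow2)
    p1 = [Counter(p.split()) for p in doc1_passages]
    p2 = [Counter(p.split()) for p in doc2_passages]
    pass_sim = sum(max(cosine(a, b) for b in p2) for a in p1) / len(p1)
    return alpha * doc_sim + (1 - alpha) * pass_sim
```

The combined score can then be fed to any standard clustering algorithm in place of the plain BOW similarity.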