Book Chapter DOI

Mining Ambiguities Using Pixel-Based Content Extraction

TL;DR: The present study focuses on extracting information from available text and media-type data after it is converted into digital form, using a basic pixel-map representation of the data processed through numerical means so that issues of language, text script and format do not pose problems.
Abstract: Internet and mobile computing have become a major societal force, addressing down-to-earth issues ranging from online shopping to securing driving information in unknown places. Here the major concern of communication is that the Web content should reach the user in a short period of time. So information extraction needs to work at a basic level and be easy to implement without depending on any major software. The present study focuses on extraction of information from the available text and media-type data after it is converted into digital form. The approach uses the basic pixel-map representation of the data and converts it through numerical means, so that issues of language, text script and format do not pose problems. With the numerically converted data, key clusters, similar to keywords used in any search method, are developed, and content is extracted through different approaches that trade computational effort for ease of implementation. In one approach, statistical features of the images are extracted from the pixel map of the image. The extracted features are presented to a fuzzy clustering algorithm; the similarity metric is Euclidean distance, and the accuracy is compared and presented. The concept of ambiguity is introduced in the paper by comparing objects like ‘computer,’ for which an explicit content representation is possible, with an abstract subject like ‘soft-computing,’ where vagueness and ambiguity are possible in the representation. With this as the objective, the approaches used for content extraction are compared, showing how content can be extracted within certain bounds.
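
As a rough illustration of the pipeline the abstract outlines (pixel map → statistical features → fuzzy clustering with Euclidean distance), the sketch below shows one minimal way such a step could be coded. The feature set and the bare-bones fuzzy c-means routine are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming grayscale pixel maps and a small, hand-picked feature set.
import numpy as np

def pixel_map_features(pixmap: np.ndarray) -> np.ndarray:
    """A few simple statistics over a grayscale pixel map (values 0..255)."""
    p = pixmap.astype(float) / 255.0
    return np.array([
        p.mean(),                           # average intensity
        p.std(),                            # contrast
        (p > 0.5).mean(),                   # fraction of bright pixels
        np.abs(np.diff(p, axis=1)).mean(),  # horizontal transition density
    ])

def fuzzy_c_means(X, c=2, m=2.0, iters=100, eps=1e-5, seed=0):
    """Bare-bones fuzzy c-means with Euclidean distance; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / d ** (2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < eps:
            return centers, U_new
        U = U_new
    return centers, U

# Toy usage: group feature vectors of a batch of pixel maps into two "key clusters".
pixmaps = [np.random.randint(0, 256, (32, 32)) for _ in range(10)]
X = np.vstack([pixel_map_features(p) for p in pixmaps])
centers, U = fuzzy_c_means(X, c=2)
print(U.argmax(axis=1))  # hard assignment of each pixel map to a cluster
```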
Citations
Journal Article DOI
TL;DR: The proposed method for processing large data in the context of content extraction from heterogeneous web pages containing multi-lingual information, in the Indian context, is extended to heritage data, and the paper concludes how complex it is, from a computing point of view, to extract the knowledge and essence of texts like the Bhagavad Gita.
Abstract: Deep learning is becoming increasingly necessary as data, information, and web and mobile interaction proliferate, with the form, type and volume of data becoming easy to store, retrieve and process. This necessity is felt not only in science and engineering but also in social and commercial internet activity. The paper explores some ideas of deep learning to process large data in the context of content extraction from a heterogeneous web page containing multi-lingual information, in the Indian context. This basis is used to explore how Indian heritage, over several thousands of years, has been able to maintain information and knowledge in certain areas through oral, palm-leaf and stone-cut data forms. The paper extends the proposed method to some heritage data and concludes how complex it is, from a computing point of view, to extract the knowledge and essence of texts like the Bhagavad Gita.

Cites background or methods from "Mining Ambiguities Using Pixel-Based Content Extraction"

  • ...The idea of pixel map processing is also extended to developing key clusters so that language and text anomalies can be taken care of [4], and some case studies on these aspects have been reported earlier [5,6], with applications to different kinds of data having image, language and text variations....


  • ...After pixel processing and extracting attributes [3,4], bar charts of the wide variation are shown in Fig....


Journal Article DOI
TL;DR: This study aims at extracting information from the available data after it is digitized; the method is generic and uses pixel maps of the data, which are software- and language-independent.
Abstract: Objectives: The internet is a repository of information; it contains enormous information about the past and present, which can be used to predict the future. To know the unknown, users are inclined towards searching the internet rather than referencing the library because of ease of availability. This requirement creates the need to find the content of a web page within the shortest period of time, irrespective of the form the page takes. So information and content extraction need to be at a basic, generic level and easy to implement without depending on any major software. Methods: The study aims at extraction of information from the available data after the data is digitized. The digitized data is converted to pixel maps, which are universal; the pixel map does not face the issues of the form and format of the web page content. A statistical method is incorporated to extract the attributes of the images so that issues of language, and hence text script and format, do not pose problems. The extracted features are presented to the back propagation algorithm. Findings: The accuracy is presented, and it is shown how content extraction within certain bounds could be possible, tested using unstructured word sets chosen from web pages. The method is demonstrated for monolingual, multi-lingual and transliterated documents so that its applicability is universal. Applications/Improvement: The method is generic and uses pixel maps of the data, which are software- and language-independent.
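
The Methods paragraph above (pixel-map statistics fed to a back propagation algorithm) could, in spirit, look like the following sketch: a one-hidden-layer network trained with plain backpropagation on feature vectors. The network size, learning rate and the random stand-in features are assumptions for illustration only.

```python
# Minimal sketch, assuming binary labels (word image belongs / does not belong to a key cluster).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, hidden=8, lr=0.5, epochs=2000, seed=0):
    """X: (n, d) feature matrix, y: (n, 1) binary labels. Returns learned weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1));          b2 = np.zeros(1)
    for _ in range(epochs):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass (squared-error loss, sigmoid units)
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
    return W1, b1, W2, b2

# Toy usage with random features standing in for real pixel-map statistics.
X = np.random.rand(40, 4)
y = (X[:, 0] > 0.5).astype(float).reshape(-1, 1)
W1, b1, W2, b2 = train_backprop(X, y)
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5
print("training accuracy:", (pred == (y > 0.5)).mean())
```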
References
Proceedings Article DOI
20 May 2003
TL;DR: This work develops a framework that employs an easily extensible set of techniques incorporating the advantages of previous work on content extraction; the key insight is to work with DOM trees rather than with raw HTML markup.
Abstract: Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques that incorporate the advantages of previous work on content extraction. Our key insight is to work with the DOM trees, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.

392 citations
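
The DOM-based idea summarized in the abstract above can be pictured with a short sketch: parse the page into a tree, drop nodes that never carry article text, and prune blocks dominated by link anchors. The tag list and the 0.5 link-text threshold are illustrative assumptions, not the framework's actual filter set; the sketch uses the third-party beautifulsoup4 parser.

```python
# Simplified sketch of DOM-tree clutter removal (assumed filters, not the published proxy's).
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # 1. Remove elements that never carry article text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside", "iframe"]):
        tag.decompose()
    # 2. Prune blocks whose text is mostly link anchors (menus, link rails, ads).
    for block in soup.find_all(["div", "ul", "table"]):
        text = block.get_text(" ", strip=True)
        link_text = " ".join(a.get_text(" ", strip=True) for a in block.find_all("a"))
        if text and len(link_text) / len(text) > 0.5:
            block.decompose()
    return soup.get_text(" ", strip=True)

# Toy usage: the navigation bar is stripped, the article paragraph survives.
print(extract_main_text("<html><body><nav>Home | About</nav>"
                        "<div><p>The actual article text of the page.</p></div>"
                        "</body></html>"))
```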

Proceedings Article DOI
01 Sep 2008
TL;DR: This work introduces content code blurring, a novel content extraction algorithm that iteratively identifies the long, homogeneously formatted regions that typically form a web document's main content, separating it from the additional contents.
Abstract: Most HTML documents on the world wide web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions, we show that for most documents content code blurring delivers the best results.

78 citations
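
The iterative idea described in the abstract above can be caricatured in a few lines: score every character of the HTML as content (1) or markup (0), repeatedly blur that score with a moving average, and keep the spans where the blurred score stays high. The window size, iteration count and 0.7 threshold here are illustrative assumptions, not the published algorithm's parameters.

```python
# Rough sketch of content code blurring on raw HTML (assumed parameters).
import re

def content_code_blurring(html: str, window: int = 40, iters: int = 5, thresh: float = 0.7) -> str:
    # Score each character: 1.0 for visible text, 0.0 for characters inside tags.
    signal, in_tag = [], False
    for ch in html:
        if ch == "<":
            in_tag = True
        signal.append(0.0 if in_tag else 1.0)
        if ch == ">":
            in_tag = False
    # Iteratively blur the score with a moving average (prefix sums keep it linear).
    n = len(signal)
    for _ in range(iters):
        prefix = [0.0]
        for v in signal:
            prefix.append(prefix[-1] + v)
        signal = [
            (prefix[min(n, i + window + 1)] - prefix[max(0, i - window)])
            / (min(n, i + window + 1) - max(0, i - window))
            for i in range(n)
        ]
    # Keep characters in high-scoring regions, then strip whatever tags remain.
    kept = "".join(ch for ch, s in zip(html, signal) if s >= thresh)
    return re.sub(r"<[^>]+>", " ", kept).strip()

# Toy usage: a long homogeneous paragraph survives, the link-heavy navigation does not.
sample = ("<div class='nav'><a href='#'>Home</a> <a href='#'>About</a></div>"
          "<p>" + "Long article sentence follows here. " * 20 + "</p>")
print(content_code_blurring(sample)[:80])
```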