Proceedings ArticleDOI

Plan2Text: A framework for describing building floor plan images from first person perspective

TL;DR: It is demonstrated that the proposed end-to-end framework for first-person-vision-based textual description synthesis of building floor plans gives state-of-the-art performance on challenging, real-world floor plan images.
Abstract: We focus on the synthesis of textual descriptions from a given building floor plan image, based on the first-person vision perspective. Tasks like symbol spotting, wall and decor segmentation, and semantic and perceptual segmentation have been performed on floor plans in the past. Here, for the first time, we propose an end-to-end framework for first-person-vision-based textual description synthesis of building floor plans. We demonstrate (qualitatively and quantitatively) that the proposed framework gives state-of-the-art performance on challenging, real-world floor plan images. Potential applications of this work include understanding floor plans, stability analysis of buildings, and retrieval.
Citations
Proceedings ArticleDOI
01 Sep 2019
TL;DR: An extensive experimental study is presented on the proposed dataset for tasks like furniture localization in a floor plan and caption and description generation, showing the utility of BRIDGE.
Abstract: In this paper, a large-scale public dataset containing floor plan images and their annotations is presented. The BRIDGE (Building plan Repository for Image Description Generation, and Evaluation) dataset contains more than 13,000 floor plan images and annotations collected from various websites, as well as publicly available floor plan images in the research domain. The images in BRIDGE also have annotations for symbols, region graphs, and paragraph descriptions. The BRIDGE dataset will be useful for symbol spotting, caption and description generation, scene graph synthesis, retrieval, and many other tasks involving building plan parsing. In this paper, we also present an extensive experimental study on the proposed dataset for tasks like furniture localization in a floor plan and caption and description generation, showing the utility of BRIDGE.

11 citations


Additional excerpts

  • ...In [14], [15], the authors used handcrafted features for identifying decor symbols and room information and for generating region-wise captions....


Journal ArticleDOI
TL;DR: In this paper, the authors propose a framework called SUGAMAN (Supervised and Unified framework using Grammar and Annotation Model for Access and Navigation) for describing a floor plan and giving directions for obstacle-free movement within a building.
Abstract: In this study, the authors propose a framework, SUGAMAN (Supervised and Unified framework using Grammar and Annotation Model for Access and Navigation). SUGAMAN is a Hindi word meaning 'easy passage from one place to another'. SUGAMAN synthesises a textual description from a given floor plan image, usable by the visually impaired to navigate by understanding the arrangement of rooms and furniture. It is the first framework for describing a floor plan and giving directions for obstacle-free movement within a building. The model learns five classes of room categories from 1355 room image samples under a supervised learning paradigm. These learned annotations are fed into a description synthesis framework to yield a holistic description of a floor plan image. The authors demonstrate the performance of various supervised classifiers on room learning and provide a comparative analysis of system-generated and human-written descriptions. The contributions of this study include a novel framework for description generation from document images with graphics, a new feature representation for floor plans, text annotations for a publicly available data set, and an algorithm for door-to-door obstacle-avoidance navigation. This work can be applied to areas like understanding floor plans, the design of historical monuments, and retrieval.

8 citations

Proceedings ArticleDOI
15 Oct 2018
TL;DR: This work proposes an end-to-end framework (ASYSST) for textual description synthesis from digitized building floor plans and introduces a novel Bag of Decor feature to learn 5 classes of a room from 1355 samples under a supervised learning paradigm.
Abstract: In an indoor scenario, the visually impaired do not have information about their surroundings and find it difficult to navigate from room to room. Sensor-based solutions are expensive and may not always be comfortable for the end users. In this paper, we focus on the problem of synthesizing a textual description from a given floor plan image to assist the visually impaired. The textual description, in addition to text-reading software, can aid a visually impaired person while moving inside a building. In this work, for the first time, we propose an end-to-end framework (ASYSST) for textual description synthesis from digitized building floor plans. We introduce a novel Bag of Decor (BoD) feature to learn 5 classes of a room from 1355 samples under a supervised learning paradigm. These learned labels are fed into a description synthesis framework to yield a holistic description of a floor plan image. Experimental analysis on a real, publicly available floor plan dataset proves the superiority of our framework.
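The abstract gives enough detail to sketch the idea: each room becomes a histogram of detected decor symbols, and a supervised classifier maps that histogram to a room class. A minimal sketch follows; the vocabulary, toy data, and SVM choice are our illustrative assumptions, not the paper's.

```python
# Minimal sketch of a Bag-of-Decor (BoD) style room classifier, as we read the
# abstract. Symbol detection is assumed already done upstream.
import numpy as np
from sklearn.svm import SVC

DECOR_VOCAB = ["bed", "sofa", "sink", "stove", "bathtub"]  # hypothetical vocabulary

def bod_feature(symbols_in_room):
    """Normalized histogram of decor-symbol counts over a fixed vocabulary."""
    hist = np.array([symbols_in_room.count(s) for s in DECOR_VOCAB], dtype=float)
    return hist / max(hist.sum(), 1.0)

# Toy samples standing in for the paper's 1355 annotated rooms:
rooms = [["bed"], ["sofa", "sofa"], ["sink", "stove"], ["bathtub", "sink"]]
labels = ["bedroom", "living_room", "kitchen", "bathroom"]
clf = SVC(kernel="linear").fit([bod_feature(r) for r in rooms], labels)
print(clf.predict([bod_feature(["stove", "sink"])]))  # expected: ['kitchen']
```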

4 citations


Cites background from "Plan2Text: A framework for describi..."

  • ...Very recently, attempts were made to extend the same to document images as well [8]....


Journal ArticleDOI
TL;DR: In this paper, the authors proposed two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images.
Abstract: Image captioning is a widely known problem in the area of AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and providing architectural solutions. Several methods have been explored in the literature for generating captions or semi-structured descriptions from floor plan images. Since a caption alone is insufficient to capture fine-grained details, researchers have also proposed generating descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, making them difficult to use in real-time scenarios. This paper offers two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images. These two models take advantage of modern deep neural networks for visual feature extraction and text generation. The difference between the two models is in the way they take input from the floor plan image. The DSIC model takes only visual features automatically extracted by a deep neural network, while the TBDG model learns textual captions extracted from input floor plan images together with paragraphs. The specific keywords generated in TBDG, and understanding them with paragraphs, make it more robust on general floor plan images. Experiments were carried out on a large-scale publicly available dataset and compared with state-of-the-art techniques to show the proposed models' superiority.

4 citations

Journal ArticleDOI
TL;DR: A Mask R-CNN-based semi-supervised approach is presented that provides pixel-to-pixel alignment to generate individual annotation masks for each class, mining the inter-class similarity in order to detect objects more accurately with less labeled data.
Abstract: Research on object detection using semi-supervised methods has been growing in the past few years. We examine the intersection of these two areas for floor-plan objects to promote the research objective of detecting more accurate objects with less labeled data. The floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student-teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network uses both the unlabeled data with the pseudo-boxes and the labeled data as ground truth for training. It learns representations of furniture items by combining labeled and unlabeled data. On the Mask R-CNN detector with a ResNet-101 backbone network, the proposed approach achieves a mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5%, and 10% labeled data, respectively. Our experiments affirm the efficiency of the proposed approach, as it outperforms previous semi-supervised approaches using only 1% of the labels.
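As a reading aid, here is a minimal sketch of the student-teacher pseudo-labeling loop the abstract describes, assuming a torchvision-style detection API (the model returns a loss dict when given images and targets in training mode). The paper's actual implementation builds on Mask R-CNN and is more involved; function names and the threshold are our assumptions.

```python
# Sketch of the student-teacher loop described above (our paraphrase in code,
# not the authors' implementation). Assumes torchvision-style detectors.
import torch

@torch.no_grad()
def make_pseudo_boxes(teacher, unlabeled_images, score_thresh=0.9):
    """Teacher predicts on unlabeled images; keep only confident detections."""
    teacher.eval()
    outputs = teacher(unlabeled_images)
    return [
        {k: out[k][out["scores"] > score_thresh] for k in ("boxes", "labels")}
        for out in outputs
    ]

def student_step(student, optimizer, labeled_images, gt_targets,
                 unlabeled_images, pseudo_targets):
    """Student trains on ground-truth and pseudo-labeled batches together."""
    student.train()
    losses = student(labeled_images + unlabeled_images,
                     gt_targets + pseudo_targets)  # returns a dict of losses
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```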

4 citations

References
Journal ArticleDOI
TL;DR: A generalized gray-scale and rotation invariant operator presentation is derived that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution, and a method is presented for combining multiple operators for multiresolution analysis.
Abstract: Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture, and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution, and present a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity, as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns.
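For orientation, a minimal sketch of the operator the abstract describes, using scikit-image's implementation; the library choice and the P, R defaults are ours, and Plan2Text may compute its LBP features differently.

```python
# Sketch: rotation-invariant uniform LBP histogram, as summarized above,
# via scikit-image. Parameters P (neighbors) and R (radius) are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, points=8, radius=1):
    """Normalized occurrence histogram of uniform LBP codes."""
    codes = local_binary_pattern(gray_image, P=points, R=radius, method="uniform")
    # method="uniform" yields P + 2 codes: P + 1 rotation-invariant uniform
    # patterns plus one bin collecting all non-uniform patterns.
    n_bins = points + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()  # normalize so histograms are comparable
```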

14,245 citations


"Plan2Text: A framework for describi..." refers methods in this paper

  • ...For each given canonical material texture, a mask is calculated and features are extracted using Local Binary Patterns (LBP) (a detailed introduction to LBP can be found in [13]), and a normalized histogram (H_B, H_C, H_W) is generated....


  • ...In [13], a method for recognizing “uniform” binary patterns was proposed....


Proceedings Article
25 Jul 2004
TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, together with their evaluations.
Abstract: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
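To make the n-gram counting concrete, here is a minimal sketch of ROUGE-N recall (clipped n-gram overlap against a single reference); the official package additionally handles multiple references, stemming, and the L, W, and S variants.

```python
# Sketch of ROUGE-N recall as described above: clipped n-gram overlap between
# a candidate summary and one human reference. Simplified for illustration.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(cand[g], count) for g, count in ref.items())  # clipped
    return overlap / max(sum(ref.values()), 1)

# e.g. rouge_n_recall("the room has a sofa", "the room contains a sofa") -> 0.8
```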

9,293 citations


"Plan2Text: A framework for describi..." refers methods in this paper

  • ...However, to quantify the quality of the description synthesis we have computed the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [20] score....


  • ...Then we compute the ROUGE score by taking the average of corpus-level scores....


  • ...ROUGE scores (Tab....


  • ...For ROUGE evaluation we invite volunteers to write a description of the floor plan image....


Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels and is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
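A minimal usage sketch of the algorithm via scikit-image; the file name and parameter values are our assumptions (Plan2Text used SLIC only as a baseline for material segmentation).

```python
# Sketch: SLIC superpixels with scikit-image. Input path and parameters are
# illustrative, not taken from either paper.
from skimage import color, io, segmentation

image = io.imread("floor_plan.png")  # hypothetical input image
labels = segmentation.slic(image, n_segments=250, compactness=10, start_label=1)
# Each pixel now carries a superpixel label; render mean-color regions:
overlay = color.label2rgb(labels, image, kind="avg")
```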

7,849 citations


"Plan2Text: A framework for describi..." refers methods in this paper

  • ...For material segmentation, we compared our algorithm with Simple Linear Iterative Clustering (SLIC) [19]....


  • ...IV to show that our algorithm (in bold letters) for material segmentation outperformed SLIC....


Journal ArticleDOI
TL;DR: In this paper, Satorra and Bentler's scaling corrections are used to improve the chi-square approximation of goodness-of-fit test statistics in small samples, large models, and nonnormal data.
Abstract: A family of scaling corrections aimed to improve the chi-square approximation of goodness-of-fit test statistics in small samples, large models, and nonnormal data was proposed in Satorra and Bentler (1994). For structural equation models, Satorra-Bentler's (SB) scaling corrections are available in standard computer software. Often, however, the interest is not in the overall fit of a model, but in a test of the restrictions that a null model, say $M_0$, implies on a less restricted one, $M_1$. If $T_0$ and $T_1$ denote the goodness-of-fit test statistics associated with $M_0$ and $M_1$, respectively, then typically the difference $T_d = T_0 - T_1$ is used as a chi-square test statistic with degrees of freedom equal to the difference in the number of independent parameters estimated under the models $M_0$ and $M_1$. As in the case of the goodness-of-fit test, it is of interest to scale the statistic $T_d$ in order to improve its chi-square approximation in realistic, that is, nonasymptotic and nonnormal, applications. In a recent paper, Satorra (2000) shows that the difference between two SB scaled test statistics for overall model fit does not yield the correct SB scaled difference test statistic. Satorra developed an expression that permits scaling the difference test statistic, but his formula has some practical limitations, since it requires heavy computations that are not available in standard computer software. The purpose of the present paper is to provide an easy way to compute the scaled difference chi-square statistic from the scaled goodness-of-fit test statistics of models $M_0$ and $M_1$. A Monte Carlo study is provided to illustrate the performance of the competing statistics.
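The paper's recipe can be stated compactly in the abstract's notation; a sketch following Satorra and Bentler (2001), where $\bar{T}_i$ denotes the SB scaled statistic and $d_i$ the degrees of freedom of model $M_i$:

```latex
% Scaled difference chi-square, sketched from the abstract's notation.
% \hat{c}_i is the scaling correction recovered from the scaled statistic of M_i.
\bar{T}_d = \frac{T_0 - T_1}{\hat{c}_d},
\qquad
\hat{c}_d = \frac{d_0\,\hat{c}_0 - d_1\,\hat{c}_1}{d_0 - d_1},
\qquad
\hat{c}_i = \frac{T_i}{\bar{T}_i}.
```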

4,011 citations


"Plan2Text: A framework for describi..." refers background in this paper

  • ...where $k \in \{B, W, C\}$, and $D_k$ is the distance [17] between a pair of histograms of material $k$; $\beta$ and $\xi$ are the bins corresponding to the histograms of the current segment $s$ and material $k$, respectively, and $n$ is the number of bins.... (A plausible form of $D_k$ is sketched below.)

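The excerpt does not spell the formula out; a plausible reconstruction, which is our assumption given that [17] is cited for a chi-square statistic, is the chi-square histogram distance:

```latex
% Hypothetical reconstruction of D_k; the excerpt only names the quantities.
% beta_i, xi_i: bin i of the segment's histogram and of material k's histogram.
D_k = \sum_{i=1}^{n} \frac{(\beta_i - \xi_i)^2}{\beta_i + \xi_i}
```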

Journal ArticleDOI
TL;DR: A new heuristic for feature detection is presented and, using machine learning, a feature detector is derived from this which can fully process live PAL video using less than 5 percent of the available processing time.
Abstract: The repeatability and efficiency of a corner detector determines how likely it is to be useful in a real-world application. The repeatability is important because the same scene viewed from different positions should yield features which correspond to the same real-world 3D locations. The efficiency is important because this determines whether the detector combined with further processing can operate at frame rate. Three advances are described in this paper. First, we present a new heuristic for feature detection and, using machine learning, we derive a feature detector from this which can fully process live PAL video using less than 5 percent of the available processing time. By comparison, most other detectors cannot even operate at frame rate (Harris detector 115 percent, SIFT 195 percent). Second, we generalize the detector, allowing it to be optimized for repeatability, with little loss of efficiency. Third, we carry out a rigorous comparison of corner detectors based on the above repeatability criterion applied to 3D scenes. We show that, despite being principally constructed for speed, on these stringent tests, our heuristic detector significantly outperforms existing feature detectors. Finally, the comparison demonstrates that using machine learning produces significant improvements in repeatability, yielding a detector that is both very fast and of very high quality.
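A minimal usage sketch with OpenCV's built-in implementation; the input path and threshold are our assumptions, and the citing paper applies FAST to document-image blocks rather than live video.

```python
# Sketch: FAST corner detection via OpenCV. Path and threshold are
# illustrative assumptions, not taken from either paper.
import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)
print(f"detected {len(keypoints)} corners")
```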

1,847 citations


"Plan2Text: A framework for describi..." refers methods in this paper

  • ...However, in [9], the image was divided into blocks, and blocks with higher density were considered as text, using the FAST [10] method....
