Journal ArticleDOI

Knowledge-driven description synthesis for floor plan interpretation

TL;DR: In this paper, the authors proposed two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images.
Abstract: Image captioning is a widely known problem in the area of AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and providing architectural solutions. Several methods have been explored in the literature for generating captions or semi-structured descriptions from floor plan images. Since a caption alone is insufficient to capture fine-grained details, researchers have also proposed generating descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, making them difficult to use in real-time scenarios. This paper offers two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images. These two models take advantage of modern deep neural networks for visual feature extraction and text generation. The difference between the models lies in how they take input from the floor plan image. The DSIC model takes only visual features automatically extracted by a deep neural network, while the TBDG model learns textual captions extracted from input floor plan images together with paragraphs. The specific keywords generated in TBDG, and their grounding in paragraphs, make it more robust to general floor plan images. Experiments were carried out on a large-scale publicly available dataset, and the results were compared with state-of-the-art techniques to show the proposed models’ superiority.
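The paper's exact DSIC and TBDG architectures are not reproduced here; the following is a minimal, hypothetical sketch of the general recipe the abstract describes for DSIC: a deep convolutional encoder extracts visual features from the floor plan image, and a recurrent decoder generates the description word by word. The ResNet-18 backbone, dimensions, and vocabulary size are illustrative assumptions, not the authors' settings.

import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """CNN encoder + LSTM decoder; an illustrative stand-in, not the authors' model."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)                        # visual feature extractor
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                       # map image features to word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)                         # (B, 512)
        feats = self.img_proj(feats).unsqueeze(1)                       # image feature as the first "token"
        words = self.embed(captions)                                    # (B, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.out(hidden)                                         # logits over the vocabulary

model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 5000])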
Citations
Journal ArticleDOI
TL;DR: In this article, the authors provide a critical review of the methodologies and tools of rule-based and learning-based approaches from 1995 to 2021, concluding that most research relies on a particular plan style and faces problems with generalization and comparison due to the lack of a standard metric and limited public datasets.

9 citations

Journal ArticleDOI
TL;DR: In this article, the authors used 3DPlanNet Ensemble methods, which incorporate rule-based heuristics, to learn from only a small amount of data (30 floor plan images), producing a wall accuracy of more than 95% and an object accuracy similar to that of a previous study that used a large amount of training data.
Abstract: Research on converting 2D raster drawings into 3D vector data has a long history in the field of pattern recognition. Before the advent of machine learning, existing studies were based on heuristics and rules. In recent years there have been several studies employing deep learning, but great effort was required to secure large amounts of training data. In this study, to overcome these limitations, we used 3DPlanNet Ensemble methods incorporating rule-based heuristics to learn from only a small amount of data (30 floor plan images). Experimentally, this method produced a wall accuracy of more than 95% and an object accuracy similar to that of a previous study that used a large amount of training data. In addition, 2D drawings without dimension information were converted to ground-truth sizes with an accuracy of 97% or more, and structural data were created in the form of 3D models with layers separated for each object type, such as walls, doors, windows, and rooms. Using the proposed 3DPlanNet Ensemble, we generated 110,000 items of 3D vector data end to end from 2D raster drawings with a wall accuracy of 95% or more.

4 citations

Journal ArticleDOI
TL;DR: In this paper, the authors automatically extracted data on architectural heritage and structured it around spatial expression so that it can serve as groundwork for mass content creation; each derived result was mapped to Indoor Affordance Spaces to test whether information can be inferred from the interconnection relationships.
Abstract: Recent developments in experience technologies such as augmented reality (AR) and virtual reality (VR) have made it possible to receive content about a site on location and to experience architectural heritage in a virtual space. Despite the development of experience devices, if the quantity and quality of content are not sufficient, immersive user experiences are bound to be limited. Considerable money, manpower, and time are required to turn a building into experiential content, and tasks such as building a database occupy a large proportion of the overall process. Therefore, an automated method is needed for building the data that form the basis for content creation. This study automatically extracted data on architectural heritage and structured it around spatial expression so that it can serve as groundwork for mass content creation. Specifically, this study devised a method to link and structure text and spatial data around an architectural spatial data model. Text and spatial data were extracted automatically using deep learning, and each derived result was mapped to Indoor Affordance Spaces, an indoor spatial data model, to test whether information can be inferred from the interconnection relationships. The spatial experience route inferred using the data model expresses the detailed area where a viewing element exists, based on the description method of the model, and shows the process of reconstructing an efficient movement line from topological relationships between spaces. The series of processes presented here showed sufficient applicability for extracting data and for connecting and using data models, which is useful for extracting and classifying information for content from massive raw data. This study also considered the specificity arising from architectural heritage and spatial information; the research concept can therefore be applied in exhibition and experience spaces such as architectural heritage sites, museums, and art galleries to create sources for content creation and inform content composition.
References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
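To make the mechanism in this abstract concrete, here is a single step of a standard LSTM cell in NumPy. This is the now-common formulation with a forget gate (added after the original 1997 paper); the additive cell update is the "constant error carousel" and the sigmoids are the multiplicative gates. All shapes and weights are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i, f, o, g = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)       # input, forget, output gates
    g = np.tanh(g)                                     # candidate cell input
    c = f * c_prev + i * g                             # additive update: the constant error carousel
    h = o * np.tanh(c)                                 # gated output
    return h, c

D, H = 8, 16
rng = np.random.default_rng(0)
h = c = np.zeros(H)
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
for x in rng.normal(size=(5, D)):                      # unroll over five time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)                                # (16,) (16,)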

72,897 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset aimed at advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, built by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
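For readers who want to inspect the per-instance annotations the abstract describes, the official pycocotools API can browse them directly; the annotation file path below is an assumption and depends on which COCO release was downloaded.

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")      # path is an assumption
cat_ids = coco.getCatIds(catNms=["person", "chair"])
img_ids = coco.getImgIds(catIds=cat_ids)               # images containing both categories
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
for ann in coco.loadAnns(ann_ids):
    print(ann["category_id"], ann["bbox"])             # per-instance box [x, y, w, h]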

30,462 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
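The abstract's key idea, detection as a single regression, amounts to predicting one tensor that encodes every box in the image. The toy PyTorch head below (not the original Darknet network; the grid size, box count, and backbone are illustrative assumptions) shows the S x S x (B*5 + C) output parameterization.

import torch
import torch.nn as nn

S, B, C = 7, 2, 20                                     # grid size, boxes per cell, classes

head = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(S),                           # toy backbone ending in an S x S feature map
    nn.Flatten(),
    nn.Linear(16 * S * S, S * S * (B * 5 + C)),
)

pred = head(torch.randn(1, 3, 448, 448)).view(1, S, S, B * 5 + C)
boxes = pred[..., :B * 5].view(1, S, S, B, 5)          # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                        # per-cell class scores
print(boxes.shape, class_probs.shape)                  # [1, 7, 7, 2, 5] and [1, 7, 7, 20]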

27,256 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection that processes images extremely rapidly while achieving high detection rates, built on a new image representation called the "integral image," which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
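The integral image trick from the abstract is easy to demonstrate: after a single cumulative-sum pass, the sum of pixel values over any rectangle takes only four array lookups, which is what lets the detector evaluate its rectangle features so quickly. A minimal NumPy sketch (array sizes are arbitrary):

import numpy as np

img = np.random.rand(480, 640)
# Pad with a zero row/column so that integral[r, c] == img[:r, :c].sum()
integral = np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def box_sum(integral, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] in constant time, using four lookups
    return integral[r1, c1] - integral[r0, c1] - integral[r1, c0] + integral[r0, c0]

assert np.isclose(box_sum(integral, 10, 20, 50, 60), img[10:50, 20:60].sum())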

18,620 citations

Proceedings ArticleDOI
Ross Girshick
07 Dec 2015
TL;DR: Fast R-CNN, a Fast Region-based Convolutional Network method for object detection, employs several innovations to improve training and testing speed while also increasing detection accuracy, and achieves a higher mAP on PASCAL VOC 2012.
Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
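The operation that lets Fast R-CNN share one convolutional pass across all object proposals is RoI pooling, which crops each proposal out of the feature map and pools it to a fixed size. A short hedged example using torchvision's built-in op (the feature map, image size, and proposal boxes are random placeholders, not values from the paper):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)                 # conv feature map for one 400x400 image
# Each proposal row: (batch_index, x1, y1, x2, y2) in input-image coordinates
proposals = torch.tensor([[0., 16., 16., 128., 160.],
                          [0., 200., 40., 360., 300.]])
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=50 / 400)
print(pooled.shape)                                    # torch.Size([2, 256, 7, 7])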

14,824 citations