ASYSST: A Framework for Synopsis Synthesis Empowering Visually Impaired

doi:10.1145/3264856.3264859

Home
/
Papers
/
ASYSST: A Framework for Synopsis Synthesis Empowering Visually Impaired

Proceedings Article•DOI•

ASYSST: A Framework for Synopsis Synthesis Empowering Visually Impaired

Shreya Goyal¹, Chiranjoy Chattopadhyay¹, Gaurav Bhatnagar¹•Institutions (1)

Indian Institute of Technology, Jodhpur¹

15 Oct 2018-

TL;DR: This work proposes an end to end framework (ASYSST) for textual description synthesis from digitized building floor plans and introduces a novel Bag of Decor feature to learn $5$ classes of a room from $1355$ samples under a supervised learning paradigm.

read less

Abstract: In an indoor scenario, the visually impaired do not have the information about the surroundings and finds it difficult to navigate from room to room. The sensor-based solutions are expensive and may not always be comfortable for the end users. In this paper, we focus on the problem of synthesis of textual description from a given floor plan image to assist the visually impaired. The textual description, in addition to a text reading software, can aid the visually impaired person while moving inside a building. In this work, for the first time, we propose an end to end framework (ASYSST) for textual description synthesis from digitized building floor plans. We have introduced a novel Bag of Decor (BoD) feature to learn $5$ classes of a room from $1355$ samples under a supervised learning paradigm. These learned labels are fed into a description synthesis framework to yield a holistic description of a floor plan image. Experimental analysis of real publicly available floor plan data-set proves the superiority of our framework.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

BRIDGE: Building Plan Repository for Image Description Generation, and Evaluation

[...]

Shreya Goyal¹, Vishesh Mistry¹, Chiranjoy Chattopadhyay¹, Gaurav Bhatnagar¹•Institutions (1)

Indian Institute of Technology, Jodhpur¹

01 Sep 2019

TL;DR: An extensive experimental study is presented for tasks like furniture localization in a floor plan, caption and description generation, on the proposed dataset showing the utility of BRIDGE.

...read moreread less

Abstract: In this paper, a large scale public dataset containing floor plan images and their annotations is presented. BRIDGE (Building plan Repository for Image Description Generation, and Evaluation) dataset contains more than 13000 images of the floor plan and annotations collected from various websites, as well as publicly available floor plan images in the research domain. The images in BRIDGE also has annotations for symbols, region graphs, and paragraph descriptions. The BRIDGE dataset will be useful for symbol spotting, caption and description generation, scene graph synthesis, retrieval and many other tasks involving building plan parsing. In this paper, we also present an extensive experimental study for tasks like furniture localization in a floor plan, caption and description generation, on the proposed dataset showing the utility of BRIDGE.

...read moreread less

11 citations

Cites methods from "ASYSST: A Framework for Synopsis Sy..."

...In [14], [15], authors have used handcrafted features for identifying decor symbol, room information and generating region wise caption generation....
[...]
...1) Template based: Paragraph based descriptions are generated by using technique proposed in [14]....
[...]

Extending the flexibility of case-based design support tools: A use case in the architectural domain

[...]

Viktor Ayzenshtadt¹, Viktor Ayzenshtadt², Christoph Langenhan³, Syed Saqib Bukhari², Klaus-Dieter Althoff¹, Klaus-Dieter Althoff², Frank Petzold³, Andreas Dengel⁴, Andreas Dengel² - Show less +5 more•Institutions (4)

University of Hildesheim¹, German Research Centre for Artificial Intelligence², Technische Universität München³, Kaiserslautern University of Technology⁴

01 Jan 2017

TL;DR: In this article, the authors present results of a user study into extending the functionality of an existing casebased search engine for similar architectural designs to a flexible process-oriented case-based support tool for the architectural conceptualization phase.

...read moreread less

Abstract: This paper presents results of a user study into extending the functionality of an existing case-based search engine for similar architectural designs to a flexible process-oriented case-based support tool for the architectural conceptualization phase. Based on a research examining the target group’s (architects) thinking and working processes during the early conceptualization phase (especially during the search for similar architectural references), we identified common features for defining retrieval strategies for a more flexible case-based search for similar building designs within our system. Furthermore, we were also able to infer a definition for implementing these strategies into the early conceptualization process in architecture, that is, to outline a definition for this process as a wrapping structure for a user model. The study was conducted among the target group representatives (architects, architecture students and teaching personnel) by means of applying the paper prototyping method and Business Processing Model and Notation (BPMN). The results of this work are intended as a foundation for our upcoming research, but we also think it could be of wider interest for the case-based design research area.

...read moreread less

6 citations

Journal Article•DOI•

Knowledge-driven description synthesis for floor plan interpretation

[...]

Shreya Goyal¹, Chiranjoy Chattopadhyay¹, Gaurav Bhatnagar¹•Institutions (1)

Indian Institute of Technology, Jodhpur¹

26 Apr 2021-International Journal on Document Analysis and Recognition

TL;DR: In this paper, the authors proposed two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images.

...read moreread less

Abstract: Image captioning is a widely known problem in the area of AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and providing architectural solutions. Several methods have been explored in the literature for generating captions or semi-structured descriptions from floor plan images. Since only the caption is insufficient to capture fine-grained details, researchers also proposed descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, making it difficult to use them in real-time scenarios. This paper offers two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images. These two models take advantage of modern deep neural networks for visual feature extraction and text generation. The difference between both models is in the way they take input from the floor plan image. The DSIC model takes only visual features automatically extracted by a deep neural network, while the TBDG model learns textual captions extracted from input floor plan images with paragraphs. The specific keywords generated in TBDG and understanding them with paragraphs make it more robust in a general floor plan image. Experiments were carried out on a large-scale publicly available dataset and compared with state-of-the-art techniques to show the proposed model’s superiority.

...read moreread less

4 citations

Posted Content•

Knowledge driven Description Synthesis for Floor Plan Interpretation

[...]

Shreya Goyal¹, Chiranjoy Chattopadhyay¹, Gaurav Bhatnagar¹•Institutions (1)

Indian Institute of Technology, Jodhpur¹

15 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors proposed two models, Description Synthesis from Image Cue (DSIC) and Transformer Based Description Generation (TBDG), for floor plan image to text generation to fill the gaps in existing methods.

...read moreread less

Abstract: Image captioning is a widely known problem in the area of AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and providing architectural solutions. Several methods have been explored in literature for generating captions or semi-structured descriptions from floor plan images. Since only the caption is insufficient to capture fine-grained details, researchers also proposed descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, making it difficult to use them in real-time scenarios. This paper offers two models, Description Synthesis from Image Cue (DSIC) and Transformer Based Description Generation (TBDG), for the floor plan image to text generation to fill the gaps in existing methods. These two models take advantage of modern deep neural networks for visual feature extraction and text generation. The difference between both models is in the way they take input from the floor plan image. The DSIC model takes only visual features automatically extracted by a deep neural network, while the TBDG model learns textual captions extracted from input floor plan images with paragraphs. The specific keywords generated in TBDG and understanding them with paragraphs make it more robust in a general floor plan image. Experiments were carried out on a large-scale publicly available dataset and compared with state-of-the-art techniques to show the proposed model's superiority.

...read moreread less

References

PDF

Open Access

More filters

Proceedings Article•

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

[...]

Kelvin Xu¹, Jimmy Ba², Ryan Kiros², Kyunghyun Cho¹, Aaron Courville¹, Ruslan Salakhudinov², Ruslan Salakhudinov³, Rich Zemel², Rich Zemel³, Yoshua Bengio¹, Yoshua Bengio³ - Show less +7 more•Institutions (3)

Université de Montréal¹, University of Toronto², Canadian Institute for Advanced Research³

06 Jul 2015

TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.

...read moreread less

Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr9k, Flickr30k and MS COCO.

...read moreread less

6,485 citations

"ASYSST: A Framework for Synopsis Sy..." refers background in this paper

...Due to introduction and growth of deep learning based approach, image synopsis/ description synthesis has become a very popular domain [2, 11, 23, 24]....
[...]

Posted Content•

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

[...]

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio - Show less +4 more

10 Feb 2015-arXiv: Learning

TL;DR: This paper proposed an attention-based model that automatically learns to describe the content of images by focusing on salient objects while generating corresponding words in the output sequence, which achieved state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

...read moreread less

5,896 citations

Proceedings Article•DOI•

Show and tell: A neural image caption generator

[...]

Oriol Vinyals¹, Alexander Toshev¹, Samy Bengio¹, Dumitru Erhan¹•Institutions (1)

Google¹

07 Jun 2015

TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to generate natural sentences describing an image, which can be used to automatically describe the content of an image.

...read moreread less

Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

...read moreread less

5,095 citations

Proceedings Article•DOI•

Deep visual-semantic alignments for generating image descriptions

[...]

Andrej Karpathy¹, Li Fei-Fei¹•Institutions (1)

Stanford University¹

07 Jun 2015

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

...read moreread less

Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

...read moreread less

3,996 citations

Posted Content•

Show and Tell: A Neural Image Caption Generator

[...]

Oriol Vinyals¹, Alexander Toshev¹, Samy Bengio¹, Dumitru Erhan¹•Institutions (1)

Google¹

17 Nov 2014-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

...read moreread less

3,426 citations

"ASYSST: A Framework for Synopsis Sy..." refers background in this paper

...Due to introduction and growth of deep learning based approach, image synopsis/ description synthesis has become a very popular domain [2, 11, 23, 24]....
[...]