Proceedings ArticleDOI

PicPose: Using Picture Posing for Localization Service on IoT Devices

TL;DR: A new picture-based localization service, PicPose, is presented. It extracts feature points from a camera-captured image and matches them against the original wall picture to compute the device pose, so even partially visible pictures can be used for localization, which is impossible for ArPico and ArUco.
Abstract: Device self-localization is an important capability for many IoT applications that require mobility in their service capabilities. In our previous work, we designed the ArPico method for robot indoor localization. By placing and recognizing pre-installed pictures on walls, robots can use low-cost cameras to identify their positions by referencing the pictures' precise locations. However, ArPico requires every picture to have a clear rectangular border for the pose computation, while some real-world pictures do not have clear, thick borders. Moreover, some pictures may have odd shapes or be only partially visible. To address these problems, a new picture-based localization service, PicPose, is presented. PicPose extracts feature points from the camera-captured image and matches them against those of the original wall picture to compute the pose. With PicPose, even partially visible pictures can be used for localization, which is impossible for ArPico and ArUco. We present our implementation and experimental results in this paper.
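To make the matching-then-pose step concrete, here is a minimal, hypothetical sketch in Python with OpenCV. It assumes feature matching has already produced 2D correspondences between points on the wall picture (whose physical size is known, so the points have metric coordinates on a z = 0 plane) and pixels in the camera frame; the function name and calibration inputs are illustrative assumptions, not the paper's code.

```python
import numpy as np
import cv2

def estimate_pose(pic_points_m, img_points_px, K, dist):
    """pic_points_m: Nx2 matched feature coordinates on the wall picture,
       in meters (the picture is planar, so z = 0 for every point).
       img_points_px: Nx2 corresponding pixel coordinates in the camera image.
       K, dist: camera intrinsic matrix and distortion from calibration."""
    object_points = np.hstack(
        [pic_points_m, np.zeros((len(pic_points_m), 1))]).astype(np.float32)
    # RANSAC tolerates the outliers left over from descriptor matching.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, img_points_px.astype(np.float32), K, dist)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # camera pose relative to the picture;
    return R, tvec               # combining it with the picture's known wall
                                 # location yields the device's position
```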
Citations
Proceedings ArticleDOI
01 Dec 2020
TL;DR: DynaScale, as discussed by the authors, is a general framework that integrates existing image detection and matching algorithms with constructed image pyramids for extended matching, and selects the best matching result from the image pairs at different scales.
Abstract: Localization is one of the fundamental technologies enabling location-aware services in smart cities. Image feature matching plays an essential role in visual localization for moving IoT devices navigating various scenes. Conventional matching pipelines have trouble finding an accurate transformation model when a pair of to-be-matched images has a large scale difference between the views of the objects of interest. The paper introduces DynaScale, a general framework that integrates existing image detection and matching algorithms with constructed image pyramids for extended matching, and selects the best matching result from the image pairs at different scales. We have designed an intelligent evaluation scheme for candidate transformation models based on several tests, including reasonable projections, the resembled size of region proposals, and similarities of bounded descriptors. The experimental results show that, on the selected datasets, DynaScale improves mean matching accuracy over existing methods by 1.9 times, from 24.32% to 45.91%, and produces about twice as many useful, correctly matched frames in a moving robot's video stream.
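The pyramid idea can be sketched briefly (this illustrates the concept, not DynaScale's implementation): match the query frame against several rescaled copies of the reference image and keep the scale whose RANSAC homography has the most inliers. The scale set, feature type, and thresholds below are assumptions.

```python
import numpy as np
import cv2

def match_across_scales(query, reference, scales=(0.25, 0.5, 1.0, 2.0)):
    orb = cv2.ORB_create(1000)
    kq, dq = orb.detectAndCompute(query, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    best_inliers, best_H = 0, None
    for s in scales:
        ref_s = cv2.resize(reference, None, fx=s, fy=s)  # one pyramid level
        kr, dr = orb.detectAndCompute(ref_s, None)
        if dq is None or dr is None:
            continue
        matches = matcher.match(dq, dr)
        if len(matches) < 4:                 # a homography needs 4+ points
            continue
        src = np.float32([kq[m.queryIdx].pt for m in matches])
        dst = np.float32([kr[m.trainIdx].pt for m in matches]) / s  # undo scaling
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        if H is not None and int(mask.sum()) > best_inliers:
            best_inliers, best_H = int(mask.sum()), H
    return best_H, best_inliers              # best model across all scales
```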
References
Book ChapterDOI
08 Oct 2016
TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location; its single-network design makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has accuracy competitive with methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for \(512 \times 512\) input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.

19,543 citations


"PicPose: Using Picture Posing for L..." refers background or methods in this paper

  • ...In PicPose, an object detection architecture based on the Single Shot MultiBox Detector (SSD) deep neural network [3] is employed for detection....

  • ...Another issue with using SSD as the first step of real-time localization is the object detection speed....

  • ...The total processing time is longer due to the picture object detection overhead on streaming videos using SSD [3], but still within an acceptable range (about 100 ms) to support real-time robot localization....

  • ...The deep learning framework we use for training is Caffe, in which MobileNets-SSD is implemented with a weight model pre-trained on VOC0712....

  • ...At runtime, when a camera sends in a picture, the captured object ID produced by the SSD detector is used to retrieve the feature points from the ground truth library (see the detection sketch after this list)....

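The detection step in these excerpts can be illustrated with a hedged sketch: OpenCV's DNN module loading a Caffe MobileNet-SSD (VOC0712 weights, as mentioned above) and parsing its standard output layout. The file names are placeholders, and this is not PicPose's actual code.

```python
import cv2

# Placeholder file names for the Caffe prototxt and the VOC0712-trained weights.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect_pictures(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    # MobileNet-SSD's standard 300x300 input with its usual scale and mean.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()             # shape (1, 1, N, 7)
    results = []
    for i in range(detections.shape[2]):
        _, class_id, conf, x1, y1, x2, y2 = detections[0, 0, i]
        if conf > conf_threshold:
            box = (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))
            # The class ID is what would index the ground truth library.
            results.append((int(class_id), float(conf), box))
    return results
```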

Posted Content
TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
Abstract: We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right-sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy trade-offs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
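The depth-wise separable convolution the abstract refers to factors a standard convolution into a per-channel spatial convolution followed by a 1x1 channel-mixing convolution. A generic sketch of the block (shown in PyTorch purely for illustration; not the paper's code):

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution that mixes information across channels.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Compared with a standard 3x3 convolution, this factorization reduces the multiply-add count by roughly a factor of 1/out_ch + 1/9, which is where MobileNets' latency savings come from.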

14,406 citations


"PicPose: Using Picture Posing for L..." refers methods in this paper

  • ...To provide the localization service in real time, a lightweight neural network model called MobileNets [19], built on a streamlined architecture of depthwise separable convolutions, is used as the base detector network in SSD....

  • ...The deep learning framework we use for training is Caffe, in which MobileNets-SSD is implemented with a weight model pre-trained on VOC0712....

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments that ORB is two orders of magnitude faster than SIFT while performing as well in many situations.
Abstract: Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments that ORB is two orders of magnitude faster than SIFT while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch tracking on a smartphone.
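ORB extraction and binary matching, as used in the excerpts below, follow a standard OpenCV pipeline; a short sketch with placeholder image paths and an assumed ratio-test threshold:

```python
import cv2

img1 = cv2.imread("wall_picture.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)   # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# ORB descriptors are binary strings, so Hamming distance is the right metric.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe-style ratio test to discard ambiguous matches.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```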

8,702 citations


"PicPose: Using Picture Posing for L..." refers methods in this paper

  • ...Meanwhile, the feature points of the captured picture are extracted by ORB using the same configuration and parameters as those used for building the ground truth library....

  • ...Our internal feature point extraction model is based on the Oriented FAST and Rotated BRIEF (ORB) [20] feature detector....

  • ...First, the features of original pictures can be extracted offline and stored in an ORB ground truth library....

  • ...However, in ORB the position of a feature point is determined before its descriptor is calculated, so it is difficult to guarantee that a descriptor is unique and distinctive within a picture, especially within a small area containing similar feature points....

  • ...Each feature point records its coordinate in the picture and an ORB feature descriptor (see the library sketch after this list)....

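A hypothetical sketch of the ground truth library these excerpts describe: for each picture's object ID, the ORB keypoint coordinates and descriptors are extracted offline and stored, then retrieved at runtime by the SSD-detected ID. The data layout is an assumption for illustration.

```python
import numpy as np
import cv2

def build_library(pictures):               # pictures: {object_id: image array}
    orb = cv2.ORB_create(500)
    lib = {}
    for obj_id, img in pictures.items():
        kps, des = orb.detectAndCompute(img, None)
        coords = np.float32([kp.pt for kp in kps])   # coordinate in the picture
        lib[obj_id] = (coords, des)                  # plus its ORB descriptor
    return lib

# At runtime, after SSD detects a picture:
#   coords, des = lib[detected_object_id]
# and `des` is matched against the captured frame's descriptors.
```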

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work revisits the global average pooling layer proposed in [13] and sheds light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels.
Abstract: In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means of regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite being trained only to solve a classification task.
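The mechanism, weighting the final convolutional feature maps by the classifier weights that follow global average pooling, can be sketched compactly. The ResNet-18 backbone here is an illustrative stand-in (any GAP-plus-linear classifier works), not the network from the paper.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Capture the final conv feature maps (before global average pooling).
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o.detach()))

x = torch.randn(1, 3, 224, 224)            # stand-in for a normalized image
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(1).item()

# Class activation map: classifier weights of the predicted class applied
# as a weighted sum over the 512 feature map channels.
w = model.fc.weight[cls]                    # shape (512,)
cam = torch.einsum("c,chw->hw", w, feats["maps"][0])
cam = torch.relu(cam)                       # keep positively contributing regions
cam = cam / (cam.max() + 1e-8)              # normalize to [0, 1] for display
```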

5,978 citations


"PicPose: Using Picture Posing for L..." refers methods in this paper

  • ...Another recently active area is deep neural networks (DNNs), which have been used to detect objects [17], [18] with enhanced detection precision....
