Journal ArticleDOI

Toward Knowledge as a Service Over Networks: A Deep Learning Model Communication Paradigm

TL;DR: This paper presents a deep learning model communication paradigm based on multiple-model compression, which exploits the redundancy among deep learning models deployed in different application scenarios. The potential and promise of this compression strategy for deep learning model communication are analyzed and demonstrated through a set of experiments.
Abstract: The advent of artificial intelligence and the Internet of Things has driven the transition from big data to big knowledge. Deep learning models, which assimilate knowledge from large-scale data, can be regarded as an alternative but promising modality of knowledge for artificial intelligence services. Yet the compression, storage, and communication of deep learning models towards better knowledge services, especially over networks, pose a set of challenging problems in both industry and academia. This paper presents a deep learning model communication paradigm based on multiple-model compression, which greatly exploits the redundancy among multiple deep learning models in different application scenarios. We analyze the potential and demonstrate the promise of this compression strategy for deep learning model communication through a set of experiments. Moreover, interoperability in deep learning model communication, enabled by the standardization of compact deep learning model representations, is also discussed and envisioned.
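The multiple-model compression idea can be illustrated with a toy sketch. This is an illustrative assumption, not the paper's actual algorithm: several models fine-tuned from one pretrained network share most of their weights, so a shared base can be transmitted once and each model reduced to a coarsely quantized residual.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): exploit redundancy among
# multiple deep learning models by transmitting a shared "base" weight
# tensor once, plus a quantized per-model residual (delta).

def compress_deltas(models, step=0.05):
    """models: list of weight arrays with identical shape."""
    base = np.mean(models, axis=0)                          # shared part, sent once
    deltas = [np.round((w - base) / step) for w in models]  # coarse quantization
    return base, deltas, step

def reconstruct(base, delta, step):
    return base + delta * step

rng = np.random.default_rng(0)
shared = rng.normal(size=1000)
# Two fine-tuned variants of one pretrained model differ only slightly.
w1 = shared + 0.01 * rng.normal(size=1000)
w2 = shared + 0.01 * rng.normal(size=1000)

base, deltas, step = compress_deltas([w1, w2])
err = np.max(np.abs(reconstruct(base, deltas[0], step) - w1))
print(err <= step / 2 + 1e-12)  # quantization error bounded by half a step
```

Because the residuals are small and concentrated near zero, they compress far better than the full weight tensors would individually, which is the intuition behind exploiting cross-model redundancy.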
Citations
Journal ArticleDOI
Ling-Yu Duan1, Jiaying Liu1, Wenhan Yang1, Tiejun Huang1, Wen Gao1 
TL;DR: In this paper, a new area, Video Coding for Machines (VCM), is proposed to bridge the gap between feature coding for machine vision and video coding for human vision, and the preliminary results have demonstrated the performance and efficiency gains.
Abstract: Video coding, which aims to compress and reconstruct whole frames, and feature compression, which preserves and transmits only the most critical information, stand at two ends of a scale: one offers compactness and efficiency to serve machine vision, while the other retains full fidelity, bowing to human perception. Recent endeavors in imminent trends of video compression, e.g., deep-learning-based coding tools and end-to-end image/video coding, and MPEG-7 compact feature descriptor standards, i.e., Compact Descriptors for Visual Search and Compact Descriptors for Video Analysis, promote sustainable and fast development in their respective directions. In this article, thanks to booming AI technology, e.g., prediction and generation models, we carry out exploration in the new area of Video Coding for Machines (VCM), arising from the emerging MPEG standardization efforts.1 Towards collaborative compression and intelligent analytics, VCM attempts to bridge the gap between feature coding for machine vision and video coding for human vision. Aligning with the rising Analyze-then-Compress instance, Digital Retina, the definition, formulation, and paradigm of VCM are given first. Meanwhile, we systematically review state-of-the-art techniques in video compression and feature compression from the unique perspective of MPEG standardization, which provides the academic and industrial evidence to realize the collaborative compression of video and feature streams in a broad range of AI applications. Finally, we come up with potential VCM solutions, and the preliminary results have demonstrated the performance and efficiency gains. Further directions are discussed as well. 1 https://lists.aau.at/mailman/listinfo/mpeg-vcm

113 citations

Proceedings ArticleDOI
Yuanzhi Liang1, Yalong Bai, Wei Zhang, Xueming Qian1, Li Zhu1, Tao Mei 
01 Feb 2019
TL;DR: The authors propose a new scene graph dataset named Visually-Relevant Relationships Dataset (VrR-VG), based on Visual Genome and constructed by automatically pruning visually-irrelevant relationships.
Abstract: Relationships encode the interactions among individual instances and play a critical role in deep visual scene understanding. Suffering from high predictability given non-visual information, relationship models tend to fit the statistical bias rather than "learning" to infer the relationships from images. To encourage further development in visual relationships, we propose a novel method to mine more valuable relationships by automatically pruning visually-irrelevant relationships. We construct a new scene graph dataset named Visually-Relevant Relationships Dataset (VrR-VG) based on Visual Genome. Compared with existing datasets, the performance gap between learnable and statistical methods is more significant in VrR-VG, and frequency-based analysis no longer works. Moreover, we propose to learn a relationship-aware representation by jointly considering instances, attributes, and relationships. By applying the representation-aware feature learned on VrR-VG, the performances of image captioning and visual question answering are systematically improved, which demonstrates the effectiveness of both our dataset and the feature embedding schema. Both our VrR-VG dataset and representation-aware features will be made publicly available soon.

42 citations

Posted Content
Yuanzhi Liang1, Yalong Bai, Wei Zhang, Xueming Qian1, Li Zhu1, Tao Mei 
TL;DR: By applying the representation-aware feature learned on VrR-VG, the performances of image captioning and visual question answering are systematically improved, which demonstrates the effectiveness of both the dataset and features embedding schema.
Abstract: Relationships encode the interactions among individual instances and play a critical role in deep visual scene understanding. Suffering from high predictability given non-visual information, existing methods tend to fit the statistical bias rather than "learning" to "infer" the relationships from images. To encourage further development in visual relationships, we propose a novel method to automatically mine more valuable relationships by pruning visually-irrelevant ones. We construct a new scene-graph dataset named Visually-Relevant Relationships Dataset (VrR-VG) based on Visual Genome. Compared with existing datasets, the performance gap between learnable and statistical methods is more significant in VrR-VG, and frequency-based analysis no longer works. Moreover, we propose to learn a relationship-aware representation by jointly considering instances, attributes, and relationships. By applying the representation-aware feature learned on VrR-VG, the performances of image captioning and visual question answering are systematically improved by a large margin, which demonstrates the gain from our dataset and the feature embedding schema. VrR-VG is available via this http URL.

23 citations


Cites background from "Toward Knowledge as a Service Over ..."

  • ...Representation Learning: Numerous deep learning methods have been proposed for representation learning with various knowledge [31, 22, 5, 30]....


Journal ArticleDOI
TL;DR: A vision towards a novel visual computing framework, termed digital retina, which aligns high-efficiency sensing models with the emerging Video Coding for Machines (VCM) paradigm, is articulated; extensive experiments validate that it effectively supports video big data analysis and retrieval in the intelligent city system.
Abstract: The ubiquitous camera networks in the city brain system grow at a rapid pace, creating massive amounts of images and videos at a range of spatial-temporal scales and thereby forming the "biggest" big data. However, the sensing system often lags behind the construction of the fast-growing city brain system, in the sense that such exponentially growing data far exceed today's sensing capabilities. Therefore, critical issues arise regarding how to better leverage the existing city brain system and significantly improve city-scale performance in intelligent applications. To tackle these unprecedented challenges, we articulate a vision towards a novel visual computing framework, termed digital retina, which aligns high-efficiency sensing models with the emerging Video Coding for Machines (VCM) paradigm. In particular, the digital retina may consist of video coding, feature coding, model coding, and their joint optimization. The digital retina is biologically inspired, rooted in the widely accepted view that the retina encodes visual information for human perception, while downstream brain areas extract features to disentangle visual objects. Within the digital retina framework, three streams, i.e., the video stream, feature stream, and model stream, work collaboratively over the end-edge-cloud platform. In particular, the compressed video stream serves human vision, the compact feature stream targets machine vision, and the model stream incrementally updates deep learning models to improve the performance of human/machine vision tasks. We have developed a prototype to demonstrate the technical advantages of the digital retina, and extensive experiments have been conducted to validate that it can effectively support video big data analysis and retrieval in the intelligent city system.
In particular, a compression ratio of up to 7000× can be realized for visual data compression while maintaining performance competitive with the pristine signal in a series of visual analysis tasks.

19 citations

Posted Content
Ling-Yu Duan1, Jiaying Liu1, Wenhan Yang1, Tiejun Huang1, Wen Gao1 
TL;DR: The definition, formulation, and paradigm of VCM are given; state-of-the-art techniques in video compression and feature compression are systematically reviewed from the unique perspective of MPEG standardization; and preliminary results demonstrate performance and efficiency gains.
Abstract: Video coding, which aims to compress and reconstruct whole frames, and feature compression, which preserves and transmits only the most critical information, stand at two ends of a scale: one offers compactness and efficiency to serve machine vision, while the other retains full fidelity, bowing to human perception. Recent endeavors in imminent trends of video compression, e.g., deep-learning-based coding tools and end-to-end image/video coding, and MPEG-7 compact feature descriptor standards, i.e., Compact Descriptors for Visual Search and Compact Descriptors for Video Analysis, promote sustainable and fast development in their respective directions. In this paper, thanks to booming AI technology, e.g., prediction and generation models, we carry out exploration in the new area of Video Coding for Machines (VCM), arising from the emerging MPEG standardization efforts. Towards collaborative compression and intelligent analytics, VCM attempts to bridge the gap between feature coding for machine vision and video coding for human vision. Aligning with the rising Analyze-then-Compress instance, Digital Retina, the definition, formulation, and paradigm of VCM are given first. Meanwhile, we systematically review state-of-the-art techniques in video compression and feature compression from the unique perspective of MPEG standardization, which provides the academic and industrial evidence to realize the collaborative compression of video and feature streams in a broad range of AI applications. Finally, we come up with potential VCM solutions, and the preliminary results have demonstrated the performance and efficiency gains. Further directions are discussed as well.

11 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40], but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
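The residual reformulation described above can be sketched in a few lines. This is a minimal NumPy illustration with dense layers standing in for the paper's convolutional layers; the zero initialization is only to show why identity shortcuts help optimization.

```python
import numpy as np

# Minimal sketch of the residual reformulation: instead of learning a target
# mapping H(x) directly, a block learns the residual F(x) = H(x) - x and
# outputs F(x) + x via an identity shortcut.

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    out = relu(x @ w1)   # first "layer" of the residual function F
    out = out @ w2       # second layer (no activation before the add)
    return relu(out + x) # identity shortcut: F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Zero residual weights make the block an identity mapping, which is why
# very deep stacks of such blocks remain easy to optimize at the start.
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
print(np.allclose(residual_block(x, w1, w2), relu(x)))  # True
```

The key property is that a block can "do nothing" by driving F towards zero, so adding depth cannot easily hurt, whereas an unreferenced stack must learn the identity explicitly.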

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large deep convolutional neural network, consisting of five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
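The dropout regularizer the abstract mentions can be sketched as follows. Note this shows the common "inverted" variant that rescales during training (an assumption for simplicity); the original recipe applied plain dropout at p = 0.5 and rescaled activations at test time instead.

```python
import numpy as np

# Sketch of inverted dropout: randomly zero each unit with probability p
# during training and rescale the survivors so the expected activation is
# unchanged, which lets inference run without any modification.

def dropout(x, p=0.5, training=True, rng=None):
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)       # rescale to preserve the expectation

rng = np.random.default_rng(0)
x = np.ones((10000,))
y = dropout(x, p=0.5, rng=rng)
print(abs(y.mean() - 1.0) < 0.05)  # expected activation preserved
```

Because each unit must work with random subsets of its neighbors, dropout discourages co-adaptation of features, which is why it reduces overfitting in the large fully-connected layers.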

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
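The parameter arithmetic behind the small-filter design above is easy to verify. In this sketch the channel count C is an illustrative assumption (equal input and output channels, biases ignored): a stack of two 3x3 layers covers the same 5x5 receptive field as a single 5x5 layer but with fewer parameters and an extra non-linearity.

```python
# Parameter count of a k x k convolution with c_in input and c_out output
# channels (biases ignored for simplicity).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 64                                      # illustrative channel count
one_5x5 = conv_params(5, C, C)              # 25 * C^2 parameters
two_3x3 = 2 * conv_params(3, C, C)          # 18 * C^2 parameters, same 5x5 RF
print(one_5x5, two_3x3, two_3x3 < one_5x5)  # 102400 73728 True
```

The same trade holds at any channel width: three stacked 3x3 layers likewise replace a 7x7 layer (27C² vs. 49C² parameters), which is what makes depths of 16-19 weight layers affordable.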

49,914 citations

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception, a deep convolutional neural network architecture, achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
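The "constant computational budget" claim rests largely on 1x1 dimension-reduction convolutions placed before expensive larger filters. The channel sizes below are illustrative assumptions, not GoogLeNet's actual configuration; the arithmetic shows why the bottleneck is cheaper.

```python
# Parameter count of a k x k convolution (biases ignored).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

c_in, c_mid, c_out = 256, 64, 256           # illustrative channel sizes

# Direct 5x5 convolution on the full channel depth.
direct = conv_params(5, c_in, c_out)
# 1x1 reduction to c_mid channels, then the 5x5 on the reduced depth.
bottleneck = conv_params(1, c_in, c_mid) + conv_params(5, c_mid, c_out)

print(direct, bottleneck, bottleneck < direct)  # 1638400 425984 True
```

This is the design choice that lets the network grow in both depth and width (multiple parallel filter sizes per module) while keeping the multiply-accumulate budget roughly fixed.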

40,257 citations