
Showing papers by "Naver Corporation" published in 2018


Proceedings ArticleDOI
18 Jun 2018
TL;DR: StarGAN, as discussed by the authors, proposes a unified model architecture to perform image-to-image translation for multiple domains using only a single model, which leads to superior quality of translated images compared to existing models as well as the capability of flexibly translating an input image to any desired target domain.
Abstract: Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN's superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on facial attribute transfer and facial expression synthesis tasks.
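
The abstract's key idea, a single generator shared across all domains, comes down to conditioning the network on a target-domain label. Below is a minimal PyTorch sketch of that conditioning step; the layer sizes and the label-tiling scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of StarGAN-style domain conditioning: the generator G(x, c)
# receives the image together with a spatially tiled target-domain label.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, img_channels=3, num_domains=5):
        super().__init__()
        # The label contributes `num_domains` extra input channels.
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + num_domains, 64, 7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, img_channels, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, x, c):
        # c: (batch, num_domains) one-hot target label.
        # Replicate it spatially and concatenate with the image channels.
        c_map = c.view(c.size(0), c.size(1), 1, 1).expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, c_map], dim=1))

g = ConditionalGenerator()
fake = g(torch.randn(2, 3, 128, 128), torch.eye(5)[:2])  # translate to domains 0 and 1
```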

2,479 citations


Journal ArticleDOI
TL;DR: The core opportunities and risks of AI for society are introduced; a synthesis of five ethical principles that should undergird its development and adoption is presented; and 20 concrete recommendations are offered to serve as a firm foundation for the establishment of a Good AI Society.
Abstract: This article reports the findings of AI4People, an Atomium—EISMD initiative designed to lay the foundations for a “Good AI Society”. We introduce the core opportunities and risks of AI for society; present a synthesis of five ethical principles that should undergird its development and adoption; and offer 20 concrete recommendations—to assess, to develop, to incentivise, and to support good AI—which in some cases may be undertaken directly by national or supranational policy makers, while in others may be led by other stakeholders. If adopted, these recommendations would serve as a firm foundation for the establishment of a Good AI Society.

855 citations


Proceedings Article
21 May 2018
TL;DR: BAN is proposed, which finds bilinear attention distributions to utilize given vision-language information seamlessly; the model is evaluated quantitatively and qualitatively on the visual question answering and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
Abstract: Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
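
As a reading aid, here is a hedged PyTorch sketch of one bilinear attention map built from low-rank bilinear pooling as the abstract describes; the hidden size, nonlinearity, and normalization are assumptions rather than the authors' exact configuration.

```python
# A single bilinear attention map: every (object, word) pair interacts via
# an elementwise product in a low-rank projected space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    def __init__(self, d_x, d_y, d_hid=512):
        super().__init__()
        self.U = nn.Linear(d_x, d_hid)   # projects visual channels
        self.V = nn.Linear(d_y, d_hid)   # projects question channels
        self.p = nn.Linear(d_hid, 1)     # low-rank "pooling" vector

    def forward(self, X, Y):
        # X: (B, n_x, d_x) object features, Y: (B, n_y, d_y) word features.
        Xp = torch.relu(self.U(X))                 # (B, n_x, h)
        Yp = torch.relu(self.V(Y))                 # (B, n_y, h)
        joint = Xp.unsqueeze(2) * Yp.unsqueeze(1)  # (B, n_x, n_y, h)
        logits = self.p(joint).squeeze(-1)         # (B, n_x, n_y)
        # Normalize over all pairs so the map is one distribution.
        return F.softmax(logits.flatten(1), dim=1).view_as(logits)

att = BilinearAttention(2048, 300)
A = att(torch.randn(2, 36, 2048), torch.randn(2, 14, 300))  # (2, 36, 14)
```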

384 citations


Posted Content
TL;DR: This article proposes bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly; the motivating problem is that the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive.
Abstract: Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.

281 citations


Posted Content
TL;DR: This paper presents a unified, general-purpose model of fuzzing together with a taxonomy of the current fuzzing literature, and methodically explores the design decisions at every stage of the model fuzzer by surveying the related literature and innovations in the art, science, and engineering that make modern-day fuzzers effective.
Abstract: Among the many software vulnerability discovery techniques available today, fuzzing has remained highly popular due to its conceptual simplicity, its low barrier to deployment, and its vast amount of empirical evidence in discovering real-world software vulnerabilities. At a high level, fuzzing refers to a process of repeatedly running a program with generated inputs that may be syntactically or semantically malformed. While researchers and practitioners alike have invested a large and diverse effort towards improving fuzzing in recent years, this surge of work has also made it difficult to gain a comprehensive and coherent view of fuzzing. To help preserve and bring coherence to the vast literature of fuzzing, this paper presents a unified, general-purpose model of fuzzing together with a taxonomy of the current fuzzing literature. We methodically explore the design decisions at every stage of our model fuzzer by surveying the related literature and innovations in the art, science, and engineering that make modern-day fuzzers effective.

180 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: CAPE, the first content-aware POI embedding model, which utilizes text content that provides information about the characteristics of a POI, is proposed, and a large-scale POI dataset is constructed to validate it.
Abstract: Recommending a point-of-interest (POI) a user will visit next based on temporal and spatial context information is an important task in mobile-based applications. Recently, several POI recommendation models based on conventional sequential-data modeling approaches have been proposed. However, such models focus on only a user's check-in sequence information and the physical distance between POIs. Furthermore, they do not utilize the characteristics of POIs or the relationships between POIs. To address this problem, we propose CAPE, the first content-aware POI embedding model which utilizes text content that provides information about the characteristics of a POI. CAPE consists of a check-in context layer and a text content layer. The check-in context layer captures the geographical influence of POIs from the check-in sequence of a user, while the text content layer captures the characteristics of POIs from the text content. To validate the efficacy of CAPE, we constructed a large-scale POI dataset. In the experimental evaluation, we show that the performance of the existing POI recommendation models can be significantly improved by simply applying CAPE to the models.
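
The abstract's two-layer design (a check-in context layer plus a text content layer) can be pictured as two embedding tables whose outputs are combined per POI. The sketch below is a simplified assumption of that structure, not the paper's training procedure.

```python
# Two-part POI representation: a learned check-in context embedding
# concatenated with a mean-of-words text content embedding.
import torch
import torch.nn as nn

class CapeStyleEmbedding(nn.Module):
    def __init__(self, num_pois, vocab_size, dim=128):
        super().__init__()
        self.checkin = nn.Embedding(num_pois, dim)    # check-in context layer
        self.word = nn.EmbeddingBag(vocab_size, dim)  # text content layer (mean of words)

    def poi_vector(self, poi_ids, text_token_ids, offsets):
        # Final representation: check-in context + text content, concatenated.
        return torch.cat([self.checkin(poi_ids),
                          self.word(text_token_ids, offsets)], dim=-1)

model = CapeStyleEmbedding(num_pois=1000, vocab_size=5000)
vec = model.poi_vector(torch.tensor([42]), torch.tensor([1, 7, 99]), torch.tensor([0]))
print(vec.shape)  # torch.Size([1, 256])
```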

171 citations


Posted Content
TL;DR: In this paper, a knowledge transfer method via distillation of activation boundaries formed by hidden neurons is proposed, where the student learns a separating boundary between activation region and deactivation region formed by each neuron in the teacher.
Abstract: An activation boundary for a neuron refers to a separating hyperplane that determines whether the neuron is activated or deactivated. It has long been considered in neural networks that the activations of neurons, rather than their exact output values, play the most important role in forming classification-friendly partitions of the hidden feature space. However, as far as we know, this aspect of neural networks has not been considered in the literature on knowledge transfer. In this paper, we propose a knowledge transfer method via distillation of the activation boundaries formed by hidden neurons. For the distillation, we propose an activation transfer loss that has its minimum value when the boundaries generated by the student coincide with those of the teacher. Since the activation transfer loss is not differentiable, we design a piecewise differentiable loss approximating it. With the proposed method, the student learns a separating boundary between the activation region and the deactivation region formed by each neuron in the teacher. Through experiments on various aspects of knowledge transfer, it is verified that the proposed method outperforms the current state-of-the-art.
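
A hedged sketch of the idea: where a teacher neuron is active, the student's pre-activation is pushed above a margin, and below a negative margin otherwise, giving a piecewise differentiable surrogate. The margin and the squaring are assumptions; this paraphrases the loss rather than reproducing the authors' exact formulation.

```python
# Margin-based surrogate for matching the teacher's activation pattern.
import torch

def activation_boundary_loss(student_pre, teacher_pre, margin=1.0):
    # student_pre, teacher_pre: (batch, neurons) pre-activation values.
    active = (teacher_pre > 0).float()           # teacher's activation pattern
    hinge_on = torch.relu(margin - student_pre)  # penalize if not clearly active
    hinge_off = torch.relu(margin + student_pre) # penalize if not clearly inactive
    return ((active * hinge_on + (1 - active) * hinge_off) ** 2).mean()

loss = activation_boundary_loss(torch.randn(8, 256), torch.randn(8, 256))
```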

155 citations


Proceedings ArticleDOI
17 Oct 2018
TL;DR: This paper proposes a novel GAN-based collaborative filtering (CF) framework to provide higher accuracy in recommendation, and validates that the vector-wise adversarial training employed in CFGAN is effective in solving the problem of existing GAN-based CF methods.
Abstract: Generative Adversarial Networks (GAN) have achieved great success in various domains such as image generation, music generation, and natural language generation. In this paper, we propose a novel GAN-based collaborative filtering (CF) framework to provide higher accuracy in recommendation. We first identify a fundamental problem of existing GAN-based methods in CF and highlight it quantitatively via a series of experiments. Next, we suggest a new direction of vector-wise adversarial training to solve the problem and propose our GAN-based CF framework, called CFGAN, based on the direction. We identify a unique challenge that arises when vector-wise adversarial training is employed in CF. We then propose three CF methods realized on top of our CFGAN that are able to address the challenge. Finally, via extensive experiments on real-world datasets, we validate that the vector-wise adversarial training employed in CFGAN is indeed effective in solving the problem of existing GAN-based CF methods. Furthermore, we demonstrate that our proposed CF methods on CFGAN provide recommendation accuracy consistently and universally higher than those of the state-of-the-art recommenders.
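
To make "vector-wise adversarial training" concrete, the sketch below has the generator emit an entire per-user item vector and the discriminator judge whole vectors rather than sampled item indices; the network sizes and the masking-by-history trick are simplified assumptions.

```python
# Vector-wise adversarial training for CF: whole item vectors, not samples.
import torch
import torch.nn as nn

num_items = 1000
G = nn.Sequential(nn.Linear(num_items, 256), nn.ReLU(),
                  nn.Linear(256, num_items), nn.Sigmoid())
D = nn.Sequential(nn.Linear(num_items, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

real = (torch.rand(32, num_items) > 0.95).float()  # observed purchase vectors
mask = real                                         # condition on each user's history
fake = G(mask) * mask                               # generated vector, masked to observed slots

d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
g_loss = bce(D(fake), torch.ones(32, 1))            # generator tries to fool D
```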

149 citations


Proceedings ArticleDOI
17 Dec 2018
TL;DR: A new context-aware correlation-filter-based tracking framework is proposed to achieve both high computational speed and state-of-the-art performance among real-time trackers; extrinsic denoising processes and a new orthogonality loss term are introduced for pre-training and fine-tuning of the expert auto-encoders.
Abstract: We propose a new context-aware correlation filter based tracking framework to achieve both high computational speed and state-of-the-art performance among real-time trackers. The major contribution to the high computational speed lies in the proposed deep feature compression that is achieved by a context-aware scheme utilizing multiple expert auto-encoders; a context in our framework refers to the coarse category of the tracking target according to appearance patterns. In the pre-training phase, one expert auto-encoder is trained per category. In the tracking phase, the best expert auto-encoder is selected for a given target, and only this auto-encoder is used. To achieve high tracking performance with the compressed feature map, we introduce extrinsic denoising processes and a new orthogonality loss term for pre-training and fine-tuning of the expert auto-encoders. We validate the proposed context-aware framework through a number of experiments, where our method achieves a comparable performance to state-of-the-art trackers which cannot run in real-time, while running at a high speed of over 100 fps.
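
The expert-selection step the abstract describes can be sketched as picking the auto-encoder with the lowest reconstruction error on the target's feature map, then using only that expert; the auto-encoder structure below is an illustrative assumption.

```python
# Select the per-category expert auto-encoder that best reconstructs the target.
import torch
import torch.nn as nn

def make_expert(c=512, code=64):
    # 1x1-conv bottleneck standing in for an expert auto-encoder.
    return nn.Sequential(nn.Conv2d(c, code, 1), nn.ReLU(), nn.Conv2d(code, c, 1))

experts = [make_expert() for _ in range(4)]  # one expert per context category

def select_expert(feature_map):
    # feature_map: (1, C, H, W) deep feature of the tracking target.
    errors = [((e(feature_map) - feature_map) ** 2).mean().item() for e in experts]
    best = min(range(len(experts)), key=errors.__getitem__)
    return experts[best]

expert = select_expert(torch.randn(1, 512, 7, 7))
```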

133 citations


Posted Content
TL;DR: The authors propose a densely-connected co-attentive recurrent neural network (C-RNN), each layer of which uses concatenated information of attentive features as well as hidden features of all the preceding recurrent layers.
Abstract: Sentence matching is widely used in various natural language tasks such as natural language inference, paraphrase identification, and question answering. For these tasks, understanding the logical and semantic relationship between two sentences is required, but it remains challenging. Although the attention mechanism is useful to capture the semantic relationship and to properly align the elements of two sentences, previous attention mechanisms simply use a summation operation which does not sufficiently retain the original features. Inspired by DenseNet, a densely connected convolutional network, we propose a densely-connected co-attentive recurrent neural network, each layer of which uses concatenated information of attentive features as well as hidden features of all the preceding recurrent layers. It enables preserving the original and the co-attentive feature information from the bottommost word embedding layer to the uppermost recurrent layer. To alleviate the problem of an ever-increasing feature vector size due to dense concatenation operations, we also propose using an autoencoder after dense concatenation. We evaluate our proposed architecture on highly competitive benchmark datasets related to sentence matching. Experimental results show that our architecture, which retains recurrent and attentive features, achieves state-of-the-art performance for most of the tasks.
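
A minimal sketch of the dense connectivity described above: each recurrent layer consumes the concatenation of all preceding hidden features plus co-attentive features, and a linear bottleneck stands in for the autoencoder that caps the growing width. The dimensions and the dot-product attention are assumptions.

```python
# Densely-connected co-attentive recurrent stack (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_attention(a, b):
    # Align b to a with simple dot-product attention: (B, La, d), (B, Lb, d).
    scores = torch.bmm(a, b.transpose(1, 2))        # (B, La, Lb)
    return torch.bmm(F.softmax(scores, dim=-1), b)  # attentive features for a

d = 100
# Layer i input width grows with dense concatenation: (i + 2) * d.
rnns = nn.ModuleList([nn.GRU(d * (i + 2), d, batch_first=True) for i in range(3)])
bottleneck = nn.Linear(d * 4, d)  # the autoencoder's encoder half

a, b = torch.randn(4, 20, d), torch.randn(4, 20, d)
feats = [a]
for rnn in rnns:
    att = co_attention(feats[-1], b)       # co-attentive features
    x = torch.cat(feats + [att], dim=-1)   # dense concatenation
    h, _ = rnn(x)
    feats.append(h)

compressed = bottleneck(torch.cat(feats, dim=-1))  # fight the width blow-up
```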

107 citations


Book ChapterDOI
08 Sep 2018
TL;DR: A novel approach is proposed to generate multiple color palettes that reflect the semantics of input text and then colorize a given grayscale image according to the generated palette, using a manually curated dataset called Palette-and-Text (PAT).
Abstract: This paper proposes a novel approach to generate multiple color palettes that reflect the semantics of input text and then colorize a given grayscale image according to the generated color palette. In contrast to existing approaches, our model can understand rich text, whether it is a single word, a phrase, or a sentence, and generate multiple possible palettes from it. For this task, we introduce our manually curated dataset called Palette-and-Text (PAT). Our proposed model called Text2Colors consists of two conditional generative adversarial networks: the text-to-palette generation networks and the palette-based colorization networks. The former captures the semantics of the text input and produces relevant color palettes. The latter colorizes a grayscale image using the generated color palette. Our evaluation results show that people preferred our generated palettes over ground truth palettes and that our model can effectively reflect the given palette when colorizing an image.
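
The two-stage pipeline (text to palette, then palette-based colorization) can be sketched as two chained conditional networks; the stand-in modules below only illustrate the data flow, not the paper's cGAN architectures.

```python
# Chained stages: text embedding -> palette -> palette-conditioned colorization.
import torch
import torch.nn as nn

palette_gen = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 5 * 3))

class Colorizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1 + 15, 3, 3, padding=1)  # gray image + tiled palette -> RGB

    def forward(self, gray, palette):
        # Tile the 5-color palette over the image and concatenate as channels.
        p = palette.view(palette.size(0), -1, 1, 1).expand(-1, -1, gray.size(2), gray.size(3))
        return torch.tanh(self.net(torch.cat([gray, p], dim=1)))

text_emb = torch.randn(1, 300)   # sentence embedding (assumed input)
palette = palette_gen(text_emb)  # 5 RGB colors
colored = Colorizer()(torch.randn(1, 1, 64, 64), palette)
```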

Book ChapterDOI
08 Sep 2018
TL;DR: Ablation studies confirm the best performance of the dual attention mechanism combined with late fusion, and MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models.
Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM uses the second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on PororoQA and MovieQA datasets which have large-scale QA annotations on cartoon videos and movies, respectively. For both datasets, MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models. We confirm the best performance of the dual attention mechanism combined with late fusion by ablation studies. We also perform qualitative analysis by visualizing the inference mechanisms of MDAM.
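
A hedged sketch of the processing pipeline named in the abstract: self-attention over frames and captions separately, a second question-guided attention over each, and multimodal fusion only afterwards (late fusion). The modules and the elementwise-product fusion are simplified assumptions.

```python
# Dual attention with late fusion, reduced to its data flow.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
self_att = nn.MultiheadAttention(d, 4, batch_first=True)

def question_attend(question, memory):
    # question: (B, d); memory: (B, L, d). Pool memory with question-guided weights.
    w = F.softmax(torch.bmm(memory, question.unsqueeze(-1)).squeeze(-1), dim=-1)
    return torch.bmm(w.unsqueeze(1), memory).squeeze(1)

frames, captions = torch.randn(2, 40, d), torch.randn(2, 30, d)
q = torch.randn(2, d)

frames_sa, _ = self_att(frames, frames, frames)         # latent visual concepts
captions_sa, _ = self_att(captions, captions, captions) # latent caption concepts
v = question_attend(q, frames_sa)                       # second (question) attention
t = question_attend(q, captions_sa)
fused = v * t                                           # fusion happens last (late fusion)
```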

Proceedings ArticleDOI
02 Sep 2018
TL;DR: This paper investigates regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, which leads to considerable performance improvements.
Abstract: In this research, we propose novel raw-waveform end-to-end DNNs for text-independent speaker verification. For speaker verification, many studies utilize the speaker embedding scheme, which trains deep neural networks as speaker identifiers to extract speaker features. However, this scheme has an intrinsic limitation in which the speaker feature, trained to classify only known speakers, is required to represent the identity of unknown speakers. Owing to this mismatch, speaker embedding systems tend to generalize well towards unseen utterances from known speakers, but are overfitted to known speakers. This phenomenon is referred to as speaker overfitting. In this paper, we investigated regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, which leads to considerable performance improvements. Technique effectiveness is evaluated using the VoxCeleb dataset, which comprises over 1,200 speakers from various uncontrolled environments. To the best of our knowledge, we are the first to verify the success of end-to-end DNNs operating directly on raw waveforms in a text-independent scenario. The system shows an equal error rate of 7.4%, which is lower than that of i-vector/probabilistic linear discriminant analysis and of end-to-end DNNs that use spectrograms.

Book ChapterDOI
08 Sep 2018
TL;DR: The evaluation protocol of the VisDrone-SOT2018 challenge and the results of a comparison of 22 trackers on the benchmark dataset are presented; both are publicly available on the challenge website.
Abstract: Single-object tracking, also known as visual tracking, on the drone platform attracts much attention recently with various applications in computer vision, such as filming and surveillance. However, the lack of commonly accepted annotated datasets and a standard evaluation platform prevents the development of algorithms. To address this issue, the Vision Meets Drone Single-Object Tracking (VisDrone-SOT2018) Challenge workshop was organized in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) to track and advance the technologies in this field. Specifically, we collect a dataset, including 132 video sequences divided into three non-overlapping sets, i.e., training (86 sequences with 69,941 frames), validation (11 sequences with 7,046 frames), and testing (35 sequences with 29,367 frames) sets. We provide fully annotated bounding boxes of the targets as well as several useful attributes, e.g., occlusion, background clutter, and camera motion. The tracking targets in these sequences include pedestrians, cars, buses, and animals. The dataset is extremely challenging due to various factors, such as occlusion, large scale, pose variation, and fast motion. We present the evaluation protocol of the VisDrone-SOT2018 challenge and the results of a comparison of 22 trackers on the benchmark dataset, which are publicly available on the challenge website: http://www.aiskyeye.com/. We hope this challenge largely boosts the research and development in single object tracking on drone platforms.

Posted Content
TL;DR: DialogWAE, a conditional Wasserstein autoencoder specially designed for dialogue modeling, is proposed; it models the distribution of data by training a GAN within the latent variable space and develops a Gaussian mixture prior network to enrich the latent space.
Abstract: Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks, and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms the state-of-the-art approaches in generating more coherent, informative and diverse responses.
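
The prior side of the design can be sketched as a Gaussian mixture prior network that transforms context-dependent noise into a latent sample; the component count, the Gumbel-softmax component selection, and the shapes below are assumptions.

```python
# Gaussian mixture prior network: context -> (component, mean, variance) -> z.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_ctx, d_z, K = 256, 64, 3
mix_logits = nn.Linear(d_ctx, K)       # which mixture component
mu = nn.Linear(d_ctx, K * d_z)
logvar = nn.Linear(d_ctx, K * d_z)

def sample_prior(context):
    # Pick a component per example (Gumbel-softmax keeps it differentiable).
    comp = F.gumbel_softmax(mix_logits(context), hard=True)  # (B, K)
    m = torch.einsum('bk,bkd->bd', comp, mu(context).view(-1, K, d_z))
    lv = torch.einsum('bk,bkd->bd', comp, logvar(context).view(-1, K, d_z))
    eps = torch.randn_like(m)                                # context-dependent noise
    return m + eps * torch.exp(0.5 * lv)

z = sample_prior(torch.randn(8, d_ctx))  # (8, 64), fed to the response decoder
```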

Proceedings ArticleDOI
27 Jun 2018
TL;DR: This work presents a scalable review-aware recommendation method, called SentiRec, that is guided to incorporate the sentiments of reviews when modeling the users and the items, and that drastically reduces the training time and the memory usage.
Abstract: Existing review-aware recommendation methods represent users (or items) through the concatenation of the reviews written by (or for) them, and depend entirely on convolutional neural networks (CNNs) to extract meaningful features for modeling users (or items). However, understanding reviews based only on the raw words of reviews is challenging because of the inherent ambiguity contained in them, originating from users' different tendencies in writing. Moreover, it is inefficient in time and memory to model users/items by the concatenation of their associated reviews, owing to the considerably large inputs to CNNs. In this work, we present a scalable review-aware recommendation method, called SentiRec, that is guided to incorporate the sentiments of reviews when modeling the users and the items. SentiRec is a two-step approach composed of the first step that includes the encoding of each review into a fixed-size review vector that is trained to embody the sentiment of the review, followed by the second step that generates recommendations based on the vector-encoded reviews. Through our experiments, we show that SentiRec not only outperforms the existing review-aware methods, but also drastically reduces the training time and the memory usage. We also conduct a qualitative evaluation on the vector-encoded reviews trained by SentiRec to demonstrate that the overall sentiments are indeed encoded therein.
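
The two-step recipe can be sketched as (1) a review encoder supervised by sentiment, then (2) a recommender consuming the frozen review vectors; the encoder below is a deliberately simplified assumption, not the paper's network.

```python
# Step 1: encode reviews into fixed-size vectors trained on sentiment.
# Step 2: build the recommender on the (detached) review vectors.
import torch
import torch.nn as nn

d = 64
embed = nn.EmbeddingBag(10000, d)   # mean of word embeddings per review
proj = nn.Linear(d, d)
sentiment_head = nn.Linear(d, 1)    # step-1 supervision target

tokens = torch.randint(0, 10000, (50,))
offsets = torch.tensor([0, 25])                 # two reviews of 25 tokens each
review_vec = proj(embed(tokens, offsets))       # (2, d) fixed-size review vectors
sentiment_loss = nn.BCEWithLogitsLoss()(
    sentiment_head(review_vec).squeeze(-1), torch.tensor([1.0, 0.0]))

rating = nn.Linear(d, 1)(review_vec.detach())   # step 2: recommend from frozen vectors
```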

Journal ArticleDOI
TL;DR: Single-molecule FRET is used to study λ-exonuclease, finding that metal-ion coordination is correlated with the enzymatic reaction steps and offering insights into the origin of dynamic heterogeneity in enzymatic catalysis.
Abstract: Metal ions at the active site of an enzyme act as cofactors, and their dynamic fluctuations can potentially influence enzyme activity. Here, we use λ-exonuclease as a model enzyme with two Mg2+ binding sites and probe activity at various concentrations of magnesium by single-molecule FRET. We find that while MgA2+ and MgB2+ have similar binding constants, the dissociation rate of MgA2+ is two orders of magnitude lower than that of MgB2+ due to a kinetic barrier difference. At physiological Mg2+ concentration, the MgB2+ ion near the 5'-terminal side of the scissile phosphate dissociates in each round of degradation, facilitating a series of DNA cleavages via fast product release concomitant with enzyme translocation. At a low magnesium concentration, occasional dissociation and slow re-coordination of MgA2+ result in pauses during processive degradation. Our study highlights the importance of metal-ion coordination dynamics in correlation with the enzymatic reaction steps, and offers insights into the origin of dynamic heterogeneity in enzymatic catalysis.

Posted Content
TL;DR: In this paper, the authors proposed a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM), which uses self-attention to learn the latent concepts in scene frames and captions.
Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM uses the second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on PororoQA and MovieQA datasets which have large-scale QA annotations on cartoon videos and movies, respectively. For both datasets, MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models. We confirm the best performance of the dual attention mechanism combined with late fusion by ablation studies. We also perform qualitative analysis by visualizing the inference mechanisms of MDAM.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: In this paper, a replay attack spoofing detection system for automatic speaker verification using multi-task learning of noise classes is proposed; the multi-task learning includes classifying the noise of playback devices, recording environments, and recording devices as well as the spoofing detection itself.
Abstract: In this paper, we propose a replay attack spoofing detection system for automatic speaker verification using multi-task learning of noise classes. We define the noise that is caused by the replay attack as replay noise. We explore the effectiveness of training a deep neural network simultaneously for replay attack spoofing detection and replay noise classification. The multi-task learning includes classifying the noise of playback devices, recording environments, and recording devices, as well as the spoofing detection. Each of the three types of noise classes also includes a genuine class. The experimental results on version 1.0 of the ASVspoof 2017 dataset demonstrate that the performance of our proposed system is relatively improved by 30% on the evaluation set.
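
The multi-task setup can be sketched as one shared trunk with a spoofing-detection head plus three replay-noise heads; the layer sizes and class counts below are assumptions (each noise head also carries a genuine class, per the abstract).

```python
# Shared trunk, four classification heads, summed cross-entropy losses.
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(400, 256), nn.ReLU())
heads = nn.ModuleDict({
    'spoof': nn.Linear(256, 2),         # genuine vs replay
    'playback': nn.Linear(256, 10),     # playback devices + genuine
    'environment': nn.Linear(256, 10),  # recording environments + genuine
    'recording': nn.Linear(256, 10),    # recording devices + genuine
})

def multitask_loss(features, labels):
    h = trunk(features)
    ce = nn.CrossEntropyLoss()
    return sum(ce(head(h), labels[name]) for name, head in heads.items())

x = torch.randn(16, 400)
y = {k: torch.randint(0, 2, (16,)) for k in heads}  # dummy labels
loss = multitask_loss(x, y)
```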

Patent
21 Sep 2018
TL;DR: In this paper, a method for processing personal data based on a block chain and a system thereof is presented, where a personal identification key is used to track and utilize the personal data for the corresponding user in the different services through the Personal Identification Key.
Abstract: The present invention provides a method for processing personal data based on a block chain and a system thereof. According to embodiments of the present invention, the method for processing personal data utilizes a personal identification key in a block chain network registered for a user in different services identifying the same user by different identifiers and provides personal data for the corresponding user on the block chain network, thereby tracking and utilizing the personal data for the corresponding user in the different services through the personal identification key. The method for processing personal data of a medium comprises the following steps of: managing the identifier of a member registered in the medium; interlocking the personal identification key which is issued by the block chain network by the member and identifies the user corresponding to the member with the identifier; and transmitting a block to participants of the block chain network by using the personal identification key so that the block including data related to an activity of the member is connected to the block chain.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: A new manually annotated dataset of user-generated data is created from the same domain as the training dataset but from other sources, and the differences between the new dataset and the standard ABSA dataset are analysed.
Abstract: In this paper, we test state-of-the-art Aspect Based Sentiment Analysis (ABSA) systems, trained on a widely used dataset, on actual data. We created a new manually annotated dataset of user-generated data from the same domain as the training dataset, but from other sources, and analyse the differences between the new and the standard ABSA dataset. We then analyse the performance of different versions of the same system on both datasets. We also propose light adaptation methods to increase system robustness.

Posted Content
TL;DR: An overview of the evolution of local features from handcrafted to deep-learning-based methods is presented, followed by a discussion of several benchmarks and papers evaluating such local features, to help the reader fully understand the topic of image and region description in order to make the best use of it in modern computer vision applications.
Abstract: This paper presents an overview of the evolution of local features from handcrafted to deep-learning-based methods, followed by a discussion of several benchmarks and papers evaluating such local features. Our investigations are motivated by 3D reconstruction problems, where the precise location of the features is important. As we describe these methods, we highlight and explain the challenges of feature extraction and potential ways to overcome them. We first present handcrafted methods, followed by methods based on classical machine learning and finally we discuss methods based on deep-learning. This largely chronologically-ordered presentation will help the reader to fully understand the topic of image and region description in order to make best use of it in modern computer vision applications. In particular, understanding handcrafted methods and their motivation can help to understand modern approaches and how machine learning is used to improve the results. We also provide references to most of the relevant literature and code.

Posted Content
TL;DR: In this paper, a teacher-student learning framework is applied to short utterance compensation for, to the authors' knowledge, the first time, and an integrated text-independent speaker verification system that takes utterances with a short duration of 2 seconds or less as input is proposed.
Abstract: The short duration of an input utterance is one of the most critical threats that degrade the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that inputs utterances with a short duration of 2 seconds or less. We propose an approach using a teacher-student learning framework for this goal, applied to short utterance compensation for the first time to our knowledge. The core concept of the proposed system is to conduct the compensation throughout the network that extracts the speaker embedding, mainly at the phonetic level, rather than compensating via a separate system after extracting the speaker embedding. In the proposed architecture, phonetic-level features, where each feature represents a segment of 130 ms, are extracted using convolutional layers. A layer of gated recurrent units extracts an utterance-level feature from the phonetic-level features. The proposed approach also adopts a new objective function for teacher-student learning that considers both the Kullback-Leibler divergence of the output layers and the cosine distance of the speaker embedding layers. Experiments were conducted on the VoxCeleb1 dataset using deep neural networks that take raw waveforms as input and output speaker embeddings. The proposed model could compensate for approximately 65% of the performance degradation due to the shortened duration.
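
The combined objective named in the abstract (KL divergence between teacher and student output layers plus cosine distance between their speaker embeddings) can be written down directly; the weighting factor and the tensor sizes below are assumptions.

```python
# Teacher-student objective: output-distribution KL + embedding cosine distance.
import torch
import torch.nn.functional as F

def ts_loss(student_logits, teacher_logits, student_emb, teacher_emb, alpha=1.0):
    # Teacher sees the full utterance; student sees the short crop.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction='batchmean')
    cos_dist = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
    return kl + alpha * cos_dist

loss = ts_loss(torch.randn(8, 1000), torch.randn(8, 1000),
               torch.randn(8, 512), torch.randn(8, 512))
```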

Posted Content
TL;DR: An LP-WaveNet vocoder is proposed, in which the complicated interactions between vocal source and vocal tract components are jointly trained within a mixture density network-based WaveNet model; it outperforms the conventional WaveNet vocoders both objectively and subjectively.
Abstract: We propose a linear prediction (LP)-based waveform generation method via the WaveNet vocoding framework. A WaveNet-based neural vocoder has significantly improved the quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target database contains a massive amount of acoustical information such as prosody, style, or expressiveness. As a solution, approaches that generate only the vocal source component with a neural vocoder have been proposed. However, they tend to generate synthetic noise because the vocal source component is handled independently, without considering the entire speech production process, which inevitably results in a mismatch between the vocal source and the vocal tract filter. To address this problem, we propose an LP-WaveNet vocoder, in which the complicated interactions between vocal source and vocal tract components are jointly trained within a mixture density network-based WaveNet model. The experimental results verify that the proposed system outperforms the conventional WaveNet vocoders both objectively and subjectively. In particular, the proposed method achieves 4.47 MOS within the TTS framework.

Posted Content
TL;DR: In this article, a WaveNet-based neural excitation model (ExcitNet) is proposed for statistical parametric speech synthesis systems, which employs an adaptive inverse filter to decouple spectral components from the speech signal.
Abstract: This paper proposes a WaveNet-based neural excitation model (ExcitNet) for statistical parametric speech synthesis systems. Conventional WaveNet-based neural vocoding systems significantly improve the perceptual quality of synthesized speech by statistically generating a time sequence of speech waveforms through an auto-regressive framework. However, they often suffer from noisy outputs because of the difficulties in capturing the complicated time-varying nature of speech signals. To improve modeling efficiency, the proposed ExcitNet vocoder employs an adaptive inverse filter to decouple spectral components from the speech signal. The residual component, i.e. excitation signal, is then trained and generated within the WaveNet framework. In this way, the quality of the synthesized speech signal can be further improved since the spectral component is well represented by a deep learning framework and, moreover, the residual component is efficiently generated by the WaveNet framework. Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.
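
The decoupling step, separating the spectral envelope from the waveform so that only the residual excitation is modeled, can be sketched with standard LP analysis; the librosa/scipy tooling below is an assumption about implementation, not the authors' code.

```python
# LP inverse filtering: waveform -> excitation (residual) -> waveform.
import numpy as np
import scipy.signal
import librosa

x, sr = librosa.load(librosa.example('trumpet'), sr=16000)  # stand-in signal
frame = x[:400] * np.hanning(400)                           # one windowed frame

a = librosa.lpc(frame, order=16)                     # a[0] == 1, LP coefficients
excitation = scipy.signal.lfilter(a, [1.0], frame)   # inverse filter -> residual
reconstructed = scipy.signal.lfilter([1.0], a, excitation)  # synthesis filter

# The analysis/synthesis pair is lossless up to floating-point error.
assert np.allclose(frame, reconstructed, atol=1e-4)
```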

Posted Content
TL;DR: A new block called Concentrated-Comprehensive Convolution (C3) is proposed, which applies asymmetric convolutions before the depth-wise separable dilated convolution to compensate for the information loss due to the dilated convolution.
Abstract: One of the practical choices for making a lightweight semantic segmentation model is to combine a depth-wise separable convolution with a dilated convolution. However, the simple combination of these two methods results in an over-simplified operation which causes severe performance degradation due to loss of information contained in the feature map. To resolve this problem, we propose a new block called Concentrated-Comprehensive Convolution (C3) which applies the asymmetric convolutions before the depth-wise separable dilated convolution to compensate for the information loss due to dilated convolution. The C3 block consists of a concentration stage and a comprehensive convolution stage. The first stage uses two depth-wise asymmetric convolutions for compressed information from the neighboring pixels to alleviate the information loss. The second stage increases the receptive field by using a depth-wise separable dilated convolution from the feature map of the first stage. We applied the C3 block to various segmentation frameworks (ESPNet, DRN, ERFNet, ENet) for proving the beneficial properties of our proposed method. Experimental results show that the proposed method preserves the original accuracies on Cityscapes dataset while reducing the complexity. Furthermore, we modified ESPNet to achieve about 2% better performance while reducing the number of parameters by half and the number of FLOPs by 35% compared with the original ESPNet. Finally, experiments on ImageNet classification task show that C3 block can successfully replace dilated convolutions.
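
From the abstract's description, a C3 block can be sketched as two depth-wise asymmetric convolutions (the concentration stage) followed by a depth-wise separable dilated convolution (the comprehensive stage); the kernel sizes and dilation rate below are illustrative.

```python
# C3-style block: concentration stage, then comprehensive stage.
import torch
import torch.nn as nn

class C3Block(nn.Module):
    def __init__(self, c, k=3, dilation=2):
        super().__init__()
        self.concentration = nn.Sequential(
            nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c),  # depth-wise, vertical
            nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c),  # depth-wise, horizontal
        )
        self.comprehensive = nn.Sequential(
            nn.Conv2d(c, c, k, padding=dilation * (k // 2),
                      dilation=dilation, groups=c),                  # depth-wise dilated
            nn.Conv2d(c, c, 1),                                      # point-wise
        )

    def forward(self, x):
        return self.comprehensive(self.concentration(x))

y = C3Block(64)(torch.randn(1, 64, 56, 56))  # spatial size preserved
```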

Journal ArticleDOI
TL;DR: The experimental results from two tasks of knowledge graph embedding prove that the proposed method not only incorporates new knowledge of new triples into the existing embedding successfully but also preserves the knowledge of the current embedding.
Abstract: This paper addresses an enrichment of translation-based knowledge graph embeddings. When new knowledge triples become available after a knowledge graph is embedded onto a vector space, the embedding should be enriched with the new triples, but without the triples used in training the embedding. The main challenge is that the enrichment of new triples should be accomplished without forgetting the knowledge of current embedding. This paper achieves the goal by minimizing a risk over the new triples penalized by rapid parameter change between old and new embedding models. The effectiveness of the proposed method is shown by learning a translation-based knowledge graph embedding trained incrementally using a series of knowledge triples. The experimental results from two tasks of knowledge graph embedding prove that the proposed method not only incorporates new knowledge of new triples into the existing embedding successfully but also preserves the knowledge of the current embedding.
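
The objective described, a risk over the new triples penalized by rapid parameter change, can be sketched with a TransE-style margin loss plus an L2 drift penalty; TransE and the penalty weight are illustrative choices consistent with "translation-based", not necessarily the paper's exact model.

```python
# Incremental enrichment: margin loss on new triples + penalty on drift
# away from the old embedding parameters.
import torch

def enrichment_loss(ent, rel, ent_old, rel_old, pos, neg, margin=1.0, lam=10.0):
    def score(t):  # TransE energy ||h + r - t|| for triples t = (head, rel, tail)
        h, r, tail = ent[t[:, 0]], rel[t[:, 1]], ent[t[:, 2]]
        return (h + r - tail).norm(p=2, dim=-1)
    risk = torch.relu(margin + score(pos) - score(neg)).mean()   # new triples
    drift = ((ent - ent_old) ** 2).sum() + ((rel - rel_old) ** 2).sum()
    return risk + lam * drift                                    # keep old knowledge

ent = torch.randn(100, 50, requires_grad=True)
rel = torch.randn(20, 50, requires_grad=True)
# Indices kept small so they are valid for both entity and relation tables.
pos = torch.randint(0, 20, (32, 3))
neg = torch.randint(0, 20, (32, 3))
loss = enrichment_loss(ent, rel, ent.detach().clone(), rel.detach().clone(), pos, neg)
```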

Journal ArticleDOI
TL;DR: In this article, a generic mathematical framework is proposed which extends a low-dimensional manifold regularization in the conventional sigmoid space to an inverse elastic source problem with sparse measurements.
Abstract: An inverse elastic source problem with sparse measurements is our concern. A generic mathematical framework is proposed which extends a low-dimensional manifold regularization in the conventional sigmoid space to an inverse elastic source problem with sparse measurements.

Proceedings ArticleDOI
06 Dec 2018
TL;DR: A novel disease prediction method, EHAN (EHR History-based prediction using Attention Network), based on a recurrent neural network (RNN) and an attention mechanism is proposed; it outperformed the state-of-the-art model with respect to various performance metrics.
Abstract: Precise prediction of severe diseases resulting in mortality is one of the main issues in medical fields. Even if pathological and radiological measurements provide competitive precision, they usually require large costs of time and expense to obtain and analyze the data for prediction. Recently, end-to-end approaches based on deep neural networks have been proposed; however, they still suffer from low classification performance and difficulties of interpretation. In this study, we propose a novel disease prediction method, EHAN (EHR History-based prediction using Attention Network), based on the recurrent neural network (RNN) and attention mechanism. The proposed method incorporates (1) bidirectional gated recurrent units (GRU) for automated sequential modeling, (2) an attention mechanism for improving long-term dependency modeling, and (3) RNN-based gradient-weighted class activation mapping (Grad-CAM) to visualize the class-specific attention weights. We conducted experiments to predict the occurrence of risky diseases, including cardiovascular and cerebrovascular diseases, from more than 40,000 hypertension patients' electronic health records (EHR). The results showed that the proposed method outperformed the state-of-the-art model with respect to various performance metrics. Furthermore, we confirmed that the proposed visualization methods can be used to assist data-driven discovery.
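
Components (1) and (2), a bidirectional GRU with attention pooling over the visit sequence, can be sketched directly; the input sizes below are illustrative, and the Grad-CAM visualization step is omitted.

```python
# Bidirectional GRU over EHR visits with attention pooling for prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EhanStyleModel(nn.Module):
    def __init__(self, d_in=128, d_h=64):
        super().__init__()
        self.gru = nn.GRU(d_in, d_h, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * d_h, 1)  # attention over time steps
        self.out = nn.Linear(2 * d_h, 1)  # disease-occurrence logit

    def forward(self, x):
        h, _ = self.gru(x)                              # (B, T, 2*d_h)
        w = F.softmax(self.att(h).squeeze(-1), dim=-1)  # (B, T) attention weights
        ctx = torch.bmm(w.unsqueeze(1), h).squeeze(1)   # weighted sum of states
        return self.out(ctx), w                         # w doubles as an explanation

logit, weights = EhanStyleModel()(torch.randn(4, 30, 128))
```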