
Showing papers by "Yongfeng Huang published in 2018"


Journal ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Zhigang Yuan1, Yongfeng Huang1 
TL;DR: This paper proposes a hybrid unsupervised method that combines rules and machine learning to address the ATE and OTE tasks; it uses chunk-level linguistic rules to extract nominal phrase chunks and regards them as candidate opinion targets and aspects.
Abstract: Aspect term extraction (ATE) and opinion target extraction (OTE) are two important tasks in the fine-grained sentiment analysis field. Existing approaches to ATE and OTE are mainly based on rules or machine learning methods. Rule-based methods are usually unsupervised, but they cannot make use of high-level features. Although supervised learning approaches usually outperform the rule-based ones, they need a large number of labeled samples to train their models, which are expensive and time-consuming to annotate. In this paper, we propose a hybrid unsupervised method which can combine rules and machine learning methods to address ATE and OTE tasks. First, we use chunk-level linguistic rules to extract nominal phrase chunks and regard them as candidate opinion targets and aspects. Then we propose to filter irrelevant candidates based on domain correlation. Finally, we use these texts with extracted chunks as pseudo labeled data to train a deep gated recurrent unit (GRU) network for aspect term extraction and opinion target extraction. The experiments on benchmark datasets validate the effectiveness of our approach in extracting opinion targets and aspects with minimal manual annotation.
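
For readers who want a concrete picture of the final stage, the sketch below (not the authors' code) shows a bidirectional deep GRU tagger of the kind described, trained on pseudo-labeled BIO tags produced by the chunk-level rules; the vocabulary size, dimensions, and tag set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_tags=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Two stacked bidirectional GRU layers ("deep GRU")
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)   # B / I / O tags

    def forward(self, token_ids):
        emb = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        hidden, _ = self.gru(emb)             # (batch, seq_len, 2*hidden_dim)
        return self.out(hidden)               # per-token tag logits

# Training on the rule-generated pseudo labels is ordinary token-level cross-entropy:
model = GRUTagger()
tokens = torch.randint(0, 10000, (4, 20))     # dummy pseudo-labeled batch
pseudo_tags = torch.randint(0, 3, (4, 20))
loss = nn.CrossEntropyLoss()(model(tokens).reshape(-1, 3), pseudo_tags.reshape(-1))
loss.backward()
```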

78 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed model can achieve 98.67% accuracy and 96.02% recall, which strongly supports that using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis is feasible and effective.
Abstract: Automatically extracting useful information from electronic medical records along with conducting disease diagnoses is a promising task for both clinical decision support (CDS) and natural language processing (NLP). Most of the existing systems are based on artificially constructed knowledge bases, and auxiliary diagnosis is then done by rule matching. In this study, we present a clinical intelligent decision approach based on Convolutional Neural Networks (CNN), which can automatically extract high-level semantic information from electronic medical records and then perform automatic diagnosis without artificial construction of rules or knowledge bases. We use 18,590 collected real-world clinical electronic medical records to train and test the proposed model. Experimental results show that the proposed model can achieve 98.67% accuracy and 96.02% recall, which strongly supports that using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis is feasible and effective.
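
The abstract does not give the network details, but a minimal text-CNN classifier of the kind described might look as follows; the vocabulary size, filter sizes, and number of diagnosis classes are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMRCNNClassifier(nn.Module):
    """Convolutions over the embedded record text, max-pooled, then a linear layer."""
    def __init__(self, vocab_size=20000, emb_dim=128, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_diagnoses=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_diagnoses)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        # Each kernel size captures n-gram-like local semantic features;
        # max-pooling keeps the strongest activation per filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))          # diagnosis logits

logits = EMRCNNClassifier()(torch.randint(0, 20000, (2, 300)))
```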

74 citations


Journal ArticleDOI
Zhigang Yuan1, Sixing Wu1, Fangzhao Wu2, Junxin Liu1, Yongfeng Huang1 
TL;DR: A domain attention model for multi-domain sentiment analysis based on multi-task learning that can extract the most discriminative features from a shared hidden layer in a more compact way.
Abstract: Sentiment classification is widely known as a domain-dependent problem. In order to learn an accurate domain-specific sentiment classifier, a large number of labeled samples are needed, which are expensive and time-consuming to annotate. Multi-domain sentiment analysis based on multi-task learning can leverage labeled samples in each single domain, which can alleviate the need for a large amount of labeled data in all domains. In this paper, we propose a domain attention model for multi-domain sentiment analysis. In our approach, the domain representation is used as attention to select the most domain-related features in each domain. The domain representation is obtained through an auxiliary domain classification task, which works as a domain regularizer. In this way, both shared and domain-specific features for sentiment classification are extracted simultaneously. In contrast with existing multi-domain sentiment classification methods, our approach can extract the most discriminative features from a shared hidden layer in a more compact way. Experimental results on two multi-domain sentiment datasets validate the effectiveness of our approach.
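
A minimal sketch of the core idea, assuming a learnable domain embedding is used as the attention query over a shared encoder; the paper derives the domain representation from an auxiliary domain classification task, which is reduced here to an auxiliary output head, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAttention(nn.Module):
    """Shared encoder; the domain embedding scores each hidden state so that
    the most domain-related features are emphasized before classification."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_domains=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.domain_emb = nn.Embedding(num_domains, 2 * hidden_dim)
        self.sentiment_out = nn.Linear(2 * hidden_dim, 2)        # pos / neg
        self.domain_out = nn.Linear(2 * hidden_dim, num_domains)  # auxiliary task

    def forward(self, token_ids, domain_ids):
        hidden, _ = self.encoder(self.embedding(token_ids))      # (B, T, 2H)
        query = self.domain_emb(domain_ids).unsqueeze(2)          # (B, 2H, 1)
        weights = F.softmax(torch.bmm(hidden, query), dim=1)      # attention over tokens
        doc = (hidden * weights).sum(dim=1)                       # domain-attended vector
        return self.sentiment_out(doc), self.domain_out(doc)

sent_logits, dom_logits = DomainAttention()(
    torch.randint(0, 10000, (3, 40)), torch.tensor([0, 2, 1]))
```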

70 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This model combines CNN and LSTM layers to utilize both local and long-range contextual information for identifying metaphors in plain texts at the word level.
Abstract: Metaphors are figurative language widely used in daily life and literature. Detecting the metaphors evoked by texts is an important task. Thus, the metaphor shared task aims to extract metaphors from plain texts at the word level. We propose to use a CNN-LSTM model for this task. Our model combines CNN and LSTM layers to utilize both local and long-range contextual information for identifying metaphorical information. In addition, we compare the performance of the softmax classifier and conditional random field (CRF) for sequential labeling in this task. We also incorporated some additional features such as part-of-speech (POS) tags and word clusters to improve the performance of the model. Our best model achieved a 65.06% F-score in the all-POS testing subtask and 67.15% in the verbs testing subtask.
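
As a rough illustration of the architecture (using the softmax classifier rather than the CRF variant, and leaving out the POS and word-cluster features), a CNN-LSTM word-level tagger could be sketched as follows; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMTagger(nn.Module):
    """The CNN captures local context around each word; the LSTM adds
    long-range context before per-word metaphor/literal classification."""
    def __init__(self, vocab_size=10000, emb_dim=100, conv_channels=64,
                 hidden_dim=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # padding=1 keeps the sequence length so every word stays labelable
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)   # (B, emb_dim, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)    # (B, T, conv_channels)
        hidden, _ = self.lstm(x)
        return self.out(hidden)                          # per-word logits

logits = CNNLSTMTagger()(torch.randint(0, 10000, (2, 25)))
```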

67 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: A system based on a densely connected LSTM network with multi-task learning strategy that detects the ironic tweets and their ironic types and includes several types of features to improve the model performance.
Abstract: Detecting irony is an important task for mining fine-grained information from social web messages. Therefore, SemEval-2018 Task 3 aims to detect ironic tweets (subtask A) and their ironic types (subtask B). In order to address this task, we propose a system based on a densely connected LSTM network with a multi-task learning strategy. In our dense LSTM model, each layer takes all outputs from previous layers as input. The last LSTM layer outputs the hidden representations of the texts, and they are used in three classification tasks. In addition, we incorporate several types of features to improve the model performance. Our model achieved an F-score of 70.54 (ranked 2/43) in subtask A and 49.47 (ranked 3/29) in subtask B. The experimental results validate the effectiveness of our system.
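
The dense connectivity described above (each LSTM layer consuming the outputs of all previous layers) can be sketched as follows; the layer count and dimensions are illustrative, and the task-specific heads and extra features are omitted.

```python
import torch
import torch.nn as nn

class DenseLSTMEncoder(nn.Module):
    """Each LSTM layer receives the concatenation of the embeddings and the
    outputs of all previous LSTM layers, mirroring the dense connectivity
    described above."""
    def __init__(self, emb_dim=100, hidden_dim=64, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = emb_dim
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_dim, hidden_dim, batch_first=True,
                                       bidirectional=True))
            in_dim += 2 * hidden_dim   # later layers also see this layer's output

    def forward(self, emb):
        features = [emb]
        for lstm in self.layers:
            out, _ = lstm(torch.cat(features, dim=-1))
            features.append(out)
        return features[-1]            # hidden representation of the text

# Multi-task heads (e.g., subtask A and subtask B) would share this encoder:
encoder = DenseLSTMEncoder()
text_repr = encoder(torch.randn(2, 30, 100)).mean(dim=1)   # simple pooling
```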

63 citations


Journal ArticleDOI
TL;DR: Inspired by the characteristics of Twitter100k, a method to integrate optical character recognition into cross-media retrieval is proposed, and the experimental results show that the proposed method improves the baseline performance.
Abstract: This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS Wide, and Flickr30k, have two major limitations. First, these datasets are lacking in content diversity, i.e., only some predefined classes are covered. Second, texts in these datasets are written in well-organized language, leading to inconsistency with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: it has 100,000 image–text pairs randomly crawled from Twitter, and thus, has no constraint in the image categories; and text in Twitter100k is written in informal language by the users. Since strongly supervised methods leverage the class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the correspondence AutoEncoder, along with various text features on Wikipedia, Flickr30k, and Twitter100k. As a minor contribution, we also design a deep neural network to learn cross-modal embeddings for Twitter100k. Inspired by the characteristics of Twitter100k, we propose a method to integrate optical character recognition into cross-media retrieval. The experimental results show that the proposed method improves the baseline performance.

55 citations


Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Junxin Liu1, Sixing Wu1, Yongfeng Huang1, Xing Xie2 
01 Oct 2018
TL;DR: This paper describes a neural approach with hierarchical tweet representation and multi-head self-attention (HTR-MSA) for both tasks of the third Social Media Mining for Health Applications (SMM4H) workshop, which aims to detect the tweets mentioning drug names and adverse drug reactions.
Abstract: This paper describes our system for the first and third shared tasks of the third Social Media Mining for Health Applications (SMM4H) workshop, which aims to detect the tweets mentioning drug names and adverse drug reactions. In our system we propose a neural approach with hierarchical tweet representation and multi-head self-attention (HTR-MSA) for both tasks. Our system achieved the first place in both the first and third shared tasks of SMM4H with an F-score of 91.83% and 52.20% respectively.
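
The system's code is not reproduced here; the sketch below illustrates only the word-level multi-head self-attention stage of such a hierarchical encoder, with the character-level CNN stage omitted and all dimensions assumed.

```python
import torch
import torch.nn as nn

class SelfAttentiveTweetEncoder(nn.Module):
    """Word-level part of a hierarchical encoder: multi-head self-attention
    over word vectors, then mean pooling into a tweet vector for classification."""
    def __init__(self, vocab_size=30000, emb_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        # Each head lets every word attend to every other word in the tweet.
        attended, _ = self.attn(x, x, x)
        return self.fc(attended.mean(dim=1))   # tweet-level logits

logits = SelfAttentiveTweetEncoder()(torch.randint(0, 30000, (4, 30)))
```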

31 citations


Posted Content
TL;DR: This paper proposes a steganography method which can automatically generate steganographic text based on the Markov chain model and Huffman coding and shows that the performance of the proposed model is superior to all the previous related methods in terms of information imperceptibility and information hidden capacity.
Abstract: Steganography, as one of the three basic information security systems, has long played an important role in safeguarding the privacy and confidentiality of data in cyberspace. Text is the most widely used information carrier in people's daily life, so using text as a carrier for information hiding has broad research prospects. However, due to the high degree of coding and the low information redundancy of text, hiding information in it has long been an extremely challenging problem. In this paper, we propose a steganography method which can automatically generate steganographic text based on the Markov chain model and Huffman coding. It can automatically generate a fluent text carrier according to the secret information that needs to be embedded. The proposed model can learn from a large number of samples written by people and obtain a good estimate of the statistical language model. We evaluated the proposed model from several perspectives. Experimental results show that the performance of the proposed model is superior to all the previous related methods in terms of information imperceptibility and information hiding capacity.
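
A minimal sketch of the coding step, assuming a toy conditional distribution in place of the trained Markov chain model: the candidate next words are Huffman-coded by probability, and the word whose code matches the next bits of the secret stream is emitted.

```python
import heapq

def huffman_code(candidates):
    """Build a Huffman code over candidate next words, given their conditional
    probabilities (word -> probability) from the Markov chain model."""
    codes = {w: "" for w in candidates}
    heap = [(p, i, [w]) for i, (w, p) in enumerate(candidates.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, i2, group2 = heapq.heappop(heap)
        for w in group1:
            codes[w] = "0" + codes[w]
        for w in group2:
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (p1 + p2, i2, group1 + group2))
    return codes

def embed(bits, start_word, next_word_dist, max_len=10):
    """Generate a cover sentence whose word choices encode `bits`.
    `next_word_dist(prefix)` stands in for the trained Markov chain model."""
    words, pos = [start_word], 0
    while pos < len(bits) and len(words) < max_len:
        codes = huffman_code(next_word_dist(words))
        remaining = bits[pos:]
        # Huffman codes are prefix-free, so at most one code prefixes `remaining`.
        word = next((w for w, c in codes.items() if remaining.startswith(c)), None)
        if word is None:      # tail shorter than any code: finish on a matching word
            word = next(w for w, c in codes.items() if c.startswith(remaining))
            pos = len(bits)
        else:
            pos += len(codes[word])
        words.append(word)
    return " ".join(words)

# Toy conditional distribution standing in for a model trained on a large corpus:
toy_model = lambda prefix: {"sky": 0.5, "sea": 0.3, "sun": 0.2}
print(embed("1011", "the", toy_model))   # -> "the sun sea"
```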

30 citations


Journal ArticleDOI
17 Apr 2018-Entropy
TL;DR: A modified resilient backpropagation (MRPROP) algorithm is implemented to improve the convergence and efficiency of CNN training and a tolerant band is introduced to avoid network overtraining.
Abstract: The convolutional neural network (CNN) has achieved state-of-the-art performance in many computer vision applications, e.g., classification, recognition, and detection. However, the global optimization of CNN training is still a problem. Fast classification and training play a key role in the development of the CNN. We hypothesize that the smoother and more optimized the training of a CNN is, the more efficient the end result becomes. Therefore, in this paper, we implement a modified resilient backpropagation (MRPROP) algorithm to improve the convergence and efficiency of CNN training. In particular, a tolerant band is introduced to avoid network overtraining, and it is combined with the global-best concept in the weight-updating criteria to allow the training algorithm of the CNN to optimize its weights more swiftly and precisely. For comparison, we present and analyze four different training algorithms for the CNN along with MRPROP, i.e., resilient backpropagation (RPROP), Levenberg-Marquardt (LM), conjugate gradient (CG), and gradient descent with momentum (GDM). Experimental results showcase the merit of the proposed approach on a public face and skin dataset.
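
For reference, the core resilient-backpropagation update that MRPROP modifies looks roughly like the sketch below; the tolerant band and global-best terms of the paper are not reproduced, and all constants are the usual textbook defaults rather than the paper's settings.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One resilient-backpropagation update: per-weight step sizes grow while
    the gradient keeps its sign and shrink when it flips; only the sign of
    the gradient is used for the weight update."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)   # skip the update after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step

# Toy usage: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([3.0, -2.0])
step = np.full_like(w, 0.1)
prev_grad = np.zeros_like(w)
for _ in range(20):
    grad = w
    w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
print(w)   # approaches the minimum at the origin
```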

29 citations


Book ChapterDOI
08 Jun 2018
TL;DR: Experimental results show that the proposed model can efficiently implement the embedding and extraction of information, and the generated dialogue texts are of high quality, which indicates high concealment.
Abstract: Steganography based on texts has always been a hot but extremely hard research topic. Due to the high coding characteristics of text compared to other information carriers, the redundancy of information is very low, which makes it really difficult to hide information inside. In this paper, combining a recurrent neural network (RNN) with reinforcement learning (RL), we designed and implemented a real-time interactive text steganography model (RITS). The proposed model can automatically generate semantically coherent and syntactically correct dialogues based on the input sentence, realizing secret information hiding and transmission through reasonable encoding of the text in the dialogue generation process. We trained our model using a publicly collected dataset which contains 5808 dialogues and evaluated the proposed model from several perspectives. Experimental results show that the proposed model can be very efficient in implementing the embedding and extraction of information. The generated dialogue texts are of high quality, which shows high concealment.

23 citations


Journal ArticleDOI
TL;DR: A new similarity measurement in the embedded space is introduced, which significantly improved system performance compared with the conventional Euclidean distance and demonstrated the efficiency of the proposed retrieval method on three different datasets to simplify and improve general image retrieval.
Abstract: The selection of semantic concepts for model construction and data collection remains an open research issue. It is highly demanding to choose good multimedia concepts with small semantic gaps to facilitate the work of cross-media system developers. However, very little work has been done in this area. This paper contributes a new, real-world web image dataset for cross-media retrieval called FB5K. The proposed FB5K dataset has the following attributes: 1) 5130 images crawled from Facebook; 2) images that are categorized according to users' feelings; 3) images that are independent of text and language, with feelings used for search instead. Furthermore, we propose a novel approach through the use of Optical Character Recognition and explicit incorporation of high-level semantic information. We comprehensively compute the performance of four different subspace-learning methods and three modified versions of the Correspondence Auto Encoder, alongside numerous text features and similarity measurements, comparing Wikipedia, Flickr30k, and FB5K. To check the characteristics of FB5K, we propose a semantic-based cross-media retrieval method. To accomplish cross-media retrieval, we introduced a new similarity measurement in the embedded space, which significantly improved system performance compared with the conventional Euclidean distance. Our experimental results demonstrated the efficiency of the proposed retrieval method on three different datasets to simplify and improve general image retrieval.


Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Yongfeng Huang1, Xing Xie2 
08 Oct 2018
TL;DR: A neural approach to predict multiple emojis evoked by plain tweets using convolutional neural network and a multi-label classification module that outperforms several automatic baselines as well as humans in this task.
Abstract: With the development of social media, a huge number of users are attracted by social platforms such as Twitter. Emojis are widely used by social network users when posting messages. Therefore, it is important to mine the relationships between plain texts and emojis. In this paper, we present a neural approach to predict multiple emojis evoked by plain tweets. Our model contains three modules: a character encoder to learn representations of words from original characters using a convolutional neural network (CNN), a sentence encoder to learn representations of sentences using a combination of a long short-term memory (LSTM) network and CNN, and a multi-label classification module to predict the emojis evoked by a tweet. Besides, an attention mechanism is applied at the word level to select important contexts. Our approach is self-labeling and free from expensive and time-consuming manual annotation. Experiments on real-world datasets show that our model outperforms several automatic baselines as well as humans in this task.

Posted Content
TL;DR: Experimental results show that the proposed text steganalysis method (TS-CNN) can achieve nearly 100% precision and recall, outperforming all previous methods, and can even estimate the capacity of the hidden information inside.
Abstract: Steganalysis has been an important research topic in cybersecurity that helps to identify covert attacks in public networks. With the rapid development of natural language processing technology in the past two years, coverless steganography has been greatly developed. Previous text steganalysis methods have shown unsatisfactory results on this new steganography technique, which remains an unsolved challenge. Different from all previous text steganalysis methods, in this paper we propose a text steganalysis method (TS-CNN) based on semantic analysis, which uses a convolutional neural network (CNN) to extract high-level semantic features of texts and finds the subtle distribution differences in the semantic space before and after embedding the secret information. To train and test the proposed model, we collected and released a large text steganalysis (CT-Steg) dataset, which contains a total of 216,000 texts with various lengths and various embedding rates. Experimental results show that the proposed model can achieve nearly 100% precision and recall, outperforming all previous methods. Furthermore, the proposed model can even estimate the capacity of the hidden information inside. These results strongly support that using the subtle changes in the semantic space before and after embedding the secret information to conduct text steganalysis is feasible and effective.

Posted Content
TL;DR: An automatic audio generation-based steganography (AAG-Stega), which can automatically generate high-quality audio covers on the basis of the secret bits stream that needs to be embedded, and can guarantee high hidden capacity and concealment at the same time.
Abstract: Steganography, as one of the three basic information security systems, has long played an important role in safeguarding the privacy and confidentiality of data in cyberspace. Audio is one of the most common means of information transmission in our daily life. Thus it is of great practical significance to use audio as a carrier of information hiding. At present, almost all audio-based information hiding methods are based on the carrier modification mode. However, this mode is equivalent to adding noise to the original signal, resulting in a difference in the statistical feature distribution of the carrier before and after steganography, which impairs the concealment of the entire system. In this paper, we propose an automatic audio generation-based steganography (AAG-Stega), which can automatically generate high-quality audio covers on the basis of the secret bit stream that needs to be embedded. In the automatic audio generation process, we reasonably encode the conditional probability distribution space of each sampling point and select the corresponding signal output according to the bitstream to realize the secret information embedding. We designed several experiments to test the proposed model from the perspectives of information imperceptibility and information hiding capacity. The experimental results show that the proposed model can guarantee high hidden capacity and concealment at the same time.
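
As a loose illustration of bit-driven generation (the paper's actual coding of the conditional probability space may differ), a fixed-length binning of the most probable sample values could look like this; the distribution below is a random stand-in for the generator's output.

```python
import numpy as np

def embed_bits_in_sample(probs, bitstream, pos, bits_per_sample=2):
    """Fixed-length coding sketch: keep the 2^k most likely candidate sample
    values and let the next k secret bits choose which one is emitted."""
    k = bits_per_sample
    top = np.argsort(probs)[::-1][:2 ** k]            # candidate sample indices
    chunk = bitstream[pos:pos + k].ljust(k, "0")       # pad the final chunk
    return int(top[int(chunk, 2)]), pos + k            # chosen sample, new position

# Toy generator distribution over 8 quantized sample values:
rng = np.random.default_rng(0)
bits, pos, samples = "101101", 0, []
while pos < len(bits):
    probs = rng.dirichlet(np.ones(8))                  # stands in for the model output
    sample, pos = embed_bits_in_sample(probs, bits, pos)
    samples.append(sample)
print(samples)
```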

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This work proposes a system based on an attention CNN-LSTM model, in which LSTM is used to extract long-term contextual information from texts and attention techniques are applied to select this information.
Abstract: Traditional sentiment analysis approaches mainly focus on classifying the sentiment polarities or emotion categories of texts. However, they cannot exploit sentiment intensity information. Therefore, SemEval-2018 Task 1 aims to automatically determine the intensity of emotions or sentiment of tweets to mine fine-grained sentiment information. In order to address this task, we propose a system based on an attention CNN-LSTM model. In our model, LSTM is used to extract the long-term contextual information from texts. We apply attention techniques to select this information. A CNN layer with different kernel sizes is used to extract local features. The dense layers take the pooled CNN feature maps and predict the intensity scores. Our system reaches an average Pearson correlation score of 0.722 (ranked 12/48) in the emotion intensity regression task and 0.810 (ranked 15/38) in the valence regression task, which indicates that our system can be further extended.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A residual CNN-LSTM with attention (RCLA) model is proposed that combines CNN and LSTM layers to capture both local and long-range contextual information for tweet representation and incorporated additional features such as POS tags and sentiment features extracted from lexicons.
Abstract: Emojis are widely used by social media and social network users when posting their messages. It is important to study the relationships between messages and emojis. Thus, SemEval-2018 Task 2 proposes an interesting and challenging task: predicting which emojis are evoked by text-based tweets. We propose a residual CNN-LSTM with attention (RCLA) model for this task. Our model combines CNN and LSTM layers to capture both local and long-range contextual information for tweet representation. In addition, an attention mechanism is used to select important components. Besides, residual connections are applied to the CNN layers to facilitate the training of neural networks. We also incorporated additional features such as POS tags and sentiment features extracted from lexicons. Our model achieved a 30.25% macro-averaged F-score in the first subtask (i.e., emoji prediction in English), ranking 7th out of 48 participants.

Posted Content
Junxin Liu1, Fangzhao Wu2, Chuhan Wu1, Yongfeng Huang1, Xing Xie2 
TL;DR: The authors propose two methods to exploit dictionary information for Chinese word segmentation, one based on pseudo labeled data generation and the other based on multi-task learning, which can effectively improve the performance of CWS.
Abstract: Chinese word segmentation (CWS) is an important task for Chinese NLP. Recently, many neural network based methods have been proposed for CWS. However, these methods require a large number of labeled sentences for model training, and usually cannot utilize the useful information in Chinese dictionary. In this paper, we propose two methods to exploit the dictionary information for CWS. The first one is based on pseudo labeled data generation, and the second one is based on multi-task learning. The experimental results on two benchmark datasets validate that our approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.

Book ChapterDOI
Junxin Liu1, Fangzhao Wu2, Chuhan Wu1, Yongfeng Huang1, Xing Xie2 
26 Aug 2018
TL;DR: The experimental results on two benchmark datasets validate that the proposed approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.
Abstract: Chinese word segmentation (CWS) is an important task for Chinese NLP. Recently, many neural network based methods have been proposed for CWS. However, these methods require a large number of labeled sentences for model training, and usually cannot utilize the useful information in Chinese dictionary. In this paper, we propose two methods to exploit the dictionary information for CWS. The first one is based on pseudo labeled data generation, and the second one is based on multi-task learning. The experimental results on two benchmark datasets validate that our approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.
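
One plausible instantiation of the pseudo-labeled data generation step is forward maximum matching against the dictionary to produce BMES tags for unlabeled sentences; the paper's exact procedure may differ, and the dictionary below is only a toy example.

```python
def pseudo_label(sentence, dictionary, max_word_len=4):
    """Generate pseudo BMES segmentation labels for an unlabeled sentence by
    forward maximum matching against a word dictionary."""
    tags, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary word starting at position i.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in dictionary:
                break
        if length == 1:
            tags.append("S")                            # single-character word
        else:
            tags.extend(["B"] + ["M"] * (length - 2) + ["E"])
        i += length
    return list(zip(sentence, tags))

dictionary = {"北京", "天安门", "广场"}
print(pseudo_label("我爱北京天安门广场", dictionary))
# [('我','S'), ('爱','S'), ('北','B'), ('京','E'),
#  ('天','B'), ('安','M'), ('门','E'), ('广','B'), ('场','E')]
```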

Book ChapterDOI
08 Jun 2018
TL;DR: A new real-world web image dataset created by NGN Tsinghua Laboratory students for cross-media search, which points out key features of social website images and identifies some research problems in image annotation and retrieval.
Abstract: The selection of semantic concepts for model construction and data collection is an open research question. It is highly demanding to choose good multimedia concepts with small semantic gaps to facilitate the work of cross-media system developers. Since work in this area is very scarce, this paper contributes a new real-world web image dataset created by NGN Tsinghua Laboratory students for cross-media search. Previous datasets such as Flickr30k, Wikipedia, and NUS-WIDE have high semantic gaps, which leads to inconsistency with realistic applications. To overcome these drawbacks, the proposed Facebook5k dataset includes: (1) 5130 images crawled from Facebook through users' feelings; (2) images categorized according to users' feelings; (3) independence from tags and language, with feelings used for search instead. Based on the proposed dataset, we point out key features of social website images and identify some research problems in image annotation and retrieval. The benchmark results show the effectiveness of the proposed dataset to simplify and improve general image retrieval.

Journal Article
TL;DR: A new data structure called Rank-Based Merkle AVL Tree (RB-MAT) is proposed to improve the efficiency of data dynamics by improving the query and rebalancing parts.
Abstract: Dynamic data possession verification is a common requirement in cloud storage systems. After the client outsources its data to the cloud, it needs to not only check the integrity of its data but also verify whether updates are executed correctly. Previous research has proposed various schemes based on the Merkle Hash Tree (MHT) and implemented some initial improvements to prevent tree imbalance. This paper tries to take one step further: are there still problems left to optimize? In this paper, we study how to raise the efficiency of data dynamics by improving the query and rebalancing parts, using a new data structure called Rank-Based Merkle AVL Tree (RB-MAT). Furthermore, we fill the gap of verifying multiple update operations at the same time with a novel batch updating scheme. The experimental results show that our scheme is more efficient than existing methods.
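
A toy sketch of the rank-based Merkle idea: each node stores a subtree hash for integrity and a rank (leaf count) for locating data blocks by position. The AVL rebalancing and batch-update parts of RB-MAT are omitted, and all names and structure here are illustrative rather than taken from the paper.

```python
import hashlib

class Node:
    """Rank-based Merkle node: `rank` counts the data blocks in the subtree so
    the i-th block can be located by position; `digest` authenticates the subtree."""
    def __init__(self, left=None, right=None, block=None):
        self.left, self.right = left, right
        if block is not None:                       # leaf holding one data block
            self.rank = 1
            self.digest = hashlib.sha256(block).hexdigest()
        else:                                       # internal node over two children
            self.rank = left.rank + right.rank
            self.digest = hashlib.sha256(
                (left.digest + right.digest).encode()).hexdigest()

def find_block(node, i):
    """Descend by rank to the leaf of the i-th (0-based) data block."""
    if node.left is None:
        return node
    if i < node.left.rank:
        return find_block(node.left, i)
    return find_block(node.right, i - node.left.rank)

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
leaves = [Node(block=b) for b in blocks]
root = Node(Node(leaves[0], leaves[1]), Node(leaves[2], leaves[3]))
assert find_block(root, 2).digest == hashlib.sha256(b"block-2").hexdigest()
print(root.digest)   # the value the client keeps to verify integrity
```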

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A unified neural framework is proposed to fuse heterogeneous sentiment supervision to train sentence-level sentiment classification model and validate the effectiveness of the approach on benchmark datasets.
Abstract: Sentence-level sentiment classification aims to mine fine-grained sentiment information from texts. Existing methods for this task are usually based on supervised learning and rely on massive labeled sentences for model training. However, annotating sufficient sentences is expensive and time-consuming. In this paper, we propose a neural sentence-level sentiment classification approach which can exploit heterogeneous sentiment supervision and reduce the dependence on labeled sentences. Besides the sentence-level supervision from labeled sentences, our approach can also incorporate the word-level supervision extracted from sentiment lexicons, document-level supervision extracted from labeled documents and sentiment relations between sentences extracted from unlabeled documents. A unified neural framework is proposed to fuse heterogeneous sentiment supervision to train sentence-level sentiment classification model. Experiments on benchmark datasets validate the effectiveness of our approach.
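
A minimal sketch of how a shared encoder can serve several supervision signals at once; the dimensions and head designs are assumptions, and the document-level and relation-based objectives are only indicated in a comment.

```python
import torch
import torch.nn as nn

class HeterogeneousSentimentModel(nn.Module):
    """A shared sentence encoder with separate output heads, so that word-level
    (lexicon) and sentence-level supervision both update the same representation."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.word_head = nn.Linear(2 * hidden_dim, 2)       # lexicon polarity
        self.sentence_head = nn.Linear(2 * hidden_dim, 2)   # sentence polarity

    def forward(self, token_ids):
        states, _ = self.encoder(self.embedding(token_ids))
        sent_vec = states.mean(dim=1)                        # sentence vector
        return self.word_head(states), self.sentence_head(sent_vec)

model = HeterogeneousSentimentModel()
word_logits, sent_logits = model(torch.randint(0, 10000, (2, 15)))
# Document-level and sentence-relation losses would be computed from the sentence
# vectors and combined with the two losses above in one weighted training objective.
```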

Journal Article
TL;DR: Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, a steganography scheme using a 3D-magic matrix to enlarge capacity and improve quality of speech is proposed in this article.
Abstract: Redundant information in low-bit-rate speech is extremely small, so it is very difficult to implement large-capacity steganography on low-bit-rate speech. Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, this paper proposes a steganography scheme using a 3D-Magic matrix to enlarge capacity and improve the quality of speech. A cyclically moving algorithm to construct a 3D-Magic matrix for steganography is proposed in this paper, as well as an embedding and an extracting algorithm of steganography based on the 3D-Magic matrix in the low-bit-rate speech codec. Theoretical analysis is provided to demonstrate that the concealment and the hidden capacity are greatly improved with the proposed scheme. Experimental results show the hidden capacity is raised to 200 bps in the ITU-T G.723.1 codec [1]. Moreover, the quality of steganography speech in Perceptual Evaluation of Speech Quality (PESQ) is reduced by no more than 4%, indicating little impact on the quality of speech. In addition, the proposed hiding scheme can effectively prevent detection by some steganalysis tools.

Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Zhigang Yuan1, Yongfeng Huang1 
01 Jun 2018
TL;DR: This work applies a multilayer perceptron (MLP)-convolutional neural network (CNN) model to identify whether an attribute is discriminative, which can help to improve the collective understanding of this novel task.
Abstract: Existing semantic models are capable of identifying the semantic similarity of words. However, it is hard for these models to discriminate between a word and another similar word. Thus, the aim of SemEval-2018 Task 10 is to predict whether a word is a discriminative attribute between two concepts. In this task, we apply a multilayer perceptron (MLP)-convolutional neural network (CNN) model to identify whether an attribute is discriminative. The CNNs are used to extract low-level features from the inputs. The MLP takes both the flattened CNN feature maps and the inputs to predict the labels. The evaluation F-score of our system on the test set is 0.629 (ranked 15th), which indicates that our system still needs to be improved. However, the behaviours of our system in our experiments provide useful information, which can help to improve the collective understanding of this novel task.

Book ChapterDOI
08 Jun 2018
TL;DR: The experimental results show that the decentralized web system based on Ethereum blockchain and IPFS network can provide faster access to web contents than the traditional HTTP web.
Abstract: The internet has developed greatly, but cyber crimes lack effective supervision. Furthermore, network congestion is still a common phenomenon in our daily life. This paper proposes a decentralized web system based on blockchain to solve the above problems. In this scheme, a web content publisher uploads the web contents to IPFS (InterPlanetary File System), a peer-to-peer storage network, gets the hashes of the web contents from the IPFS system, and then writes these hashes to a smart contract that has been deployed on the Ethereum blockchain; thousands of web users can read the hashes of these web contents from this smart contract and browse the corresponding web contents by hash on the decentralized IPFS network. The experimental results show that the decentralized web system based on the Ethereum blockchain and the IPFS network can provide faster access to web contents than the traditional HTTP web. Furthermore, this web system has a strong ability to withstand large-scale concurrent access, because the IPFS network is a decentralized web content storage system without a centralized web server, and it can fight against malicious tampering attacks, because the Ethereum blockchain can resist 51% attacks and records all transactions, including tampering behaviors, with timestamps.
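
A toy model of the publish/browse flow described above, with plain dictionaries standing in for IPFS and the Ethereum smart contract; a real deployment would use an IPFS client and contract calls instead, so every name here is illustrative.

```python
import hashlib

# Toy stand-ins: one dict models IPFS content-addressed storage, the other
# models the hash registry that the smart contract keeps on-chain.
ipfs_storage, contract_registry = {}, {}

def publish(page_name, html):
    """Publisher side: store content by its hash, record the hash 'on-chain'."""
    content_hash = hashlib.sha256(html.encode()).hexdigest()   # content address
    ipfs_storage[content_hash] = html
    contract_registry[page_name] = content_hash
    return content_hash

def browse(page_name):
    """Reader side: look up the hash, fetch the content, verify it was not tampered with."""
    content_hash = contract_registry[page_name]
    html = ipfs_storage[content_hash]
    assert hashlib.sha256(html.encode()).hexdigest() == content_hash  # tamper check
    return html

publish("index", "<html><body>hello decentralized web</body></html>")
print(browse("index"))
```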

Posted Content
TL;DR: A cyclically moving algorithm is proposed to construct a 3D-Magic matrix for steganography that can effectively prevent detection by some steganalysis tools, while the quality of speech in Perceptual Evaluation of Speech Quality is reduced by no more than 4%, indicating little impact on speech quality.
Abstract: Redundant information of low-bit-rate speech is extremely small, thus it's very difficult to implement large capacity steganography on the low-bit-rate speech. Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, this paper proposes a steganography scheme using a 3D-Magic matrix to enlarge capacity and improve quality of speech. A cyclically moving algorithm to construct a 3D-Magic matrix for steganography is proposed in this paper, as well as an embedding and an extracting algorithm of steganography based on the 3D-Magic matrix in low-bit-rate speech codec. Theoretical analysis is provided to demonstrate that the concealment and the hidden capacity are greatly improved with the proposed scheme. Experimental results show the hidden capacity is raised to 200bps in ITU-T G.723.1 codec. Moreover, the quality of steganography speech in Perceptual Evaluation of Speech Quality (PESQ) reduces no more than 4%, indicating a little impact on the quality of speech. In addition, the proposed hidden scheme could prevent being detected by some steganalysis tools effectively.

Book ChapterDOI
26 Aug 2018
TL;DR: A method called Neural Instance Selector (NIS) is proposed to solve the noisy label problem in distant supervision; it can effectively filter noisy data and achieve better performance than several baseline methods.
Abstract: Distant supervised relation extraction is an efficient method to find novel relational facts from very large corpora without expensive manual annotation. However, distant supervision will inevitably lead to wrong label problem, and these noisy labels will substantially hurt the performance of relation extraction. Existing methods usually use multi-instance learning and selective attention to reduce the influence of noise. However, they usually cannot fully utilize the supervision information and eliminate the effect of noise. In this paper, we propose a method called Neural Instance Selector (NIS) to solve these problems. Our approach contains three modules, a sentence encoder to encode input texts into hidden vector representations, an NIS module to filter the less informative sentences via multilayer perceptrons and logistic classification, and a selective attention module to select the important sentences. Experimental results show that our method can effectively filter noisy data and achieve better performance than several baseline methods.

Book ChapterDOI
08 Jun 2018
TL;DR: The focused crawler based on an open search engine proposed in this paper improves the recall rate and efficiency while ensuring accuracy.
Abstract: When users need to analyze webpages related to specific topics, they generally use crawlers to acquire the webpages and then analyze the results to extract those that match their interests. However, in the data acquisition stage, users usually have customized demands on the acquired data. Ordinary crawler systems are very resource-constrained, so they cannot traverse the entire internet. Meanwhile, search engines can satisfy these demands, but they rely on many manual interactions. The traditional solution is to constrain the crawlers to some limited domain, but this leads to a low recall rate as well as inefficiency. In order to solve the problems above, this paper studies a focused crawler framework based on open search engines. It takes advantage of an open search engine's information gathering and retrieval capabilities and can automatically or semi-automatically generate a topic model to interpret and complete users' search intents, with only a few seed keywords provided initially. Then it uses open search engine interfaces to iteratively crawl topic-specific webpages. Compared with traditional approaches, the focused crawler based on an open search engine proposed in this paper improves the recall rate and efficiency while ensuring accuracy.
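
A rough sketch of the iterative crawl loop described above, where `expand_topic` and the search URL template are hypothetical stand-ins for the paper's topic model and for whichever open search engine interface is used.

```python
import urllib.parse
import urllib.request

def focused_crawl(seed_keywords, expand_topic, search_url_template, rounds=3):
    """Query an open search engine with topic keywords, fetch the result pages,
    and let a topic model propose refined query terms for the next round."""
    queries, pages = list(seed_keywords), []
    for _ in range(rounds):
        next_queries = []
        for q in queries:
            url = search_url_template.format(urllib.parse.quote(q))
            with urllib.request.urlopen(url) as resp:        # result-page HTML
                html = resp.read().decode("utf-8", errors="ignore")
            pages.append(html)
            next_queries.extend(expand_topic(q, html))        # refine the topic
        queries = next_queries or queries
    return pages

# Example call (requires a real result-page URL template and a topic model):
# pages = focused_crawl(["text steganography"], my_topic_model,
#                       "https://example-search/results?q={}")
```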