
Showing papers by "Yongfeng Huang published in 2018"


Journal ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Zhigang Yuan1, Yongfeng Huang1 
TL;DR: This paper proposes a hybrid unsupervised method that combines rules and machine learning to address the ATE and OTE tasks; it uses chunk-level linguistic rules to extract nominal phrase chunks and regards them as candidate opinion targets and aspects.
Abstract: Aspect term extraction (ATE) and opinion target extraction (OTE) are two important tasks in the fine-grained sentiment analysis field. Existing approaches to ATE and OTE are mainly based on rules or machine learning methods. Rule-based methods are usually unsupervised, but they cannot make use of high-level features. Although supervised learning approaches usually outperform the rule-based ones, they need a large number of labeled samples to train their models, which are expensive and time-consuming to annotate. In this paper, we propose a hybrid unsupervised method which can combine rules and machine learning methods to address ATE and OTE tasks. First, we use chunk-level linguistic rules to extract nominal phrase chunks and regard them as candidate opinion targets and aspects. Then we propose to filter irrelevant candidates based on domain correlation. Finally, we use these texts with extracted chunks as pseudo labeled data to train a deep gated recurrent unit (GRU) network for aspect term extraction and opinion target extraction. The experiments on benchmark datasets validate the effectiveness of our approach in extracting opinion targets and aspects with minimal manual annotation.
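
For readers who want a concrete picture of the final stage, the sketch below (not the authors' code) shows a bidirectional deep GRU tagger of the kind described, trained on pseudo-labeled BIO tags produced by the chunk-level rules; the vocabulary size, dimensions, and tag set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_tags=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Two stacked bidirectional GRU layers ("deep GRU")
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)   # B / I / O tags

    def forward(self, token_ids):
        emb = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        hidden, _ = self.gru(emb)             # (batch, seq_len, 2*hidden_dim)
        return self.out(hidden)               # per-token tag logits

# Training on the rule-generated pseudo labels is ordinary token-level cross-entropy:
model = GRUTagger()
tokens = torch.randint(0, 10000, (4, 20))     # dummy pseudo-labeled batch
pseudo_tags = torch.randint(0, 3, (4, 20))
loss = nn.CrossEntropyLoss()(model(tokens).reshape(-1, 3), pseudo_tags.reshape(-1))
loss.backward()
```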

78 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed model can achieve 98.67% accuracy and 96.02% recall, which strongly supports that using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis is feasible and effective.
Abstract: Automatically extracting useful information from electronic medical records along with conducting disease diagnoses is a promising task for both clinical decision support (CDS) and natural language processing (NLP). Most of the existing systems are based on artificially constructed knowledge bases, and auxiliary diagnosis is then done by rule matching. In this study, we present a clinical intelligent decision approach based on Convolutional Neural Networks (CNN), which can automatically extract high-level semantic information from electronic medical records and then perform automatic diagnosis without artificial construction of rules or knowledge bases. We use 18,590 collected real-world clinical electronic medical records to train and test the proposed model. Experimental results show that the proposed model can achieve 98.67% accuracy and 96.02% recall, which strongly supports that using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis is feasible and effective.
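
The abstract does not give the network details, but a minimal text-CNN classifier of the kind described might look as follows; the vocabulary size, filter sizes, and number of diagnosis classes are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMRCNNClassifier(nn.Module):
    """Convolutions over the embedded record text, max-pooled, then a linear layer."""
    def __init__(self, vocab_size=20000, emb_dim=128, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_diagnoses=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_diagnoses)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        # Each kernel size captures n-gram-like local semantic features;
        # max-pooling keeps the strongest activation per filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))          # diagnosis logits

logits = EMRCNNClassifier()(torch.randint(0, 20000, (2, 300)))
```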

74 citations


Journal ArticleDOI
Zhigang Yuan1, Sixing Wu1, Fangzhao Wu2, Junxin Liu1, Yongfeng Huang1 
TL;DR: A domain attention model for multi-domain sentiment analysis based on multi-task learning that can extract the most discriminative features from a shared hidden layer in a more compact way.
Abstract: Sentiment classification is widely known as a domain-dependent problem. In order to learn an accurate domain-specific sentiment classifier, a large number of labeled samples are needed, which are expensive and time-consuming to annotate. Multi-domain sentiment analysis based on multi-task learning can leverage labeled samples in each single domain, which can alleviate the need for a large amount of labeled data in all domains. In this paper, we propose a domain attention model for multi-domain sentiment analysis. In our approach, the domain representation is used as attention to select the most domain-related features in each domain. The domain representation is obtained through an auxiliary domain classification task, which works as a domain regularizer. In this way, both shared and domain-specific features for sentiment classification are extracted simultaneously. In contrast with existing multi-domain sentiment classification methods, our approach can extract the most discriminative features from a shared hidden layer in a more compact way. Experimental results on two multi-domain sentiment datasets validate the effectiveness of our approach.
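
A minimal sketch of the core idea, assuming a learnable domain embedding is used as the attention query over a shared encoder; the paper derives the domain representation from an auxiliary domain classification task, which is reduced here to an auxiliary output head, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAttention(nn.Module):
    """Shared encoder; the domain embedding scores each hidden state so that
    the most domain-related features are emphasized before classification."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_domains=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.domain_emb = nn.Embedding(num_domains, 2 * hidden_dim)
        self.sentiment_out = nn.Linear(2 * hidden_dim, 2)        # pos / neg
        self.domain_out = nn.Linear(2 * hidden_dim, num_domains)  # auxiliary task

    def forward(self, token_ids, domain_ids):
        hidden, _ = self.encoder(self.embedding(token_ids))      # (B, T, 2H)
        query = self.domain_emb(domain_ids).unsqueeze(2)          # (B, 2H, 1)
        weights = F.softmax(torch.bmm(hidden, query), dim=1)      # attention over tokens
        doc = (hidden * weights).sum(dim=1)                       # domain-attended vector
        return self.sentiment_out(doc), self.domain_out(doc)

sent_logits, dom_logits = DomainAttention()(
    torch.randint(0, 10000, (3, 40)), torch.tensor([0, 2, 1]))
```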

70 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This model combines CNN and LSTM layers to utilize both local and long-range contextual information for identifying metaphors in plain texts at the word level.
Abstract: Metaphors are figurative language widely used in daily life and literature. Detecting the metaphors evoked by texts is an important task. Thus, the metaphor shared task aims to extract metaphors from plain texts at the word level. We propose to use a CNN-LSTM model for this task. Our model combines CNN and LSTM layers to utilize both local and long-range contextual information for identifying metaphorical information. In addition, we compare the performance of the softmax classifier and conditional random field (CRF) for sequential labeling in this task. We also incorporated some additional features such as part-of-speech (POS) tags and word clusters to improve the performance of the model. Our best model achieved a 65.06% F-score in the all-POS testing subtask and 67.15% in the verbs testing subtask.
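
As a rough illustration of the architecture (using the softmax classifier rather than the CRF variant, and leaving out the POS and word-cluster features), a CNN-LSTM word-level tagger could be sketched as follows; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMTagger(nn.Module):
    """The CNN captures local context around each word; the LSTM adds
    long-range context before per-word metaphor/literal classification."""
    def __init__(self, vocab_size=10000, emb_dim=100, conv_channels=64,
                 hidden_dim=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # padding=1 keeps the sequence length so every word stays labelable
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)   # (B, emb_dim, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)    # (B, T, conv_channels)
        hidden, _ = self.lstm(x)
        return self.out(hidden)                          # per-word logits

logits = CNNLSTMTagger()(torch.randint(0, 10000, (2, 25)))
```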

67 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: A system based on a densely connected LSTM network with multi-task learning strategy that detects the ironic tweets and their ironic types and includes several types of features to improve the model performance.
Abstract: Detecting irony is an important task for mining fine-grained information from social web messages. Therefore, SemEval-2018 Task 3 aims to detect ironic tweets (subtask A) and their ironic types (subtask B). In order to address this task, we propose a system based on a densely connected LSTM network with a multi-task learning strategy. In our dense LSTM model, each layer takes all outputs from previous layers as input. The last LSTM layer outputs the hidden representations of the texts, and they are used in three classification tasks. In addition, we incorporate several types of features to improve the model performance. Our model achieved an F-score of 70.54 (ranked 2/43) in subtask A and 49.47 (ranked 3/29) in subtask B. The experimental results validate the effectiveness of our system.
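
The dense connectivity described above (each LSTM layer consuming the outputs of all previous layers) can be sketched as follows; the layer count and dimensions are illustrative, and the task-specific heads and extra features are omitted.

```python
import torch
import torch.nn as nn

class DenseLSTMEncoder(nn.Module):
    """Each LSTM layer receives the concatenation of the embeddings and the
    outputs of all previous LSTM layers, mirroring the dense connectivity
    described above."""
    def __init__(self, emb_dim=100, hidden_dim=64, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = emb_dim
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_dim, hidden_dim, batch_first=True,
                                       bidirectional=True))
            in_dim += 2 * hidden_dim   # later layers also see this layer's output

    def forward(self, emb):
        features = [emb]
        for lstm in self.layers:
            out, _ = lstm(torch.cat(features, dim=-1))
            features.append(out)
        return features[-1]            # hidden representation of the text

# Multi-task heads (e.g., subtask A and subtask B) would share this encoder:
encoder = DenseLSTMEncoder()
text_repr = encoder(torch.randn(2, 30, 100)).mean(dim=1)   # simple pooling
```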

63 citations


Journal ArticleDOI
TL;DR: Inspired by the characteristics of Twitter100k, a method to integrate optical character recognition into cross-media retrieval is proposed, and the experimental results show that the proposed method improves the baseline performance.
Abstract: This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS Wide, and Flickr30k, have two major limitations. First, these datasets are lacking in content diversity, i.e., only some predefined classes are covered. Second, texts in these datasets are written in well-organized language, leading to inconsistency with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: it has 100,000 image–text pairs randomly crawled from Twitter, and thus, has no constraint in the image categories; and text in Twitter100k is written in informal language by the users. Since strongly supervised methods leverage the class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the correspondence AutoEncoder, along with various text features on Wikipedia, Flickr30k, and Twitter100k. As a minor contribution, we also design a deep neural network to learn cross-modal embeddings for Twitter100k. Inspired by the characteristics of Twitter100k, we propose a method to integrate optical character recognition into cross-media retrieval. The experimental results show that the proposed method improves the baseline performance.

55 citations


Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Junxin Liu1, Sixing Wu1, Yongfeng Huang1, Xing Xie2 
01 Oct 2018
TL;DR: This paper describes a neural approach with hierarchical tweet representation and multi-head self-attention (HTR-MSA) for both tasks of the third Social Media Mining for Health Applications (SMM4H) workshop, which aims to detect the tweets mentioning drug names and adverse drug reactions.
Abstract: This paper describes our system for the first and third shared tasks of the third Social Media Mining for Health Applications (SMM4H) workshop, which aims to detect the tweets mentioning drug names and adverse drug reactions. In our system we propose a neural approach with hierarchical tweet representation and multi-head self-attention (HTR-MSA) for both tasks. Our system achieved the first place in both the first and third shared tasks of SMM4H with an F-score of 91.83% and 52.20% respectively.
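
The system's code is not reproduced here; the sketch below illustrates only the word-level multi-head self-attention stage of such a hierarchical encoder, with the character-level CNN stage omitted and all dimensions assumed.

```python
import torch
import torch.nn as nn

class SelfAttentiveTweetEncoder(nn.Module):
    """Word-level part of a hierarchical encoder: multi-head self-attention
    over word vectors, then mean pooling into a tweet vector for classification."""
    def __init__(self, vocab_size=30000, emb_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        # Each head lets every word attend to every other word in the tweet.
        attended, _ = self.attn(x, x, x)
        return self.fc(attended.mean(dim=1))   # tweet-level logits

logits = SelfAttentiveTweetEncoder()(torch.randint(0, 30000, (4, 30)))
```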

31 citations


Posted Content
TL;DR: This paper proposes a steganography method which can automatically generate steganographic text based on the Markov chain model and Huffman coding and shows that the performance of the proposed model is superior to all the previous related methods in terms of information imperceptibility and information hidden capacity.
Abstract: Steganography, as one of the three basic information security systems, has long played an important role in safeguarding the privacy and confidentiality of data in cyberspace. Text is the most widely used information carrier in people's daily life, so using text as a carrier for information hiding has broad research prospects. However, due to the high degree of coding and the low information redundancy of text, hiding information in it has long been an extremely challenging problem. In this paper, we propose a steganography method which can automatically generate steganographic text based on the Markov chain model and Huffman coding. It can automatically generate a fluent text carrier according to the secret information that needs to be embedded. The proposed model can learn from a large number of samples written by people and obtain a good estimate of the statistical language model. We evaluated the proposed model from several perspectives. Experimental results show that the performance of the proposed model is superior to all the previous related methods in terms of information imperceptibility and information hiding capacity.
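
A minimal sketch of the coding step, assuming a toy conditional distribution in place of the trained Markov chain model: the candidate next words are Huffman-coded by probability, and the word whose code matches the next bits of the secret stream is emitted.

```python
import heapq

def huffman_code(candidates):
    """Build a Huffman code over candidate next words, given their conditional
    probabilities (word -> probability) from the Markov chain model."""
    codes = {w: "" for w in candidates}
    heap = [(p, i, [w]) for i, (w, p) in enumerate(candidates.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, i2, group2 = heapq.heappop(heap)
        for w in group1:
            codes[w] = "0" + codes[w]
        for w in group2:
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (p1 + p2, i2, group1 + group2))
    return codes

def embed(bits, start_word, next_word_dist, max_len=10):
    """Generate a cover sentence whose word choices encode `bits`.
    `next_word_dist(prefix)` stands in for the trained Markov chain model."""
    words, pos = [start_word], 0
    while pos < len(bits) and len(words) < max_len:
        codes = huffman_code(next_word_dist(words))
        remaining = bits[pos:]
        # Huffman codes are prefix-free, so at most one code prefixes `remaining`.
        word = next((w for w, c in codes.items() if remaining.startswith(c)), None)
        if word is None:      # tail shorter than any code: finish on a matching word
            word = next(w for w, c in codes.items() if c.startswith(remaining))
            pos = len(bits)
        else:
            pos += len(codes[word])
        words.append(word)
    return " ".join(words)

# Toy conditional distribution standing in for a model trained on a large corpus:
toy_model = lambda prefix: {"sky": 0.5, "sea": 0.3, "sun": 0.2}
print(embed("1011", "the", toy_model))   # -> "the sun sea"
```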

30 citations


Journal ArticleDOI
17 Apr 2018-Entropy
TL;DR: A modified resilient backpropagation (MRPROP) algorithm is implemented to improve the convergence and efficiency of CNN training and a tolerant band is introduced to avoid network overtraining.
Abstract: The convolutional neural network (CNN) has achieved state-of-the-art performance in many computer vision applications, e.g., classification, recognition, and detection. However, the global optimization of CNN training is still a problem. Fast classification and training play a key role in the development of the CNN. We hypothesize that the smoother and more optimized the training of a CNN is, the more efficient the end result becomes. Therefore, in this paper, we implement a modified resilient backpropagation (MRPROP) algorithm to improve the convergence and efficiency of CNN training. In particular, a tolerant band is introduced to avoid network overtraining, and it is combined with the global-best concept in the weight-updating criteria to allow the training algorithm of the CNN to optimize its weights more swiftly and precisely. For comparison, we present and analyze four different training algorithms for the CNN along with MRPROP, i.e., resilient backpropagation (RPROP), Levenberg-Marquardt (LM), conjugate gradient (CG), and gradient descent with momentum (GDM). Experimental results showcase the merit of the proposed approach on a public face and skin dataset.
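
For reference, the core resilient-backpropagation update that MRPROP modifies looks roughly like the sketch below; the tolerant band and global-best terms of the paper are not reproduced, and all constants are the usual textbook defaults rather than the paper's settings.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One resilient-backpropagation update: per-weight step sizes grow while
    the gradient keeps its sign and shrink when it flips; only the sign of
    the gradient is used for the weight update."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)   # skip the update after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step

# Toy usage: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([3.0, -2.0])
step = np.full_like(w, 0.1)
prev_grad = np.zeros_like(w)
for _ in range(20):
    grad = w
    w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
print(w)   # approaches the minimum at the origin
```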

29 citations


Book ChapterDOI
08 Jun 2018
TL;DR: Experimental results show that the proposed model can efficiently implement the embedding and extraction of information, and the generated dialogue texts are of high quality, which indicates high concealment.
Abstract: Steganography based on texts has always been a hot but extremely hard research topic. Due to the high coding characteristics of text compared to other information carriers, the redundancy of information is very low, which makes it really difficult to hide information inside. In this paper, combining a recurrent neural network (RNN) with reinforcement learning (RL), we designed and implemented a real-time interactive text steganography model (RITS). The proposed model can automatically generate semantically coherent and syntactically correct dialogues based on the input sentence, realizing secret information hiding and transmission through reasonable encoding of the text in the dialogue generation process. We trained our model using a publicly collected dataset which contains 5808 dialogues and evaluated the proposed model from several perspectives. Experimental results show that the proposed model can be very efficient in implementing the embedding and extraction of information. The generated dialogue texts are of high quality, which shows high concealment.

23 citations


Journal ArticleDOI
TL;DR: A new similarity measurement in the embedded space is introduced, which significantly improved system performance compared with the conventional Euclidean distance and demonstrated the efficiency of the proposed retrieval method on three different datasets to simplify and improve general image retrieval.
Abstract: The selection of semantic concepts for model construction and data collection remains an open research issue. It is highly demanding to choose good multimedia concepts with small semantic gaps to facilitate the work of cross-media system developers. However, very little work has been done in this area. This paper contributes a new, real-world web image dataset for cross-media retrieval called FB5K. The proposed FB5K dataset has the following attributes: 1) 5130 images crawled from Facebook; 2) images that are categorized according to users' feelings; 3) images that are independent of text and language, with feelings used for search instead. Furthermore, we propose a novel approach through the use of Optical Character Recognition and explicit incorporation of high-level semantic information. We comprehensively compute the performance of four different subspace-learning methods and three modified versions of the Correspondence Auto Encoder, alongside numerous text features and similarity measurements, comparing Wikipedia, Flickr30k, and FB5K. To check the characteristics of FB5K, we propose a semantic-based cross-media retrieval method. To accomplish cross-media retrieval, we introduced a new similarity measurement in the embedded space, which significantly improved system performance compared with the conventional Euclidean distance. Our experimental results demonstrated the efficiency of the proposed retrieval method on three different datasets to simplify and improve general image retrieval.


Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Yongfeng Huang1, Xing Xie2 
08 Oct 2018
TL;DR: A neural approach to predict multiple emojis evoked by plain tweets using convolutional neural network and a multi-label classification module that outperforms several automatic baselines as well as humans in this task.
Abstract: With the development of social media, a huge number of users are attracted by social platforms such as Twitter. Emojis are widely used by social network users when posting messages. Therefore, it is important to mine the relationships between plain texts and emojis. In this paper, we present a neural approach to predict multiple emojis evoked by plain tweets. Our model contains three modules: a character encoder to learn representations of words from original characters using a convolutional neural network (CNN), a sentence encoder to learn representations of sentences using a combination of a long short-term memory (LSTM) network and CNN, and a multi-label classification module to predict the emojis evoked by a tweet. Besides, an attention mechanism is applied at the word level to select important contexts. Our approach is self-labeling and free from expensive and time-consuming manual annotation. Experiments on real-world datasets show that our model outperforms several automatic baselines as well as humans in this task.

Posted Content
TL;DR: Experimental results show that the proposed text steganalysis method (TS-CNN) can achieve nearly 100% precision and recall, outperforming all previous methods, and can even estimate the capacity of the hidden information inside.
Abstract: Steganalysis has been an important research topic in cybersecurity that helps to identify covert attacks in public networks. With the rapid development of natural language processing technology in the past two years, coverless steganography has been greatly developed. Previous text steganalysis methods have shown unsatisfactory results on this new steganography technique, which remains an unsolved challenge. Different from all previous text steganalysis methods, in this paper we propose a text steganalysis method (TS-CNN) based on semantic analysis, which uses a convolutional neural network (CNN) to extract high-level semantic features of texts and finds the subtle distribution differences in the semantic space before and after embedding the secret information. To train and test the proposed model, we collected and released a large text steganalysis (CT-Steg) dataset, which contains a total of 216,000 texts with various lengths and various embedding rates. Experimental results show that the proposed model can achieve nearly 100% precision and recall, outperforming all previous methods. Furthermore, the proposed model can even estimate the capacity of the hidden information inside. These results strongly support that using the subtle changes in the semantic space before and after embedding the secret information to conduct text steganalysis is feasible and effective.

Posted Content
TL;DR: An automatic audio generation-based steganography (AAG-Stega), which can automatically generate high-quality audio covers on the basis of the secret bits stream that needs to be embedded, and can guarantee high hidden capacity and concealment at the same time.
Abstract: Steganography, as one of the three basic information security systems, has long played an important role in safeguarding the privacy and confidentiality of data in cyberspace. Audio is one of the most common means of information transmission in our daily life. Thus it is of great practical significance to use audio as a carrier of information hiding. At present, almost all audio-based information hiding methods are based on the carrier modification mode. However, this mode is equivalent to adding noise to the original signal, resulting in a difference in the statistical feature distribution of the carrier before and after steganography, which impairs the concealment of the entire system. In this paper, we propose an automatic audio generation-based steganography (AAG-Stega), which can automatically generate high-quality audio covers on the basis of the secret bit stream that needs to be embedded. In the automatic audio generation process, we reasonably encode the conditional probability distribution space of each sampling point and select the corresponding signal output according to the bitstream to realize the secret information embedding. We designed several experiments to test the proposed model from the perspectives of information imperceptibility and information hiding capacity. The experimental results show that the proposed model can guarantee high hidden capacity and concealment at the same time.
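
As a loose illustration of bit-driven generation (the paper's actual coding of the conditional probability space may differ), a fixed-length binning of the most probable sample values could look like this; the distribution below is a random stand-in for the generator's output.

```python
import numpy as np

def embed_bits_in_sample(probs, bitstream, pos, bits_per_sample=2):
    """Fixed-length coding sketch: keep the 2^k most likely candidate sample
    values and let the next k secret bits choose which one is emitted."""
    k = bits_per_sample
    top = np.argsort(probs)[::-1][:2 ** k]            # candidate sample indices
    chunk = bitstream[pos:pos + k].ljust(k, "0")       # pad the final chunk
    return int(top[int(chunk, 2)]), pos + k            # chosen sample, new position

# Toy generator distribution over 8 quantized sample values:
rng = np.random.default_rng(0)
bits, pos, samples = "101101", 0, []
while pos < len(bits):
    probs = rng.dirichlet(np.ones(8))                  # stands in for the model output
    sample, pos = embed_bits_in_sample(probs, bits, pos)
    samples.append(sample)
print(samples)
```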

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This work proposes a system based on an attention CNN-LSTM model, in which LSTM is used to extract long-term contextual information from texts and attention techniques are applied to select this information.
Abstract: Traditional sentiment analysis approaches mainly focus on classifying the sentiment polarities or emotion categories of texts. However, they cannot exploit sentiment intensity information. Therefore, SemEval-2018 Task 1 aims to automatically determine the intensity of emotions or sentiment of tweets to mine fine-grained sentiment information. In order to address this task, we propose a system based on an attention CNN-LSTM model. In our model, LSTM is used to extract the long-term contextual information from texts. We apply attention techniques to select this information. A CNN layer with different kernel sizes is used to extract local features. The dense layers take the pooled CNN feature maps and predict the intensity scores. Our system reaches an average Pearson correlation score of 0.722 (ranked 12/48) in the emotion intensity regression task and 0.810 (ranked 15/38) in the valence regression task, which indicates that our system can be further extended.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A residual CNN-LSTM with attention (RCLA) model is proposed that combines CNN and LSTM layers to capture both local and long-range contextual information for tweet representation and incorporated additional features such as POS tags and sentiment features extracted from lexicons.
Abstract: Emojis are widely used by social media and social network users when posting their messages. It is important to study the relationships between messages and emojis. Thus, SemEval-2018 Task 2 proposes an interesting and challenging task: predicting which emojis are evoked by text-based tweets. We propose a residual CNN-LSTM with attention (RCLA) model for this task. Our model combines CNN and LSTM layers to capture both local and long-range contextual information for tweet representation. In addition, an attention mechanism is used to select important components. Besides, residual connections are applied to the CNN layers to facilitate the training of neural networks. We also incorporated additional features such as POS tags and sentiment features extracted from lexicons. Our model achieved a 30.25% macro-averaged F-score in the first subtask (i.e., emoji prediction in English), ranking 7th out of 48 participants.

Posted Content
Junxin Liu1, Fangzhao Wu2, Chuhan Wu1, Yongfeng Huang1, Xing Xie2 
TL;DR: The authors propose two methods to exploit dictionary information for Chinese word segmentation, one based on pseudo labeled data generation and the other based on multi-task learning, which can effectively improve the performance of CWS.
Abstract: Chinese word segmentation (CWS) is an important task for Chinese NLP. Recently, many neural network based methods have been proposed for CWS. However, these methods require a large number of labeled sentences for model training, and usually cannot utilize the useful information in Chinese dictionary. In this paper, we propose two methods to exploit the dictionary information for CWS. The first one is based on pseudo labeled data generation, and the second one is based on multi-task learning. The experimental results on two benchmark datasets validate that our approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.

Book ChapterDOI
Junxin Liu1, Fangzhao Wu2, Chuhan Wu1, Yongfeng Huang1, Xing Xie2 
26 Aug 2018
TL;DR: The experimental results on two benchmark datasets validate that the proposed approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.
Abstract: Chinese word segmentation (CWS) is an important task for Chinese NLP. Recently, many neural network based methods have been proposed for CWS. However, these methods require a large number of labeled sentences for model training, and usually cannot utilize the useful information in Chinese dictionary. In this paper, we propose two methods to exploit the dictionary information for CWS. The first one is based on pseudo labeled data generation, and the second one is based on multi-task learning. The experimental results on two benchmark datasets validate that our approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.
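
One plausible instantiation of the pseudo-labeled data generation step is forward maximum matching against the dictionary to produce BMES tags for unlabeled sentences; the paper's exact procedure may differ, and the dictionary below is only a toy example.

```python
def pseudo_label(sentence, dictionary, max_word_len=4):
    """Generate pseudo BMES segmentation labels for an unlabeled sentence by
    forward maximum matching against a word dictionary."""
    tags, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary word starting at position i.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in dictionary:
                break
        if length == 1:
            tags.append("S")                            # single-character word
        else:
            tags.extend(["B"] + ["M"] * (length - 2) + ["E"])
        i += length
    return list(zip(sentence, tags))

dictionary = {"北京", "天安门", "广场"}
print(pseudo_label("我爱北京天安门广场", dictionary))
# [('我','S'), ('爱','S'), ('北','B'), ('京','E'),
#  ('天','B'), ('安','M'), ('门','E'), ('广','B'), ('场','E')]
```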

Book ChapterDOI
08 Jun 2018
TL;DR: A new real-world web image dataset created by NGN Tsinghua Laboratory students for cross-media search, which points out key features of social website images and identifies some research problems in image annotation and retrieval.
Abstract: The selection of semantic concepts for model construction and data collection is an open research question. It is highly demanding to choose good multimedia concepts with small semantic gaps to facilitate the work of cross-media system developers. Since work in this area is very scarce, this paper contributes a new real-world web image dataset created by NGN Tsinghua Laboratory students for cross-media search. Previous datasets such as Flickr30k, Wikipedia, and NUS-WIDE have high semantic gaps, which leads to inconsistency with realistic applications. To overcome these drawbacks, the proposed Facebook5k dataset includes: (1) 5130 images crawled from Facebook through users' feelings; (2) images categorized according to users' feelings; (3) independence from tags and language, with feelings used for search instead. Based on the proposed dataset, we point out key features of social website images and identify some research problems in image annotation and retrieval. The benchmark results show the effectiveness of the proposed dataset to simplify and improve general image retrieval.

Journal Article
TL;DR: A new data structure called Rank-Based Merkle AVL Tree (RB-MAT) is proposed to improve the efficiency of data dynamics by improving the query and rebalancing parts.
Abstract: Dynamic data possession verification is a common requirement in cloud storage systems. After the client outsources its data to the cloud, it needs to not only check the integrity of its data but also verify whether updates are executed correctly. Previous research has proposed various schemes based on the Merkle Hash Tree (MHT) and implemented some initial improvements to prevent tree imbalance. This paper tries to take one step further: are there still problems left to optimize? In this paper, we study how to raise the efficiency of data dynamics by improving the query and rebalancing parts, using a new data structure called Rank-Based Merkle AVL Tree (RB-MAT). Furthermore, we fill the gap of verifying multiple update operations at the same time with a novel batch updating scheme. The experimental results show that our scheme is more efficient than existing methods.
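
A toy sketch of the rank-based Merkle idea: each node stores a subtree hash for integrity and a rank (leaf count) for locating data blocks by position. The AVL rebalancing and batch-update parts of RB-MAT are omitted, and all names and structure here are illustrative rather than taken from the paper.

```python
import hashlib

class Node:
    """Rank-based Merkle node: `rank` counts the data blocks in the subtree so
    the i-th block can be located by position; `digest` authenticates the subtree."""
    def __init__(self, left=None, right=None, block=None):
        self.left, self.right = left, right
        if block is not None:                       # leaf holding one data block
            self.rank = 1
            self.digest = hashlib.sha256(block).hexdigest()
        else:                                       # internal node over two children
            self.rank = left.rank + right.rank
            self.digest = hashlib.sha256(
                (left.digest + right.digest).encode()).hexdigest()

def find_block(node, i):
    """Descend by rank to the leaf of the i-th (0-based) data block."""
    if node.left is None:
        return node
    if i < node.left.rank:
        return find_block(node.left, i)
    return find_block(node.right, i - node.left.rank)

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
leaves = [Node(block=b) for b in blocks]
root = Node(Node(leaves[0], leaves[1]), Node(leaves[2], leaves[3]))
assert find_block(root, 2).digest == hashlib.sha256(b"block-2").hexdigest()
print(root.digest)   # the value the client keeps to verify integrity
```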

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A unified neural framework is proposed to fuse heterogeneous sentiment supervision to train sentence-level sentiment classification model and validate the effectiveness of the approach on benchmark datasets.
Abstract: Sentence-level sentiment classification aims to mine fine-grained sentiment information from texts. Existing methods for this task are usually based on supervised learning and rely on massive labeled sentences for model training. However, annotating sufficient sentences is expensive and time-consuming. In this paper, we propose a neural sentence-level sentiment classification approach which can exploit heterogeneous sentiment supervision and reduce the dependence on labeled sentences. Besides the sentence-level supervision from labeled sentences, our approach can also incorporate the word-level supervision extracted from sentiment lexicons, document-level supervision extracted from labeled documents and sentiment relations between sentences extracted from unlabeled documents. A unified neural framework is proposed to fuse heterogeneous sentiment supervision to train sentence-level sentiment classification model. Experiments on benchmark datasets validate the effectiveness of our approach.
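
A minimal sketch of how a shared encoder can serve several supervision signals at once; the dimensions and head designs are assumptions, and the document-level and relation-based objectives are only indicated in a comment.

```python
import torch
import torch.nn as nn

class HeterogeneousSentimentModel(nn.Module):
    """A shared sentence encoder with separate output heads, so that word-level
    (lexicon) and sentence-level supervision both update the same representation."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.word_head = nn.Linear(2 * hidden_dim, 2)       # lexicon polarity
        self.sentence_head = nn.Linear(2 * hidden_dim, 2)   # sentence polarity

    def forward(self, token_ids):
        states, _ = self.encoder(self.embedding(token_ids))
        sent_vec = states.mean(dim=1)                        # sentence vector
        return self.word_head(states), self.sentence_head(sent_vec)

model = HeterogeneousSentimentModel()
word_logits, sent_logits = model(torch.randint(0, 10000, (2, 15)))
# Document-level and sentence-relation losses would be computed from the sentence
# vectors and combined with the two losses above in one weighted training objective.
```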

Journal Article
TL;DR: Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, a steganography scheme using a 3D-magic matrix to enlarge capacity and improve quality of speech is proposed in this article.
Abstract: Redundant information in low-bit-rate speech is extremely small, so it is very difficult to implement large-capacity steganography on low-bit-rate speech. Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, this paper proposes a steganography scheme using a 3D-Magic matrix to enlarge capacity and improve the quality of speech. A cyclically moving algorithm to construct a 3D-Magic matrix for steganography is proposed in this paper, as well as an embedding and an extracting algorithm of steganography based on the 3D-Magic matrix in the low-bit-rate speech codec. Theoretical analysis is provided to demonstrate that the concealment and the hidden capacity are greatly improved with the proposed scheme. Experimental results show the hidden capacity is raised to 200 bps in the ITU-T G.723.1 codec [1]. Moreover, the quality of steganography speech in Perceptual Evaluation of Speech Quality (PESQ) is reduced by no more than 4%, indicating little impact on the quality of speech. In addition, the proposed hiding scheme can effectively prevent detection by some steganalysis tools.

Proceedings ArticleDOI
Chuhan Wu1, Fangzhao Wu2, Sixing Wu1, Zhigang Yuan1, Yongfeng Huang1 
01 Jun 2018
TL;DR: This work applies a multilayer perceptron (MLP)-convolutional neural network (CNN) model to identify whether an attribute is discriminative, which can help to improve the collective understanding of this novel task.
Abstract: Existing semantic models are capable of identifying the semantic similarity of words. However, it is hard for these models to discriminate between a word and another similar word. Thus, the aim of SemEval-2018 Task 10 is to predict whether a word is a discriminative attribute between two concepts. In this task, we apply a multilayer perceptron (MLP)-convolutional neural network (CNN) model to identify whether an attribute is discriminative. The CNNs are used to extract low-level features from the inputs. The MLP takes both the flattened CNN feature maps and the inputs to predict the labels. The evaluation F-score of our system on the test set is 0.629 (ranked 15th), which indicates that our system still needs to be improved. However, the behaviours of our system in our experiments provide useful information, which can help to improve the collective understanding of this novel task.

Book ChapterDOI
08 Jun 2018
TL;DR: The experimental results show that the decentralized web system based on Ethereum blockchain and IPFS network can provide faster access to web contents than the traditional HTTP web.
Abstract: The internet has developed greatly, but cyber crimes lack effective supervision. Furthermore, network congestion is still a common phenomenon in our daily life. This paper proposes a decentralized web system based on blockchain to solve the above problems. In this scheme, a web content publisher uploads the web contents to IPFS (InterPlanetary File System), a peer-to-peer storage network, gets the hashes of the web contents from the IPFS system, and then writes these hashes to a smart contract that has been deployed on the Ethereum blockchain; thousands of web users can read the hashes of these web contents from this smart contract and browse the corresponding web contents by hash on the decentralized IPFS network. The experimental results show that the decentralized web system based on the Ethereum blockchain and the IPFS network can provide faster access to web contents than the traditional HTTP web. Furthermore, this web system has a strong ability to withstand large-scale concurrent access, because the IPFS network is a decentralized web content storage system without a centralized web server, and it can fight against malicious tampering attacks, because the Ethereum blockchain can resist 51% attacks and records all transactions, including tampering behaviors, with timestamps.
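
A toy model of the publish/browse flow described above, with plain dictionaries standing in for IPFS and the Ethereum smart contract; a real deployment would use an IPFS client and contract calls instead, so every name here is illustrative.

```python
import hashlib

# Toy stand-ins: one dict models IPFS content-addressed storage, the other
# models the hash registry that the smart contract keeps on-chain.
ipfs_storage, contract_registry = {}, {}

def publish(page_name, html):
    """Publisher side: store content by its hash, record the hash 'on-chain'."""
    content_hash = hashlib.sha256(html.encode()).hexdigest()   # content address
    ipfs_storage[content_hash] = html
    contract_registry[page_name] = content_hash
    return content_hash

def browse(page_name):
    """Reader side: look up the hash, fetch the content, verify it was not tampered with."""
    content_hash = contract_registry[page_name]
    html = ipfs_storage[content_hash]
    assert hashlib.sha256(html.encode()).hexdigest() == content_hash  # tamper check
    return html

publish("index", "<html><body>hello decentralized web</body></html>")
print(browse("index"))
```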

Posted Content
TL;DR: A cyclically moving algorithm is proposed to construct a 3D-Magic matrix for steganography that can effectively prevent detection by some steganalysis tools, while the quality of speech in Perceptual Evaluation of Speech Quality is reduced by no more than 4%, indicating little impact on speech quality.
Abstract: Redundant information of low-bit-rate speech is extremely small, thus it's very difficult to implement large capacity steganography on the low-bit-rate speech. Based on multiple vector quantization characteristics of the Line Spectrum Pair (LSP) of the speech codec, this paper proposes a steganography scheme using a 3D-Magic matrix to enlarge capacity and improve quality of speech. A cyclically moving algorithm to construct a 3D-Magic matrix for steganography is proposed in this paper, as well as an embedding and an extracting algorithm of steganography based on the 3D-Magic matrix in low-bit-rate speech codec. Theoretical analysis is provided to demonstrate that the concealment and the hidden capacity are greatly improved with the proposed scheme. Experimental results show the hidden capacity is raised to 200bps in ITU-T G.723.1 codec. Moreover, the quality of steganography speech in Perceptual Evaluation of Speech Quality (PESQ) reduces no more than 4%, indicating a little impact on the quality of speech. In addition, the proposed hidden scheme could prevent being detected by some steganalysis tools effectively.

Book ChapterDOI
26 Aug 2018
TL;DR: A method called Neural Instance Selector (NIS) is proposed to solve the noisy label problem in distant supervision; it can effectively filter noisy data and achieve better performance than several baseline methods.
Abstract: Distant supervised relation extraction is an efficient method to find novel relational facts from very large corpora without expensive manual annotation. However, distant supervision will inevitably lead to wrong label problem, and these noisy labels will substantially hurt the performance of relation extraction. Existing methods usually use multi-instance learning and selective attention to reduce the influence of noise. However, they usually cannot fully utilize the supervision information and eliminate the effect of noise. In this paper, we propose a method called Neural Instance Selector (NIS) to solve these problems. Our approach contains three modules, a sentence encoder to encode input texts into hidden vector representations, an NIS module to filter the less informative sentences via multilayer perceptrons and logistic classification, and a selective attention module to select the important sentences. Experimental results show that our method can effectively filter noisy data and achieve better performance than several baseline methods.

Book ChapterDOI
08 Jun 2018
TL;DR: The focused crawler based on an open search engine proposed in this paper improves the recall rate and efficiency while ensuring accuracy.
Abstract: When users need to analyze webpages related to specific topics, they generally use crawlers to acquire the webpages and then analyze the results to extract those that match their interests. However, in the data acquisition stage, users usually have customized demands on the acquired data. Ordinary crawler systems are very resource-constrained, so they cannot traverse the entire internet. Meanwhile, search engines can satisfy these demands, but they rely on many manual interactions. The traditional solution is to constrain the crawlers to some limited domain, but this leads to a low recall rate as well as inefficiency. In order to solve the problems above, this paper studies a focused crawler framework based on open search engines. It takes advantage of an open search engine's information gathering and retrieval capabilities and can automatically or semi-automatically generate a topic model to interpret and complete users' search intents, with only a few seed keywords provided initially. Then it uses open search engine interfaces to iteratively crawl topic-specific webpages. Compared with traditional approaches, the focused crawler based on an open search engine proposed in this paper improves the recall rate and efficiency while ensuring accuracy.
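
A rough sketch of the iterative crawl loop described above, where `expand_topic` and the search URL template are hypothetical stand-ins for the paper's topic model and for whichever open search engine interface is used.

```python
import urllib.parse
import urllib.request

def focused_crawl(seed_keywords, expand_topic, search_url_template, rounds=3):
    """Query an open search engine with topic keywords, fetch the result pages,
    and let a topic model propose refined query terms for the next round."""
    queries, pages = list(seed_keywords), []
    for _ in range(rounds):
        next_queries = []
        for q in queries:
            url = search_url_template.format(urllib.parse.quote(q))
            with urllib.request.urlopen(url) as resp:        # result-page HTML
                html = resp.read().decode("utf-8", errors="ignore")
            pages.append(html)
            next_queries.extend(expand_topic(q, html))        # refine the topic
        queries = next_queries or queries
    return pages

# Example call (requires a real result-page URL template and a topic model):
# pages = focused_crawl(["text steganography"], my_topic_model,
#                       "https://example-search/results?q={}")
```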