Author

Wenye Li

Bio: Wenye Li is an academic researcher. The author has contributed to research in the topics of Probabilistic Latent Semantic Analysis & Scalability. The author has an h-index of 1 and has co-authored 1 publication receiving 7 citations.

Papers
Proceedings ArticleDOI
01 Aug 2013
TL;DR: The empirical results show that when the training dataset is large, learning the probability distributions of the PLSA model in parallel achieves almost linear speedups, and thus provides a practical solution for large-scale data analysis applications.
Abstract: Probabilistic Latent Semantic Analysis (PLSA) is a powerful statistical technique for analyzing relations in co-occurrence data, and it is widely used in automated information processing tasks. However, it involves non-trivial computation and is often difficult and time-consuming to train when the dataset is large. MapReduce is a computing framework designed by Google to provide a distributed solution to practically large-scale data analysis tasks using clusters of computers. In this work, we address the scalability problem of PLSA by proposing and implementing a parallel method for training PLSA under the MapReduce computing framework. The empirical results show that when the training dataset is large, learning the probability distributions of the PLSA model in parallel achieves almost linear speedups, and thus provides a practical solution for large-scale data analysis applications.
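
For orientation, here is a minimal single-machine sketch of the EM updates that such a parallel trainer distributes; the function name, tensor shapes, and the mapper/reducer split indicated in the comments are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Minimal single-machine sketch of PLSA's EM updates. In a MapReduce
# setting (assumption: the paper's exact partitioning may differ), each
# mapper would run the E-step on its shard of documents and emit partial
# expected counts, and reducers would sum them into the M-step updates.

def plsa_em(n_dw, n_topics, n_iters=50, seed=0):
    """n_dw: (D, W) document-word co-occurrence count matrix."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w), proportional to P(z|d)P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, Z, W)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts (the "reduce" side).
        exp_counts = n_dw[:, None, :] * joint              # (D, Z, W)
        p_w_z = exp_counts.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = exp_counts.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z
```

The per-document independence of the E-step is what makes the near-linear speedup plausible: shards can be processed without communication until the reduce phase.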

8 citations


Cited by
Journal ArticleDOI
TL;DR: This work proposes a novel unsupervised shilling attack detection model based on an analysis of user rating behavior. The model measures the diversity and memory of users' interest preferences by entropy and block entropy, respectively, and analyzes the memory of users' rating preferences by a self-correlation analysis.
Abstract: Existing unsupervised methods for detecting shilling attacks are mostly based on the rating patterns of users and ignore the differences in rating behavior between genuine users and attack users; as a result, they suffer from low accuracy in detecting various shilling attacks without a priori knowledge of the attacks. To address these limitations, we propose a novel unsupervised shilling attack detection model based on an analysis of user rating behavior. First, we identify the target item(s) and the corresponding intentions of the attack users by analyzing the deviation of rating tendencies on each item, and based on this analysis we construct a set of suspicious users. Second, we analyze the users' rating behavior from the perspectives of interest preference and rating preference. In particular, we measure the diversity and memory of users' interest preferences by entropy and block entropy, respectively, and we analyze the memory of users' rating preferences by a self-correlation analysis. Finally, we calculate a suspicion degree and spot attack users within the set of suspicious users based on these measurements of user rating behavior. Experimental results on the Netflix dataset, the MovieLens 1M dataset, and a sampled Amazon review dataset demonstrate the effectiveness of the proposed detection model.
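
As a rough illustration of the entropy and block-entropy measures described above, here is a small sketch; the category labels, block length k, and function names are hypothetical choices, not the paper's exact formulation.

```python
import numpy as np
from collections import Counter

def entropy(seq):
    """Shannon entropy of a user's item-category (interest) sequence."""
    counts = np.array(list(Counter(seq).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def block_entropy(seq, k=2):
    """Entropy over length-k blocks; low values indicate repetitive,
    memory-heavy behavior in the sequence."""
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return entropy(blocks)

# Example: an attack user who rates the same category repeatedly has low
# entropy; a genuine user's interests are more diverse.
genuine = ['drama', 'sci-fi', 'comedy', 'drama', 'action', 'horror']
attacker = ['action'] * 6
print(entropy(genuine), entropy(attacker))              # high vs. 0.0
print(block_entropy(genuine), block_entropy(attacker))  # high vs. 0.0
```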

33 citations

Journal ArticleDOI
TL;DR: An integrated system such as ARIANA could assist the human expert in exploratory literature search by bringing forward hidden associations, promoting data reuse and knowledge discovery, and stimulating interdisciplinary projects by connecting information across disciplines.
Abstract: Data overload has created a new set of challenges in finding meaningful and relevant information with minimal cognitive effort, yet designing robust and scalable knowledge discovery systems remains a challenge. Recent innovations in biological literature mining tools have opened new avenues for understanding the confluence of various diseases, genes, risk factors, and biological processes, bridging the gap between massive amounts of scientific data and the harvesting of useful knowledge. In this paper, we highlight findings obtained with a text analytics tool called ARIANA (Adaptive Robust and Integrative Analysis for finding Novel Associations). An empirical study using ARIANA reveals knowledge discovery instances that illustrate the efficacy of such a tool. For example, ARIANA can capture the connection between the drug hexamethonium and pulmonary inflammation and fibrosis, which caused the tragic death of a healthy volunteer in a 2001 Johns Hopkins asthma study, even though the abstract of that study was not part of the semantic model. An integrated system such as ARIANA could assist the human expert in exploratory literature search by bringing forward hidden associations, promoting data reuse and knowledge discovery, and stimulating interdisciplinary projects by connecting information across disciplines.

12 citations

Journal ArticleDOI
TL;DR: The experimental results show that the parallel versions of DEpLSA and the traditional pLSA approach can provide accurate HU results fast enough for practical use, accelerating the corresponding serial versions by at least 30x on the GTX 1080 and up to 147x on the Tesla P100 GPU; these significant acceleration factors increase with the image size, enabling fast processing of massive HS data repositories.
Abstract: Hyperspectral unmixing (HU) is an important task for the exploitation of remotely sensed hyperspectral (HS) data. It comprises the identification of pure spectral signatures (endmembers) and their corresponding fractional abundances in each pixel of the HS data cube. Several methods have been developed for (semi-)supervised and automatic identification of endmembers and abundances. Recently, the statistical dual-depth sparse probabilistic latent semantic analysis (DEpLSA) method was developed to tackle the HU problem as a latent topic-based approach in which both endmembers and abundances can be estimated simultaneously according to the semantics encapsulated by the latent topic space. However, statistical models usually lead to computationally demanding algorithms, and the computational time of DEpLSA is often too high for practical use, in particular when the dimensionality of the HS data cube is large. To mitigate this limitation, this article resorts to graphics processing units (GPUs) to provide a new parallel version of DEpLSA, developed using the NVIDIA Compute Unified Device Architecture (CUDA). Our experimental results, conducted using four well-known HS datasets and two different GPU architectures (GTX 1080 and Tesla P100), show that our parallel versions of DEpLSA and the traditional pLSA approach can provide accurate HU results fast enough for practical use, accelerating the corresponding serial versions by at least 30x on the GTX 1080 and up to 147x on the Tesla P100 GPU. These acceleration factors increase with the image size, allowing for the fast processing of massive HS data repositories.
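
The paper's implementation is written directly in CUDA; purely to illustrate why this workload parallelizes well on a GPU, the sketch below expresses the dominant pLSA E-step as a dense tensor operation using CuPy (my own choice of tool, not the authors' toolchain). The array sizes are hypothetical.

```python
# Illustrative only: DEpLSA itself is implemented in CUDA. This CuPy
# sketch shows how the dominant pLSA E-step (an outer product over
# topics plus a normalization) maps naturally onto GPU hardware.
import cupy as cp

def plsa_e_step_gpu(p_z_d, p_w_z):
    """p_z_d: (pixels, topics), p_w_z: (topics, bands), both on the GPU."""
    joint = cp.einsum('dz,zw->dzw', p_z_d, p_w_z)    # P(z|d) * P(w|z)
    joint /= joint.sum(axis=1, keepdims=True) + 1e-12
    return joint                                      # P(z|d,w)

# Usage sketch with hypothetical sizes for a small HS cube:
pixels, topics, bands = 10000, 16, 224
p_z_d = cp.random.rand(pixels, topics); p_z_d /= p_z_d.sum(1, keepdims=True)
p_w_z = cp.random.rand(topics, bands); p_w_z /= p_w_z.sum(1, keepdims=True)
resp = plsa_e_step_gpu(p_z_d, p_w_z)
```

Because every pixel's responsibilities are computed independently, the work scales with image size, which is consistent with the reported observation that the acceleration factors grow as the image gets larger.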

9 citations

Journal ArticleDOI
TL;DR: Simulation results demonstrate that the proposed distributed algorithm outperforms traditional distributed algorithms in terms of both the amount of data to be processed at the central server and the data processing time.
Abstract: This paper introduces an efficient distributed data analysis framework for big data that comprises data processing at both the data-collecting nodes and the central server, as opposed to the existing framework, which processes data only at the central server. Because data are processed at the collecting end in the proposed framework, the amount of data to be processed at the server side by commodity computers is reduced. The proposed distributed algorithm works both on low-powered nodes, such as sensors, and on high-speed commodity computers, and it performs sequential or parallel processing depending on the amount of data received at the central server. Simulation results demonstrate that the proposed distributed algorithm outperforms traditional distributed algorithms in terms of both the amount of data to be processed at the central server and the data processing time.
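
A minimal sketch of the framework's key idea follows: each collecting node pre-processes its raw readings into a compact summary, and the central server only merges summaries. The mean/max aggregation shown is an illustrative assumption, not the paper's specific algorithm.

```python
# Sketch: push part of the processing to the data-collecting nodes so the
# central server receives reduced summaries instead of raw streams.

def node_side_process(raw_readings):
    """Run at each low-powered collecting node: emit a compact summary."""
    return {'n': len(raw_readings),
            's': sum(raw_readings),
            'mx': max(raw_readings)}

def server_side_merge(summaries):
    """Run at the central server: combine summaries from all nodes."""
    n = sum(s['n'] for s in summaries)
    total = sum(s['s'] for s in summaries)
    return {'mean': total / n, 'max': max(s['mx'] for s in summaries)}

# Each node sends 3 numbers instead of its whole reading stream.
nodes = [node_side_process([21.5, 22.1, 21.8]),
         node_side_process([23.0, 22.7])]
print(server_side_merge(nodes))  # mean ~22.22, max 23.0
```

The design choice mirrors the paper's claim: the server-side load shrinks because only associative summaries cross the network, and the server can choose sequential or parallel merging based on how many summaries arrive.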

9 citations

Book ChapterDOI
01 Jan 2016
TL;DR: A novel approach for semantic analysis of web-based big data using rule-based ontology mapping: a fuzzy rule-based resource representation handles social data expressed in natural-language terms, and a refined semantic relation reasoning mining step is applied to obtain an overall knowledge representation.
Abstract: A huge amount of data is generated every day through transactions in industry, medicine, social networking, communication systems, and other domains. This data is mainly unstructured in nature, and transforming such large heterogeneous datasets into useful information is of great value to society. This unstructured information should be presented and made available in a meaningful and effective way so that semantic knowledge can be obtained and machines can interpret it. In this paper, we introduce a novel approach for semantic analysis of web-based big data using rule-based ontology mapping. To handle social data expressed in natural-language terms, we propose a fuzzy rule-based resource representation. A refined semantic relation reasoning mining step is then applied to obtain an overall knowledge representation. Finally, the semantic equivalent of the unstructured data is stored in a structured database using a Web Ontology Language (OWL) based ontology system.
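
As a loose sketch of the pipeline the chapter describes (fuzzy resource representation, rule-based mapping, OWL-style storage), the snippet below uses a hypothetical membership table, rule, and vocabulary; none of it is the chapter's actual ontology.

```python
# Hypothetical sketch: a fuzzy membership grades a natural-language
# opinion term, a rule maps it to a semantic relation, and the result is
# emitted as an OWL/RDF-style triple for a structured store.

def fuzzy_membership(term):
    """Map a natural-language opinion term to a degree in [0, 1]."""
    table = {'terrible': 0.05, 'poor': 0.25, 'okay': 0.5,
             'good': 0.75, 'excellent': 0.95}
    return table.get(term.lower(), 0.5)

def map_to_relation(subject, term, threshold=0.6):
    """Rule-based mapping: high membership -> 'likes', low -> 'dislikes'.
    The 'ex:' vocabulary is a placeholder namespace, not the chapter's."""
    mu = fuzzy_membership(term)
    predicate = 'ex:likes' if mu >= threshold else 'ex:dislikes'
    return (f'ex:{subject}', predicate, f'"{mu:.2f}"^^xsd:float')

# An unstructured post like "alice says the camera is excellent" becomes:
print(map_to_relation('alice_camera', 'excellent'))
# ('ex:alice_camera', 'ex:likes', '"0.95"^^xsd:float')
```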

4 citations