
Showing papers in "Journal of Computer Science and Technology in 2021"


Journal ArticleDOI
TL;DR: In this paper, the authors conducted a systematic literature review of previous studies of serendipity-oriented recommender systems, focusing on the contextual convergence of serendipity definitions, datasets, and their evaluation techniques.
Abstract: A recommender system is employed to accurately recommend items, which are expected to attract the user's attention. Over-emphasis on the accuracy of the recommendations can cause information over-specialization and make recommendations boring and even predictable. Novelty and diversity are two partly useful solutions to these problems. However, novel and diverse recommendations cannot merely ensure that users are attracted, since such recommendations may not be relevant to the user's interests. Hence, it is necessary to consider other criteria, such as unexpectedness and relevance. Serendipity is a criterion for making appealing and useful recommendations. The usefulness of serendipitous recommendations is the main superiority of this criterion over novelty and diversity. A growing body of studies of recommender systems has focused on serendipity in recent years. Thus, a systematic literature review is conducted in this paper on previous studies of serendipity-oriented recommender systems. Accordingly, this paper focuses on the contextual convergence of serendipity definitions, datasets, serendipitous recommendation methods, and their evaluation techniques. Finally, the trends and existing potentials of serendipity-oriented recommender systems are discussed for future studies. The results of the systematic literature review show that both the quality and the quantity of articles on serendipity-oriented recommender systems are increasing.

28 citations


Journal ArticleDOI
TL;DR: The landscape of emerging NVMM technologies is revisited, and the state-of-the-art studies ofNVMM technologies are surveyed, as well as the recent work of hybrid memory system designs from the dimensions of architectures, systems, and applications.
Abstract: Non-Volatile Main Memories (NVMMs) have recently emerged as a promising technology for future memory systems. Generally, NVMMs have many desirable properties such as high density, byte-addressability, non-volatility, low cost, and energy efficiency, at the expense of high write latency, high write power consumption, and limited write endurance. NVMMs have become a competitive alternative to Dynamic Random Access Memory (DRAM), and will fundamentally change the landscape of memory systems. They bring many research opportunities as well as challenges on system architectural designs, memory management in operating systems (OSes), and programming models for hybrid memory systems. In this article, we first revisit the landscape of emerging NVMM technologies, and then survey the state-of-the-art studies of NVMM technologies. We classify those studies with a taxonomy according to different dimensions such as memory architectures, data persistence, performance improvement, energy saving, and wear leveling. Second, to demonstrate the best practices in building NVMM systems, we introduce our recent work on hybrid memory system designs from the dimensions of architectures, systems, and applications. At last, we present our vision of future research directions of NVMMs and shed some light on design challenges and opportunities.

21 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors developed an artificial intelligence (AI) system, named CytoBrain, to automatically screen abnormal cervical cells to help facilitate the subsequent clinical diagnosis of the subjects.
Abstract: Identification of abnormal cervical cells is a significant problem in computer-aided diagnosis of cervical cancer. In this study, we develop an artificial intelligence (AI) system, named CytoBrain, to automatically screen abnormal cervical cells to help facilitate the subsequent clinical diagnosis of the subjects. The system consists of three main modules: 1) the cervical cell segmentation module, which is responsible for efficiently extracting cell images in a whole slide image (WSI); 2) the cell classification module based on a compact visual geometry group (VGG) network called CompactVGG, which is the key part of the system and is used for building the cell classifier; 3) the visualized human-aided diagnosis module, which can automatically diagnose a WSI based on the classification results of its cells, and provides two visual display modes for users to review and modify. For model construction and validation, we have developed a dataset containing 198 952 cervical cell images (60 238 positive, 25 001 negative, and 113 713 junk) from samples of 2 312 adult women. Since CompactVGG is the key part of CytoBrain, we conduct comparison experiments to evaluate its time and classification performance on our developed dataset and two public datasets separately. The comparison results with VGG11, the most efficient member of the VGG family, show that CompactVGG takes less time for both model training and sample testing. Compared with three sophisticated deep learning models, CompactVGG consistently achieves the best classification performance. The results illustrate that the system based on CompactVGG is efficient and effective and can support large-scale cervical cancer screening.

21 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a robust needle localization and enhancement algorithm based on deep learning and beam steering methods, with three key innovations; the first is beam steering to maximize the reflection intensity of the needle, which helps detect and locate the needle precisely.
Abstract: Ultrasound (US) imaging is clinically used to guide needle insertions because it is safe, real-time, and low cost. The localization of the needle in the ultrasound image, however, remains a challenging problem due to specular reflection off the smooth surface of the needle, speckle noise, and similar line-like anatomical features. This study presents a novel robust needle localization and enhancement algorithm based on deep learning and beam steering methods with three key innovations. First, we employ beam steering to maximize the reflection intensity of the needle, which helps us to detect and locate the needle precisely. Second, we modify U-Net, an end-to-end network commonly used in biomedical segmentation, by using two branches instead of one in the last up-sampling layer and adding three layers after the last down-sampling layer. Thus, the modified U-Net can segment the needle shaft region in real time, detect the needle tip landmark location, and determine whether an image frame contains the needle, all in one shot. Third, we develop a needle fusion framework that employs the outputs of the multi-task deep learning (MTL) framework to precisely locate the needle tip and enhance needle shaft visualization. Thus, the proposed algorithm can not only greatly reduce the processing time, but also significantly increase the needle localization accuracy and enhance the needle visualization for real-time clinical intervention applications.

10 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a method named CDWBMS, which integrates a small number of verified circRNA-disease associations with plenty of circRNA information to discover novel circRNA-disease associations; it adopts an improved weighted biased meta-structure search algorithm on a heterogeneous network to predict associations between circRNAs and diseases.
Abstract: Circular RNAs (circRNAs) are RNAs with a special closed-loop structure, which play important roles in tumors and other diseases. Because biological experiments are time-consuming, computational methods for predicting associations between circRNAs and diseases have become a better choice. Taking the limited number of verified circRNA-disease associations into account, we propose a method named CDWBMS, which integrates a small number of verified circRNA-disease associations with plenty of circRNA information to discover novel circRNA-disease associations. CDWBMS adopts an improved weighted biased meta-structure search algorithm on a heterogeneous network to predict associations between circRNAs and diseases. In terms of leave-one-out cross-validation (LOOCV), 10-fold cross-validation and 5-fold cross-validation, CDWBMS yields area under the receiver operating characteristic curve (AUC) values of 0.921 6, 0.917 2 and 0.900 5, respectively. Furthermore, case studies show that CDWBMS can predict unknown circRNA-disease associations. In conclusion, CDWBMS is an effective method for exploring disease-related circRNAs.

10 citations


Journal ArticleDOI
TL;DR: In this paper, a unified formulation that employs proper label constraints for training models while simultaneously performing pseudo-labeling is proposed, which leverages similarities and differences in the feature space using the same candidate label constraints and disambiguates noise labels.
Abstract: Partial label learning is a weakly supervised learning framework in which each instance is associated with multiple candidate labels, among which only one is the ground-truth label. This paper proposes a unified formulation that employs proper label constraints for training models while simultaneously performing pseudo-labeling. Unlike existing partial label learning approaches that only leverage similarities in the feature space without utilizing label constraints, our pseudo-labeling process leverages similarities and differences in the feature space using the same candidate label constraints and then disambiguates noise labels. Extensive experiments on artificial and real-world partial label datasets show that our approach significantly outperforms state-of-the-art counterparts on classification prediction.

8 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a tuning system based on attention-based deep reinforcement learning named WATuning, which can adapt to the changes of workload characteristics and optimize the system performance efficiently and effectively.
Abstract: Configuration tuning is essential to optimize the performance of systems (e.g., databases, key-value stores). High performance usually indicates high throughput and low latency. At present, most system tuning tasks are performed manually (e.g., by database administrators), but it is hard for humans to achieve high performance through tuning across various types of systems and various environments. In recent years, there have been some studies on tuning traditional database systems, but all these methods have some limitations. In this article, we put forward an attention-based deep reinforcement learning tuning system named WATuning, which can adapt to changes in workload characteristics and optimize system performance efficiently and effectively. Firstly, we design the core algorithm named ATT-Tune for WATuning to achieve the tuning task of systems. The algorithm uses workload characteristics to generate a weight matrix that acts on the internal metrics of systems, and then ATT-Tune uses the weighted internal metrics to select the appropriate configuration. Secondly, WATuning can generate multiple instance models according to changes in the workload so that it can complete targeted recommendation services for different types of workloads. Finally, WATuning can also dynamically fine-tune itself according to the constantly changing workload in practical applications so that it can better fit the actual environment to make recommendations. The experimental results show that, compared with CDBTune, an existing state-of-the-art tuning method, WATuning improves throughput by 52.6% and reduces latency by 31%.
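As a rough illustration of the weighting step described above (a sketch of the general idea, not the authors' ATT-Tune implementation; all names and shapes are hypothetical), the snippet below derives attention weights from workload characteristics and applies them to a system's internal metrics before configuration selection:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weight_internal_metrics(workload_features, internal_metrics, W):
    # Hypothetical weighting: one attention score per internal metric,
    # derived from the workload characteristics via a matrix W.
    scores = W @ workload_features
    weights = softmax(scores)
    return weights * internal_metrics  # weighted metrics fed to the tuner

# Toy usage with made-up dimensions: 4 workload features, 6 internal metrics.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
weighted = weight_internal_metrics(rng.random(4), rng.random(6), W)
print(weighted)
```

In the full system, W would be learned by the reinforcement learning agent rather than sampled randomly.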

7 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a communication library, called FDGLib, which can easily scale out any existing single FPGA-based graph accelerator to a distributed version in a data center, with minimal hardware engineering efforts.
Abstract: With the rapid growth of real-world graphs, the size of which can easily exceed the on-chip (board) storage capacity of an accelerator, processing large-scale graphs on a single Field Programmable Gate Array (FPGA) becomes difficult. Multi-FPGA acceleration is of great necessity and importance. Many cloud providers (e.g., Amazon, Microsoft, and Baidu) now expose FPGAs to users in their data centers, providing opportunities to accelerate large-scale graph processing. In this paper, we present a communication library, called FDGLib, which can easily scale out any existing single FPGA-based graph accelerator to a distributed version in a data center, with minimal hardware engineering efforts. FDGLib provides six APIs that can be easily used and integrated into any FPGA-based graph accelerator with only a few lines of code modifications. Considering the torus-based FPGA interconnection in data centers, FDGLib also improves communication efficiency using simple yet effective torus-friendly graph partition and placement schemes. We interface FDGLib with AccuGraph, a state-of-the-art graph accelerator. Our results on a 32-node Microsoft Catapult-like data center show that the distributed AccuGraph can be 2.32x and 4.77x faster than ForeGraph, a state-of-the-art distributed FPGA-based graph accelerator, and Gemini, a distributed CPU-based graph system, with better scalability.

6 citations


Journal ArticleDOI
TL;DR: In this article, a survey of 3D shape editing methods from the geometric viewpoint to neural deformation techniques and categorization them into organic shape editing and man-made model editing methods is presented.
Abstract: 3D shape editing is widely used in a range of applications such as movie production, computer games and computer-aided design. It is also a popular research topic in computer graphics and computer vision. In past decades, researchers have developed a series of editing methods to make the editing process faster, more robust, and more reliable. Traditionally, the deformed shape is determined by the optimal transformation and weights for an energy formulation. With the increasing availability of 3D shapes on the Internet, data-driven methods were proposed to improve the editing results. More recently, as deep neural networks became popular, many deep learning based editing methods have been developed in this field, which are naturally data-driven. We mainly survey recent research studies, from the geometric viewpoint to the emerging neural deformation techniques, and categorize them into organic shape editing methods and man-made model editing methods. Both traditional methods and recent neural network based methods are reviewed.

6 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a robust 3D object tracking method with adaptively weighted local bundles called AWLB tracker to handle more complicated cases, where each bundle represents a local region containing a set of local features.
Abstract: 3D object tracking from a monocular RGB image is a challenging task. Although popular color and edge-based methods have been well studied, they are only applicable to certain cases, and new solutions to the challenges of real environments must be developed. In this paper, we propose a robust 3D object tracking method with adaptively weighted local bundles, called the AWLB tracker, to handle more complicated cases. Each bundle represents a local region containing a set of local features. To alleviate the negative effect of features in low-confidence regions, the bundles are adaptively weighted using a spatially-variant weighting function based on the confidence values of the involved energy terms. Therefore, in each frame, the weights of the energy terms in each bundle are adapted to different situations and different regions of the same frame. Experiments show that the proposed method can improve the overall accuracy in challenging cases. We then verify the effectiveness of the proposed confidence-based adaptive weighting method using ablation studies and show that the proposed method outperforms existing single-feature methods and multi-feature methods without adaptive weighting.
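One plausible way to write the adaptive weighting described above (our reading of the abstract, not the paper's exact notation) is as a confidence-weighted sum of per-bundle energy terms:

```latex
E(\theta) \;=\; \sum_{b \in \mathcal{B}} \sum_{k} w\!\left(c_{b,k}\right) E_{b,k}(\theta)
```

where $\mathcal{B}$ is the set of local bundles, $E_{b,k}(\theta)$ is the $k$-th energy term of bundle $b$, $c_{b,k}$ is its confidence value, and $w(\cdot)$ is the spatially-variant weighting function that down-weights terms from low-confidence regions.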

6 citations


Journal ArticleDOI
TL;DR: In this paper, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions.
Abstract: Sampling is a fundamental method for generating data subsets. As many data analysis methods are developed based on probability distributions, maintaining distributions when sampling can help to ensure good data analysis performance. However, sampling a minimum subset while maintaining probability distributions is still a problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires that the sampled subset pass the distribution test to ensure the distribution. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with a size of 60 000 show that when the significance level, α, is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1% and 96.7% of the samples based on their Bayesian networks (ASIA, ALARM, HEPAR2, and ANDES, respectively). When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.
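To make the chi-square acceptance test concrete, here is a minimal sketch (not the paper's scoring functions) that checks whether a sampled subset preserves a discrete variable's distribution at significance level α = 0.05, using scipy.stats.chisquare:

```python
import numpy as np
from scipy.stats import chisquare

def passes_distribution_test(population, subset, alpha=0.05):
    # Chi-square goodness-of-fit: do the subset's value counts match
    # the expected counts implied by the population distribution?
    values = np.unique(population)
    pop_freq = np.array([(population == v).mean() for v in values])
    observed = np.array([(subset == v).sum() for v in values])
    expected = pop_freq * len(subset)
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value > alpha  # fail to reject: distribution is preserved

rng = np.random.default_rng(1)
population = rng.choice([0, 1, 2], size=60_000, p=[0.5, 0.3, 0.2])
subset = rng.choice(population, size=500, replace=False)
print(passes_distribution_test(population, subset))
```

The paper's algorithm goes further by scoring candidate samples so that a much smaller subset can still pass such tests for every conditional distribution in the Bayesian network.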

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a new approach based on similarity fusion to predict synthetic lethality (SL) pairs, where multiple types of gene similarity measures are integrated and k-NN is applied to achieve the similarity-based classification task between gene pairs.
Abstract: The synthetic lethality (SL) relationship arises when a combination of deficiencies in two genes leads to cell death, whereas a deficiency in either one of the two genes does not. The survival of mutant tumor cells depends on the SL partners of the mutant gene, so cancer cells can be selectively killed by inhibiting the SL partners of the oncogenic genes while normal cells are spared. Therefore, there is an urgent need to develop more efficient computational methods of SL pair identification for cancer targeted therapy. In this paper, we propose a new approach based on similarity fusion to predict SL pairs. Multiple types of gene similarity measures are integrated, and the k-nearest neighbors algorithm (k-NN) is applied to achieve the similarity-based classification task between gene pairs. As a similarity-based method, our method demonstrated excellent performance in multiple experiments. Besides its effectiveness, its ease of use and extensibility can also make our method more widely used in practice.
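A minimal sketch of the similarity-fusion-plus-k-NN idea (hypothetical fusion weights and toy data, not the paper's exact pipeline): fuse several gene-pair similarity matrices into one, then label a candidate pair by majority vote among its k most similar labeled pairs.

```python
import numpy as np

def knn_predict(fused_sim, labels, query_idx, k=5):
    # fused_sim: (n, n) fused similarity between gene pairs;
    # labels: 0/1 synthetic-lethality labels for the n labeled pairs.
    sims = fused_sim[query_idx].copy()
    sims[query_idx] = -np.inf          # exclude the query itself
    neighbors = np.argsort(sims)[-k:]  # indices of the k most similar pairs
    return int(labels[neighbors].mean() >= 0.5)

# Toy usage: fuse two similarity sources with equal (assumed) weights.
rng = np.random.default_rng(2)
sim_a, sim_b = rng.random((10, 10)), rng.random((10, 10))
fused = 0.5 * sim_a + 0.5 * sim_b
labels = rng.integers(0, 2, size=10)
print(knn_predict(fused, labels, query_idx=3))
```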

Journal ArticleDOI
TL;DR: Automatic text summarization (ATS) has achieved impressive performance thanks to recent advances in deep learning (DL) and the availability of large-scale corpora as discussed by the authors, however, there is still a lack of comprehensive literature review for DL-based ATS approaches.
Abstract: Automatic text summarization (ATS) has achieved impressive performance thanks to recent advances in deep learning (DL) and the availability of large-scale corpora. The key points in ATS are to estimate the salience of information and to generate coherent results. Recently, a variety of DL-based approaches have been developed to better consider these two aspects. However, there is still a lack of a comprehensive literature review of DL-based ATS approaches. The aim of this paper is to comprehensively review significant DL-based approaches that have been proposed in the literature with respect to the notion of generic ATS tasks and provide a walk-through of their evolution. We first give an overview of ATS and DL. Comparisons of the datasets commonly used for model training, validation, and evaluation are also given. Then we summarize single-document summarization approaches. After that, an overview of multi-document summarization approaches is given. We further analyze the performance of the popular ATS models on common datasets. Various popular approaches can be employed for different ATS tasks. Finally, we propose potential research directions in this fast-growing field. We hope this exploration can provide new insights into future research of DL-based ATS.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a source-free unsupervised domain adaptation method, Sample Transport Domain Adaptation (STDA), which jointly models domain adaptation and sample transport learning, achieving knowledge adaptation to the target domain and attaining confident labels for it.
Abstract: Unsupervised domain adaptation (UDA) has achieved great success in handling cross-domain machine learning applications. It typically benefits the model training of an unlabeled target domain by leveraging knowledge from a labeled source domain. For this purpose, the minimization of the marginal distribution divergence and the conditional distribution divergence between the source and the target domain is widely adopted in existing work. Nevertheless, for the sake of privacy preservation, the source domain usually provides not its training data but only a trained predictor (e.g., a classifier). This renders the above studies infeasible because the marginal and conditional distributions of the source domain are incalculable. To this end, this article proposes a source-free UDA method which jointly models domain adaptation and sample transport learning, namely Sample Transport Domain Adaptation (STDA). Specifically, STDA constructs the pseudo source domain according to the aggregated decision boundaries of multiple source classifiers made on the target domain. Then, it refines the pseudo source domain by augmenting it through transporting those target samples with high confidence, and consequently generates labels for the target domain. We train the STDA model by performing domain adaptation with sample transport between the above steps in an alternating manner, and eventually achieve knowledge adaptation to the target domain and attain confident labels for it. Finally, evaluation results have validated the effectiveness and superiority of the proposed method.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a novel matrix factorization method, called collaborative matrix factorization with soft regularization (SRCMF), which improves prediction performance by combining drug and target similarity information with matrix factorization.
Abstract: Identifying potential drug-target interactions (DTIs) is critical in drug discovery. Drug-target interaction prediction methods based on collaborative filtering have demonstrated attractive prediction performance. However, many corresponding models cannot accurately express the relationship between similarity features and DTI features. In order to represent this correlation rationally, we propose a novel matrix factorization method, called collaborative matrix factorization with soft regularization (SRCMF). SRCMF improves the prediction performance by combining the drug and the target similarity information with matrix factorization. In contrast to general collaborative matrix factorization, the fundamental idea of SRCMF is to make the similarity features and the potential features of DTI approximate, not identical. Specifically, SRCMF obtains low-rank feature representations of drug similarity and target similarity, and then uses a soft regularization term to constrain the approximation between drug (target) similarity features and drug (target) potential features of DTI. To comprehensively evaluate the prediction performance of SRCMF, we conduct cross-validation experiments under three different settings. In terms of the area under the precision-recall curve (AUPR), SRCMF achieves better prediction results than six state-of-the-art methods. Besides, under different noise levels of similarity data, the prediction performance of SRCMF is much better than that of collaborative matrix factorization. In conclusion, SRCMF is robust, leading to performance improvements in drug-target interaction prediction.
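Based on the description above, one plausible form of the SRCMF objective (our reconstruction for illustration, not quoted from the paper) replaces the hard requirement that similarity features equal the latent DTI features with soft penalty terms:

```latex
\min_{A,B}\; \lVert Y - AB^{\mathsf T} \rVert_F^2
+ \lambda\left(\lVert A \rVert_F^2 + \lVert B \rVert_F^2\right)
+ \mu_d \lVert A - F_d \rVert_F^2
+ \mu_t \lVert B - F_t \rVert_F^2
```

where $Y$ is the drug-target interaction matrix, $A$ and $B$ are the latent drug and target features, $F_d$ and $F_t$ are low-rank features extracted from the drug and target similarity matrices, and $\mu_d$, $\mu_t$ control how closely the latent features must approximate, rather than equal, the similarity features.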

Journal ArticleDOI
TL;DR: In this paper, the authors present a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture, focusing on the cache and memory subsystems, analyzing the characteristics that impact the high performance computing applications.
Abstract: This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact the high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
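For reference, the roofline model used in this analysis bounds attainable performance by the smaller of the machine's peak compute rate and the product of arithmetic intensity and peak memory bandwidth:

```latex
P_{\text{attainable}} \;=\; \min\left(P_{\text{peak}},\; I \times B_{\text{peak}}\right)
```

where $I$ is the kernel's arithmetic intensity (flops per byte moved) and $B_{\text{peak}}$ is the peak memory bandwidth. This is the standard roofline formulation; the specific parameter values for Phytium 2000+ come from the paper's micro-benchmarks.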

Journal ArticleDOI
TL;DR: SE-Chain, as presented in this article, is a scale-out blockchain model that improves storage scalability under the premise of ensuring safety and achieves efficient retrieval; it consists of three parts: the data layer, the processing layer and the storage layer.
Abstract: Massive data is written to blockchain systems for safekeeping. However, existing blockchain protocols still demand that each full node store the entire chain. Most nodes quit because they are unable to grow their storage space with the size of the data. As the number of nodes decreases, the security of blockchains is significantly reduced. We present SE-Chain, a novel scale-out blockchain model that improves storage scalability under the premise of ensuring safety and achieves efficient retrieval. SE-Chain consists of three parts: the data layer, the processing layer and the storage layer. In the data layer, each transaction is stored in the AB-M tree (Adaptive Balanced Merkle tree), which adaptively combines the advantages of the balanced binary tree (quick retrieval) and the Merkle tree (quick verification). In the processing layer, the full nodes store the part of the complete chain selected by the duplicate ratio regulation algorithm. Meanwhile, the node reliability verification method is used to increase the stability of full nodes and reduce the risk of imperfect data recovery caused by the reduction of duplicate numbers in the storage layer. The experimental results on real datasets show that the query time of SE-Chain based on the AB-M tree is reduced by 17% when 16 nodes exist. Overall, SE-Chain substantially improves storage scalability and implements efficient querying of transactions.
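To illustrate the Merkle half of the AB-M tree idea (a generic sketch, not SE-Chain's actual data structure), each node can store a hash over its own transaction and its children's hashes, so verification stays fast while the balanced-binary shape keeps keyed retrieval logarithmic:

```python
import hashlib

class ABMNode:
    # Hypothetical node blending a balanced BST (keyed lookups)
    # with Merkle-style hashing (fast verification).
    def __init__(self, key, tx, left=None, right=None):
        self.key, self.tx = key, tx
        self.left, self.right = left, right
        self.hash = self.compute_hash()

    def compute_hash(self):
        child_hashes = b"".join(
            c.hash for c in (self.left, self.right) if c is not None)
        return hashlib.sha256(
            str(self.key).encode() + self.tx + child_hashes).digest()

# Toy usage: a three-node balanced tree over three transactions.
leaf1 = ABMNode(1, b"tx-a")
leaf2 = ABMNode(3, b"tx-b")
root = ABMNode(2, b"tx-c", left=leaf1, right=leaf2)
print(root.hash.hex())  # changes if any transaction below is tampered with
```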

Journal ArticleDOI
Ling-Yun Dai, Jin-Xing Liu, Rong Zhu, Juan Wang, Sha-Sha Yuan
TL;DR: In this article, a logistic weighted profile-based bi-random walk method (LWBRW) is designed to infer potential miRNA-disease associations (MDAs) based on known MDAs.
Abstract: MicroRNAs (miRNAs) exert an enormous influence on cell differentiation, biological development and the onset of diseases. Because predicting potential miRNA-disease associations (MDAs) by biological experiments usually requires considerable time and money, a growing number of researchers are working on developing computational methods to predict MDAs. High accuracy is critical for prediction. To date, many algorithms have been proposed to infer novel MDAs. However, they may still have some drawbacks. In this paper, a logistic weighted profile-based bi-random walk method (LWBRW) is designed to infer potential MDAs based on known MDAs. In this method, three networks (i.e., a miRNA functional similarity network, a disease semantic similarity network and a known MDA network) are constructed first. In the process of building the miRNA network and the disease network, Gaussian interaction profile (GIP) kernel is computed to increase the kernel similarities, and the logistic function is used to extract valuable information and protect known MDAs. Next, the known MDA matrix is preprocessed by the weighted K-nearest known neighbours (WKNKN) method to reduce the number of false negatives. Then, the LWBRW method is applied to infer novel MDAs by bi-randomly walking on the miRNA network and the disease network. Finally, the predictive ability of the LWBRW method is confirmed by the average AUC of 0.939 3 (0.006 1) in 5-fold cross-validation (CV) and the AUC value of 0.976 3 in leave-one-out cross-validation (LOOCV). In addition, case studies also show the outstanding ability of the LWBRW method to explore potential MDAs.
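A minimal sketch of a bi-random walk with restart (a generic BiRW-style iteration under assumed row-normalized similarity matrices; LWBRW additionally applies logistic weighting, GIP kernels, and WKNKN preprocessing):

```python
import numpy as np

def bi_random_walk(M, D, A, alpha=0.5, iters=10):
    # M: row-normalized miRNA similarity; D: row-normalized disease
    # similarity; A: known miRNA-disease association matrix.
    R = A / max(A.sum(), 1.0)
    for _ in range(iters):
        left = alpha * M @ R + (1 - alpha) * A   # walk on the miRNA network
        right = alpha * R @ D + (1 - alpha) * A  # walk on the disease network
        R = (left + right) / 2                   # combine both walks
    return R  # higher scores suggest candidate associations

# Toy usage with made-up data: 4 miRNAs, 3 diseases.
rng = np.random.default_rng(3)
M = rng.random((4, 4)); M /= M.sum(axis=1, keepdims=True)
D = rng.random((3, 3)); D /= D.sum(axis=1, keepdims=True)
A = (rng.random((4, 3)) > 0.7).astype(float)
print(bi_random_walk(M, D, A))
```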

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an approach combining textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time; the evaluation showed that 83.4% of the duplicates can be found on average when using the combined textual and change similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
Abstract: Communication and coordination between open source software (OSS) developers who do not work physically in the same location have always been challenging issues. The pull-based development model, as the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' work. However, duplicate contributions may still be submitted by more than one contributor to solve the same problem due to the parallel and uncoordinated nature of this model. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach combining textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a newly arriving contribution, we first compute the textual similarity and the change similarity between it and other existing contributions. Our method then returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that 83.4% of the duplicates can be found on average when we use the combined textual and change similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
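A minimal sketch of the combination described above (hypothetical weighting and simplified similarity measures; the paper's definitions differ): score a new pull-request against existing ones by mixing a textual similarity of titles/descriptions with the Jaccard overlap of changed files.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    # Stand-in for the paper's textual similarity (e.g., TF-IDF cosine).
    return SequenceMatcher(None, a, b).ratio()

def change_similarity(files_a, files_b):
    # Jaccard overlap of changed-file sets as a simple change similarity.
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 0.0

def duplicate_candidates(new_pr, existing_prs, w_text=0.5, top_n=3):
    scored = [(w_text * text_similarity(new_pr["text"], pr["text"])
               + (1 - w_text) * change_similarity(new_pr["files"], pr["files"]),
               i) for i, pr in enumerate(existing_prs)]
    scored.sort(reverse=True)
    return [existing_prs[i] for _, i in scored[:top_n]]

new_pr = {"text": "fix null pointer in parser", "files": {"parser.c"}}
existing = [{"text": "fix NPE in parser module", "files": {"parser.c", "util.c"}},
            {"text": "update docs", "files": {"README.md"}}]
print(duplicate_candidates(new_pr, existing))
```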

Journal ArticleDOI
TL;DR: In this paper, a knowledge graph for MOOCs on four major platforms: Coursera, EDX, XuetangX, and ICourse is presented, which stores five classes, 11 kinds of relations and 52 779 entities with their corresponding properties, amounting to more than 300 000 triples.
Abstract: To use educational resources efficiently and dig out the nature of relations among MOOCs (massive open online courses), a knowledge graph was built for MOOCs on four major platforms: Coursera, EDX, XuetangX, and ICourse. This paper demonstrates the whole process of educational knowledge graph construction for reference. This knowledge graph, the largest knowledge graph of MOOC resources at present, stores and represents five classes, 11 kinds of relations and 52 779 entities with their corresponding properties, amounting to more than 300 000 triples. Notably, 24 188 concepts are extracted from text attributes of MOOCs and linked directly with corresponding Wikipedia entries or the semantically closest entries, which provides a normalized representation of knowledge and a far more precise description of MOOCs than merely enriching words with explanatory links. Besides, prerequisites discovered by direct extraction are viewed as an essential supplement that augments the connectivity of the knowledge graph. This knowledge graph can be considered a collection of unified MOOC resources for learners and abundant data for researchers on MOOC-related applications, such as prerequisite mining.

Journal ArticleDOI
TL;DR: The authors conducted an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models and observed that dirty-data impacts are related to the error type, the error rate, and the data size.
Abstract: Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied to the selection of an appropriate model in consideration of data quality and to the determination of the share of data to clean. However, little research has focused on exploring such a relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that dirty-data impacts are related to the error type, the error rate, and the data size. Based on these findings, we suggest users leverage our proposed metrics, sensibility and the data quality inflection point, for model selection and data cleaning.

Journal ArticleDOI
TL;DR: PIM-Align as mentioned in this paper is an application-driven near-data processing architecture for sequence alignment, which takes advantage of 3D-stacked dynamic random access memory (DRAM) technology.
Abstract: Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase of sequence alignment has been well explored in computing-centric architectures on field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and graphics processing units (GPUs) (e.g., the Smith-Waterman algorithm). Compared with the extension phase, the seeding phase is more critical and essential. However, the seeding phase is memory bound, i.e., it involves fine-grained random memory access and offers limited parallelism on conventional systems. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to address these problems. This paper describes "PIM-Align", an application-driven near-data processing architecture for sequence alignment. In order to achieve memory-capacity-proportional performance by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions, and a specialized hardware prefetcher for the memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1 820x speedup when compared with the best available ASIC implementation and software running on a 32-thread CPU, respectively.

Journal ArticleDOI
Yan-Hong Fan, Meiqin Wang, Yanbin Li, Kai Hu, Muzhou Li
TL;DR: Based on the recovered key, an attacker could create a malicious firmware update and load it onto Philips Hue lamps to cause Internet of Things (IoT) security issues, as mentioned in this paper; the proposed scheme, applied in IoT terminal devices, includes two aspects of design (i.e., bootloader and application layer).
Abstract: In IEEE S&P 2017, Ronen et al. exploited side-channel power analysis (SCPA) and approximately 5 000 power traces to recover the global AES-CCM key that Philips Hue lamps use to decrypt and authenticate new firmware. Based on the recovered key, the attacker could create a malicious firmware update and load it onto Philips Hue lamps to cause Internet of Things (IoT) security issues. Inspired by the work of Ronen et al., we propose an AES-CCM-based firmware update scheme against SCPA and denial of service (DoS) attacks. The proposed scheme, applied in IoT terminal devices, includes two aspects of design (i.e., bootloader and application layer). Firstly, in the bootloader, the number of updates per unit time is limited to prevent the attacker from acquiring a sufficient number of useful traces in a short time, which can effectively counter an SCPA attack. Secondly, in the application layer, using the proposed handshake protocol, the IoT device can access the IoT server to regain update permission, which can defend against DoS attacks. Moreover, on the STM32F405+M25P40 hardware platform, we implement Philips' scheme and the proposed modified scheme. Experimental results show that compared with the firmware update scheme of Philips Hue smart lamps, the proposed scheme additionally requires only 2.35 KB of Flash memory and a maximum of 0.32 s of update time to effectively enhance the security of the AES-CCM-based firmware update process.
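The bootloader-side countermeasure amounts to rate limiting: refuse to start a new firmware update once too many have occurred in the current window, so an attacker cannot collect enough power traces quickly. A simplified sketch follows (hypothetical limits; the real scheme runs inside an embedded bootloader, not Python):

```python
import time

class UpdateRateLimiter:
    # Allow at most max_updates firmware updates per window_s seconds,
    # starving a side-channel attacker of fresh power traces.
    def __init__(self, max_updates=2, window_s=3600):
        self.max_updates, self.window_s = max_updates, window_s
        self.timestamps = []

    def allow_update(self):
        now = time.monotonic()
        self.timestamps = [t for t in self.timestamps
                           if now - t < self.window_s]
        if len(self.timestamps) >= self.max_updates:
            return False  # deny: update quota for this window is exhausted
        self.timestamps.append(now)
        return True

limiter = UpdateRateLimiter()
print([limiter.allow_update() for _ in range(3)])  # [True, True, False]
```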

Journal ArticleDOI
TL;DR: In this paper, a Cache-cOnscious Learned INdex (COLIN) is proposed, which adopts an in-place approach to support insertions and reserves some empty slots in a node to optimize the node's data placement.
Abstract: The recently proposed learned index has higher query performance and space efficiency than the conventional B+-tree. However, the original learned index has the problems of insertion failure and unbounded query complexity, meaning that it supports neither insertions nor bounded query complexity. Some variants of the learned index use an out-of-place strategy and a bottom-up build strategy to accelerate insertions and support bounded query complexity, but introduce additional query costs and frequent node splitting operations. Moreover, none of the existing learned indices are cache-friendly. In this paper, aiming to not only support efficient queries and insertions but also offer bounded query complexity, we propose a new learned index called COLIN (Cache-cOnscious Learned INdex). Unlike previous solutions using an out-of-place strategy, COLIN adopts an in-place approach to support insertions and reserves some empty slots in a node to optimize the node’s data placement. In particular, through model-based data placement and cache-conscious data layout, COLIN decouples the local-search boundary from the maximum error of the model. The experimental results on five workloads and three datasets show that COLIN achieves the best read/write performance among all compared indices and outperforms the second best index by 18.4%, 6.2%, and 32.9% on the three datasets, respectively.
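To make the in-place approach concrete, here is a toy gapped-array node (a hypothetical sketch, far simpler than COLIN itself): a linear model predicts a slot, reserved empty slots absorb inserts in place, and lookups only search a bounded window around the prediction.

```python
class GappedNode:
    # Toy learned-index node: slot ~ slope * key + intercept, with
    # empty slots (None) reserved so insertions stay in place.
    def __init__(self, capacity, slope, intercept, window=4):
        self.slots = [None] * capacity
        self.slope, self.intercept, self.window = slope, intercept, window

    def _predict(self, key):
        pos = int(self.slope * key + self.intercept)
        return max(0, min(len(self.slots) - 1, pos))

    def insert(self, key):
        p = self._predict(key)
        for d in range(len(self.slots)):       # nearest empty slot to p
            for q in (p - d, p + d):
                if 0 <= q < len(self.slots) and self.slots[q] is None:
                    self.slots[q] = key
                    return True
        return False                           # node full: would trigger a split

    def lookup(self, key):
        p = self._predict(key)                 # bounded local search around p
        lo, hi = max(0, p - self.window), min(len(self.slots), p + self.window + 1)
        return key in self.slots[lo:hi]

node = GappedNode(capacity=16, slope=0.15, intercept=0)
for k in (10, 20, 40, 55):
    node.insert(k)
print(node.lookup(40), node.lookup(99))  # True False
```

The real structure additionally keeps keys ordered, retrains models, and lays nodes out to match cache lines; the point here is only how gaps plus a bounded search window decouple lookup cost from model error.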

Journal ArticleDOI
TL;DR: The findings suggest that the proposed blockchain-based music wallet model for safe and legal listening of audio files shows acceptable performance differences from an ordinary audio player.
Abstract: The works produced within the music industry are presented to their listeners on digital platforms, taking advantage of technology. The problems of the past, such as pirated cassettes and CDs, have given way to the problem of copyright protection on digital platforms today. Blockchain is one of the most reliable and preferred technologies in recent times regarding data integrity and data security. In this study, a blockchain-based music wallet model is proposed for safe and legal listening of audio files. The user's selected audio files are converted into a blockchain structure using different techniques and algorithms and are kept securely in the user's music wallet. In the study, performance comparisons are made with the proposed model application in terms of the time an ordinary audio player takes to add new audio files to the list and the response times to the user. The findings suggest that the proposed model implementation has acceptable performance differences from an ordinary audio player.

Journal ArticleDOI
TL;DR: There are many applications of a WSN such as environmental monitoring, raising alarms for fires in forests and multi-storied buildings, monitoring habitats of wild animals, monitoring children in a kindergarten, support systems in playgrounds, monitoring indoor patients in a hospital, precision agriculture, detection of infiltration along international boundaries, tracking an object or a target, etc.
Abstract: A Wireless Sensor Network (WSN) consists of a group of tiny devices called sensors that communicate through wireless links. Sensors are used to collect data about some parameters and send the collected data for further processing to a designated station. The designated station is often called the command and control center (CCC), fusion center (FC), or sink. Sensors forward the collected data to their leaders or cluster heads, which in turn send it to the centralized station. There are many applications of a WSN such as environmental monitoring, raising alarms for fires in forests and multi-storied buildings, monitoring habitats of wild animals, monitoring children in a kindergarten, support systems in playgrounds, monitoring indoor patients in a hospital, precision agriculture, detection of infiltration along international boundaries, tracking an object or a target, etc.

Journal ArticleDOI
TL;DR: In this article, a lightweight approach to static program slicing, called Symbolic Program Slicing (SymPas), is proposed, which works as a dataflow analysis on LLVM (low-level virtual machine).
Abstract: Program slicing is a technique for simplifying programs by focusing on selected aspects of their behavior. Current mainstream static slicing methods operate on the program dependence graph (PDG) or the system dependence graph (SDG), but these rich graph representations may be too expensive for some users to build. In this paper we study a lightweight approach to static program slicing, called Symbolic Program Slicing (SymPas), which works as a dataflow analysis on LLVM (low-level virtual machine). In our SymPas approach, slices are stored in symbolic forms, not in procedures being re-analyzed (cf. procedure summaries). Instead of re-analyzing a procedure multiple times to find its slices for each calling context, we calculate a single symbolic slice which can be instantiated at call sites, avoiding re-analysis. SymPas is implemented with LLVM to perform slicing on the LLVM intermediate representation (IR). For comparison, we systematically adapt the IFDS (interprocedural finite distributive subset) analysis and the SDG-based slicing method (SDG-IFDS) to statically slice IR programs. Evaluated on open-source and benchmark programs, our backward SymPas shows a factor-of-6 reduction in time cost and a factor-of-4 reduction in space cost compared with backward SDG-IFDS, thus being more efficient. In addition, the results show that after studying slices from 66 programs, ranging up to 336 800 IR instructions in size, SymPas is highly size-scalable.

Journal ArticleDOI
Bu Heng, Mingkai Dong, Jifei Yi, Binyu Zang, Haibo Chen
TL;DR: In this paper, the performance differences among persistent indexing structures, including persistent hash tables and persistent trees, were evaluated on Intel's Optane DC Persistent Memory Module (PMM).
Abstract: Persistent indexing structures are proposed in response to emerging non-volatile memory (NVM) to provide high performance yet durable indexes. However, due to the lack of real NVM hardware, many prior persistent indexing structures were evaluated via emulation, which varies a lot across different setups and differs from the real deployment. Recently, Intel has released its Optane DC Persistent Memory Module (PMM), which is the first production-ready NVM. In this paper, we revisit popular persistent indexing structures on PMM and conduct comprehensive evaluations to study the performance differences among persistent indexing structures, including persistent hash tables and persistent trees. According to the evaluation results, we find that Cacheline-Conscious Extendible Hashing (CCEH) achieves the best performance among all evaluated persistent hash tables, and Failure-Atomic ShifT B+-Tree (FAST) and Write Optimal Radix Tree (WORT) perform better than other trees. Besides, we find that the insertion performance of hash tables is heavily influenced by data locality, while the insertion latency of trees is dominated by the flush instructions. We also uncover that no existing emulation methods accurately simulate PMM for all the studied data structures. Finally, we provide three suggestions on how to fully utilize PMM for better performance, including using clflushopt/clwb with sfence instead of clflush, flushing continuous data in a batch, and avoiding data access immediately after it is flushed to PMM.

Journal ArticleDOI
TL;DR: In this article, the authors present a new memory-centric view of data accesses and divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy.
Abstract: Accesses Per Cycle (APC), Concurrent Average Memory Access Time (C-AMAT), and Layered Performance Matching (LPM) are three memory performance models that consider both data locality and memory access concurrency. The APC model measures the throughput of a memory architecture and therefore reflects the quality of service (QoS) of a memory system. The C-AMAT model provides a recursive expression for the memory access delay and therefore can be used for identifying the potential bottlenecks in a memory hierarchy. The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design. These three models have been proposed separately through prior efforts. This paper reexamines the three models under one coherent mathematical framework. More specifically, we present a new memory-centric view of data accesses. We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy. This new perspective offers new insights with a clear formulation of the memory performance considering both locality and concurrency. Consequently, the performance model can be easily understood and applied in engineering practices. As such, the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
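For context, the published C-AMAT formulation extends the classic AMAT = HitTime + MissRate × MissPenalty by folding concurrency into both terms; our rendering here (which readers should check against the paper's own notation) is:

```latex
\text{C-AMAT} \;=\; \frac{H}{C_H} \;+\; pMR \cdot \frac{pAMP}{C_M}
```

where $H$ is the hit time, $C_H$ the hit concurrency, $pMR$ the pure miss rate, $pAMP$ the average pure miss penalty, and $C_M$ the pure miss concurrency. Applying the same expression at the next memory layer yields the recursive, layer-by-layer definition that this paper's four-category cycle taxonomy reformulates.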

Journal ArticleDOI
TL;DR: The authors employ adversarial neural networks (AdvNN) to transfer feature representations from one language to another, which enables feature imitation via the competition between a sentence encoder and a rival language discriminator to generate effective representations.
Abstract: Entity relation classification aims to classify the semantic relationship between two marked entities in a given sentence, and plays a vital role in various natural language processing applications. However, existing studies focus on exploiting mono-lingual data in English, due to the lack of labeled data in other languages. How to effectively benefit from a richly-labeled language to help a poorly-labeled language is still an open problem. In this paper, we come up with a language adaptation framework for cross-lingual entity relation classification. The basic idea is to employ adversarial neural networks (AdvNN) to transfer feature representations from one language to another. Especially, such a language adaptation framework enables feature imitation via the competition between a sentence encoder and a rival language discriminator to generate effective representations. To verify the effectiveness of AdvNN, we introduce two kinds of adversarial structures, dual-channel AdvNN and single-channel AdvNN. Experimental results on the ACE 2005 multilingual training corpus show that our single-channel AdvNN achieves the best performance on both unsupervised and semi-supervised scenarios, yielding an improvement of 6.61% and 2.98% over the state-of-the-art, respectively. Compared with baselines which directly adopt a machine translation module, we find that both dual-channel and single-channel AdvNN significantly improve the performances (F1) of cross-lingual entity relation classification. Moreover, extensive analysis and discussion demonstrate the appropriateness and effectiveness of different parameter settings in our language adaptation framework.
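The adversarial competition described above is commonly implemented with a gradient reversal layer (the DANN-style trick; a generic sketch, not necessarily the paper's exact AdvNN architecture): the discriminator learns to tell languages apart while the reversed gradient pushes the encoder to make them indistinguishable.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; flips the gradient sign on backward,
    # so the encoder is trained to fool the language discriminator.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # toy sentence encoder
discriminator = nn.Linear(128, 2)                        # language discriminator

x = torch.randn(8, 300)                # fake sentence embeddings
lang = torch.randint(0, 2, (8,))       # 0 = source language, 1 = target
feats = encoder(x)
logits = discriminator(GradReverse.apply(feats, 1.0))
loss = nn.functional.cross_entropy(logits, lang)
loss.backward()                        # encoder gradients arrive reversed
```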