Author

Lay-Ki Soon

Bio: Lay-Ki Soon is an academic researcher from Monash University Malaysia Campus. The author has contributed to research in topics: XML & Web page. The author has an h-index of 9 and has co-authored 58 publications receiving 233 citations. Previous affiliations of Lay-Ki Soon include Monash University & Soongsil University.


Papers
Proceedings ArticleDOI
12 Mar 2014
TL;DR: Experimental results show that the proposed system works better than current text sentiment analysis tools, as the structure of tweets is not the same as that of regular text.
Abstract: In this paper, we present our preliminary experiments on tweet sentiment analysis. The experiment is designed to extract sentiment based on subjects that exist in tweets. It detects the sentiment that refers to a specific subject using Natural Language Processing techniques. To classify sentiment, our experiment consists of three main steps: subjectivity classification, semantic association, and polarity classification. The experiment utilizes sentiment lexicons by defining the grammatical relationship between the sentiment lexicons and the subject. Experimental results show that the proposed system works better than current text sentiment analysis tools, as the structure of tweets is not the same as that of regular text.
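The three-step pipeline described in the abstract can be sketched in miniature. The toy lexicon and the proximity-based association heuristic below are illustrative assumptions standing in for the grammatical relations the paper uses; none of the names come from the paper itself.

```python
# Toy illustration of subject-targeted lexicon-based sentiment, loosely
# following the three steps above: subjectivity classification, semantic
# association, and polarity classification.

LEXICON = {"love": 1, "great": 1, "hate": -1, "terrible": -1}

def tweet_polarity(tokens, subject, window=3):
    """Return 'positive', 'negative', or 'neutral' for `subject`.

    1. Subjectivity: a tweet with no lexicon word is treated as neutral.
    2. Association: a sentiment word counts only if it falls within
       `window` tokens of the subject (a crude stand-in for the
       grammatical relationship used in the paper).
    3. Polarity: sign of the summed associated sentiment scores.
    """
    if subject not in tokens:
        return "neutral"
    pos = tokens.index(subject)
    score = sum(
        LEXICON[t]
        for i, t in enumerate(tokens)
        if t in LEXICON and abs(i - pos) <= window
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A real system would replace the window heuristic with dependency parsing, which is what makes subject-specific sentiment differ from whole-tweet sentiment.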

32 citations

Book ChapterDOI
19 Mar 2018
TL;DR: This paper extracted and cleaned text data from the Reddit database, then trained a word embedding model based on the word2vec skip-gram model, which achieves a 2% improvement in precision over the next best score.
Abstract: With the creation of word embeddings, research areas around natural language processing, such as sentiment analysis and machine translation, have improved. This has been made possible by the limitless amount of text data available on the internet and the usage of a simple, two-layer neural network. However, it remains to be seen whether the domain knowledge used to train word embeddings has an impact on the task the embeddings are being used for, based on the domain knowledge of the task itself. In this paper, we extracted and cleaned text data from the Reddit database, followed by training a word embedding model based on the word2vec skip-gram model. Then, the features of this model were used to train a random forest classifier for classifying cyberbullying comments. Our model was benchmarked against four pre-trained word embeddings, as well as hand-crafted feature extraction methods. The results show that the domain knowledge of word embeddings does play a part in the task they are used for, as our model achieves a 2% improvement in precision over the next best score.
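Two pieces of the pipeline above can be shown concretely: the (center, context) pair generation that defines the skip-gram objective, and the feature step of averaging word vectors to represent a comment before classification. The tiny vectors in the usage are made up; a real model would learn them from the Reddit corpus.

```python
# Sketch of skip-gram data preparation (as in word2vec) and of turning a
# comment into a fixed-length feature vector by averaging word vectors.

def skipgram_pairs(tokens, window=2):
    """(center, context) training pairs as used by the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def comment_vector(tokens, embeddings, dim=2):
    """Average the vectors of known words; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]
```

The averaged vectors are what would be handed to the random forest classifier; in practice this is done with a library such as gensim rather than by hand.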

25 citations

Proceedings ArticleDOI
23 Nov 2019
TL;DR: The experimental results show that BLSTM paired with ELMo performs best in detecting cyberbullying texts, while GRU is the most time-efficient; analysis of false negatives highlights the limitations of word embedding models on top of the GRU algorithm in cyberbullying detection.
Abstract: Cyberbullying detection has become a pressing need in Internet usage governance due to its harmful consequences. Different approaches have been proposed to tackle this problem, including deep learning. In this paper, an empirical study is conducted to evaluate the effectiveness and efficiency of deep learning algorithms, coupled with word embeddings, in detecting cyberbullying texts. Three deep learning algorithms were evaluated, namely GRU, LSTM and BLSTM. Data pre-processing steps, including oversampling, were performed on the selected social media datasets. For feature representations, four different word embedding models were explored: word2vec, GloVe, Reddit and ELMo. ELMo accounts for word context by capturing information from the word's surroundings, which eliminates some of the shortcomings of pre-trained word embedding models. For more reliable results, 10-fold cross-validation was applied. The experimental results show that BLSTM performs best with ELMo in detecting cyberbullying texts. The efficiency of each model is also measured by calculating the average time taken to train each model; GRU performs best in terms of time efficiency. Based on the analysis of false negative cases, three observations were made, which highlight the limitations of word embedding models on top of the GRU algorithm in cyberbullying detection.
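Two of the methodology steps mentioned above, oversampling of the minority class and 10-fold cross-validation splitting, can be sketched without any deep learning framework. The function names are my own, and the paper's actual tooling is not specified here.

```python
# Pure-Python sketch of two preprocessing steps: random oversampling to
# balance class counts, and k-fold cross-validation index splits.

import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are balanced."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    target = max(len(v) for v in by_label.values())
    out = []
    for y, group in by_label.items():
        padded = group + [rng.choice(group) for _ in range(target - len(group))]
        out.extend((s, y) for s in padded)
    return out

def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) lists for k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Oversampling matters here because cyberbullying comments are typically a small minority of a social media dataset, which would otherwise bias the classifier toward the negative class.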

19 citations

Book ChapterDOI
29 May 2012
TL;DR: In this paper, the authors use the verb as a determinant in substantiating the existence of a protagonist, with the assistance of WordNet; the experimental results show that it is viable to use verbs in identifying named entities, particularly the "people" category, and that the approach can be applied to small texts.
Abstract: Named entity recognition (NER) has been a well-studied problem in the area of text mining for locating atomic elements into predefined categories, where "name of people" is one of the most commonly studied categories. Numerous new NER techniques have been developed to accommodate the needs of the applications involved. However, most research works have focused on the non-fiction domain. The fiction domain exhibits complexity and uncertainty in locating the protagonist, as names of persons span a diverse spectrum, ranging from living things (animals, plants, people) to non-living things (vehicles, furniture). This paper proposes automated protagonist identification in the fiction domain, particularly in fairy tales. The verb is used as a determinant in substantiating the existence of a protagonist, with the assistance of WordNet. The experimental results show that it is viable to use verbs in identifying named entities, particularly the "people" category, and that the approach can be applied in a small-text environment.
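A minimal version of the verb-as-determinant idea: a noun that acts as the subject of an "animate" verb is promoted to a protagonist candidate. The tag set, the toy animate-verb list (standing in for the WordNet lookup the paper relies on), and the adjacency heuristic are all simplifying assumptions of this sketch.

```python
# Toy protagonist-candidate extraction over pre-tagged tokens: a noun
# immediately followed by a verb that implies an animate agent is kept.

ANIMATE_VERBS = {"said", "ran", "wept", "laughed", "asked"}

def protagonist_candidates(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs, e.g. tag 'NOUN'/'VERB'.

    Returns nouns immediately followed by an animate verb.
    """
    candidates = []
    for (word, tag), (nxt, nxt_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "NOUN" and nxt_tag == "VERB" and nxt in ANIMATE_VERBS:
            candidates.append(word)
    return candidates
```

This mirrors why the approach works for fairy tales: "the fox said" marks the fox as a protagonist even though a gazetteer of person names never would.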

15 citations

Book ChapterDOI
01 Jan 2019
TL;DR: None of the KGs can be considered complete on its own with regard to the relations of an entity; Wikidata scores the highest in terms of the timeliness of the data provided, owing to continuous updates by its global community.
Abstract: Knowledge graphs serve as the primary sources of structured data in many Semantic Web applications. In this paper, the three most popular cross-domain knowledge graphs (KGs), namely DBpedia, YAGO, and Wikidata, were empirically explored and compared. These knowledge graphs were compared from the perspectives of completeness of the relations, timeliness of the data, and accessibility of the KG. Three fundamental categories of named entities were queried within the KGs for detailed analysis of the data returned. From the experimental results and findings, Wikidata scores the highest in terms of the timeliness of the data provided, owing to the effort of global community update, with DBpedia LIVE being the next. Regarding accessibility, it was observed that DBpedia and Wikidata provide continuous access through public SPARQL endpoints, while the YAGO endpoints were intermittently inaccessible. With respect to completeness of predicates, none of the KGs has a remarkable lead for any of the selected categories. From the analysis, it is observed that none of the KGs can be considered complete on its own with regard to the relations of an entity.
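The completeness comparison above can be made concrete as a coverage ratio: for one entity, pool the relations (predicates) that all KGs return and measure what fraction each KG covers. The sample relation sets in the test are invented for illustration, not actual query results from DBpedia, YAGO, or Wikidata.

```python
# Back-of-the-envelope relation-completeness comparison across KGs.

def relation_coverage(kg_relations):
    """kg_relations: dict mapping KG name -> set of relations for one entity.

    Returns a dict mapping KG name -> fraction of the pooled relation
    set (the union across all KGs) that this KG covers.
    """
    union = set().union(*kg_relations.values())
    return {kg: len(rels) / len(union) for kg, rels in kg_relations.items()}
```

A coverage below 1.0 for every KG is exactly the paper's conclusion that no single KG is complete on its own for an entity's relations.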

13 citations


Cited by

Posted Content
TL;DR: A structured and comprehensive overview of research methods in deep learning-based anomaly detection, grouping state-of-the-art techniques into categories based on the underlying assumptions and approach adopted.
Abstract: Anomaly detection is an important problem that has been well-studied within diverse research areas and application domains. The aim of this survey is two-fold: firstly, we present a structured and comprehensive overview of research methods in deep learning-based anomaly detection. Furthermore, we review the adoption of these methods for anomaly detection across various application domains and assess their effectiveness. We have grouped state-of-the-art research techniques into different categories based on the underlying assumptions and approach adopted. Within each category we outline the basic anomaly detection technique, along with its variants, and present the key assumptions used to differentiate between normal and anomalous behavior. For each category, we also present the advantages and limitations and discuss the computational complexity of the techniques in real application domains. Finally, we outline open issues in research and challenges faced while adopting these techniques.

522 citations

Journal ArticleDOI
TL;DR: This work constructed a technology semantic network (TechNet) that covers the elemental concepts in all domains of technology and their semantic associations by mining the complete U.S. patent database since 1976.
Abstract: The growing developments in general semantic networks, knowledge graphs and ontology databases have motivated us to build a large-scale comprehensive semantic network of technology-related data for engineering knowledge discovery, technology search and retrieval, and artificial intelligence for engineering design and innovation. Specifically, we constructed a technology semantic network (TechNet) that covers the elemental concepts in all domains of technology and their semantic associations by mining the complete U.S. patent database since 1976. To derive the TechNet, natural language processing techniques were utilized to extract terms from massive patent texts, and recent word embedding algorithms were employed to vectorize such terms and establish their semantic relationships. We report and evaluate the TechNet for retrieving terms and their pairwise relevance that is meaningful from a technology and engineering design perspective. The TechNet may serve as an infrastructure to support a wide range of applications, e.g., technical text summaries, search query predictions, relational knowledge discovery, and design ideation support, in the context of engineering and technology, and complement or enrich existing semantic databases. To enable such applications, the TechNet is made public via an online interface and APIs for public users to retrieve technology-related terms and their relevancies.
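For an embedding-based network like the one described above, pairwise term relevance typically reduces to a similarity between term vectors, with cosine similarity being the usual choice. The vectors in the test are placeholders, not actual TechNet embeddings.

```python
# Cosine similarity: the standard pairwise-relevance measure between
# term vectors produced by word embedding algorithms.

import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Ranking all terms by cosine similarity to a query term is what turns a set of vectors into a retrieval service like the one TechNet exposes.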

106 citations

Journal ArticleDOI
TL;DR: The findings offer an additional key to interpret public perception and response to the current global health emergency and raise questions about the effects of attention saturation on people’s collective awareness and risk perception and thus on their tendencies toward behavioral change.
Abstract: Background: The exposure and consumption of information during epidemic outbreaks may alter people’s risk perception and trigger behavioral changes, which can ultimately affect the evolution of the disease. It is thus of utmost importance to map the dissemination of information by mainstream media outlets and the public response to this information. However, our understanding of this exposure-response dynamic during the COVID-19 pandemic is still limited. Objective: The goal of this study is to characterize the media coverage and collective internet response to the COVID-19 pandemic in four countries: Italy, the United Kingdom, the United States, and Canada. Methods: We collected a heterogeneous data set including 227,768 web-based news articles and 13,448 YouTube videos published by mainstream media outlets, 107,898 user posts and 3,829,309 comments on the social media platform Reddit, and 278,456,892 views of COVID-19–related Wikipedia pages. To analyze the relationship between media coverage, epidemic progression, and users’ collective web-based response, we considered a linear regression model that predicts the public response for each country given the amount of news exposure. We also applied topic modeling to the data set using nonnegative matrix factorization. Results: Our results show that public attention, quantified as user activity on Reddit and active searches on Wikipedia pages, is mainly driven by media coverage; meanwhile, this activity declines rapidly while news exposure and COVID-19 incidence remain high. Furthermore, using an unsupervised, dynamic topic modeling approach, we show that while the levels of attention dedicated to different topics by media outlets and internet users are in good accordance, interesting deviations emerge in their temporal patterns.
Conclusions: Overall, our findings offer an additional key to interpret public perception and response to the current global health emergency and raise questions about the effects of attention saturation on people’s collective awareness and risk perception and thus on their tendencies toward behavioral change.
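The regression step described in the Methods can be illustrated in its one-variable form: fitting public response as a linear function of news exposure by ordinary least squares. The numbers in the test are synthetic, not the study's data, and the study's actual model may include more covariates.

```python
# Ordinary least squares for a single predictor: response ~ slope * exposure + intercept.

def ols_fit(x, y):
    """Return (slope, intercept) minimizing squared error of y ~ a*x + b."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx
```

Fitting this per country and inspecting the residuals is one way to see attention declining while exposure stays high, as the Results describe.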

86 citations

Journal ArticleDOI
TL;DR: In-depth results show which deep learning models can be most effective against cyberbullying when directly compared with others, paving the way for future hybrid technologies that may be employed to combat this serious online issue.
Abstract: Cyberbullying is disturbing and troubling online misconduct. It appears in various forms and is usually in a textual format in most social networks. Intelligent systems are necessary for automated detection of these incidents. Some of the recent experiments have tackled this issue with traditional machine learning models, and most of the models have been applied to one social network at a time. The latest research has seen different models based on deep learning algorithms make an impact on the detection of cyberbullying. Some of these detection mechanisms efficiently identify incidents, while others suffer from the limitations of standard identification approaches. This paper performs an empirical analysis to determine the effectiveness and performance of deep learning algorithms in detecting insults in Social Commentary. The following four deep learning models were used in the experiments: Bidirectional Long Short-Term Memory (BLSTM), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Recurrent Neural Network (RNN). Data pre-processing steps were followed, including text cleaning, tokenization, stemming, lemmatization, and removal of stop words. After pre-processing, the clean textual data is passed to the deep learning algorithms for prediction. The results show that the BLSTM model achieved high accuracy and F1-measure scores in comparison to RNN, LSTM, and GRU. Our in-depth results show which deep learning models can be most effective against cyberbullying when directly compared with others, and pave the way for future hybrid technologies that may be employed to combat this serious online issue.
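The pre-processing chain listed above (cleaning, tokenization, stop-word removal, stemming/lemmatization) can be sketched with the standard library alone. The stop-word list is a tiny sample, and the suffix rule is a deliberately crude stand-in for the stemmers and lemmatizers (e.g., from NLTK or spaCy) that a real pipeline would use.

```python
# Minimal text pre-processing pipeline: cleaning, tokenization,
# stop-word removal, and a crude stemming step.

import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)                 # cleaning
    tokens = text.split()                                  # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t  # crude stemming
            for t in tokens]
```

The resulting token lists are what would be converted to embedding sequences and fed to the BLSTM, GRU, LSTM, or RNN models compared in the paper.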

63 citations