Towards filtering undesired short text messages using an online learning approach with semantic indexing

doi:10.1016/J.ESWA.2017.04.055

Journal Article

Renato Moraes Silva, +3 more

15 Oct 2017

Expert Systems With Applications

Vol. 83, pp. 314-325

TLDR

The paper proposes a hybrid ensemble approach that combines the predictions of classifiers applied to the original text samples with the predictions of classifiers applied to variations of those samples created by text normalization and semantic indexing. These two techniques can improve text content quality and enhance the performance of expert systems for spam detection.

Abstract:

A new classifier is presented to detect undesired short text comments. The proposed approach is light, fast, multinomial and offers incremental learning. The impact of applying text normalization and semantic indexing is studied. The results indicate the proposed techniques outperformed most of the compared approaches. Text normalization and semantic indexing enhanced the classifiers' performance.

The popularity and reach of short text messages in electronic communication have led spammers to use them to propagate undesired content. This content often consists of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning, so most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, to the context of filtering undesired short text messages. The proposed approach supports incremental learning, so its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. Besides being dynamic, these messages are short and usually poorly written, rife with slang, symbols, and abbreviations that hinder text representation, learning, and filtering. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We showed that these techniques can improve text content quality and, consequently, enhance the performance of expert systems for spam detection. Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantage of being independent of the classification method, and the results indicate that it is efficient at filtering undesired short text messages.
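The hybrid ensemble idea described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the `SLANG` and `CONCEPTS` lookup tables are hypothetical toy stand-ins for real normalization and semantic indexing resources, the predictions are combined by simple majority vote (the abstract does not state the combination rule), and the base learner is a small incremental multinomial naive Bayes, not MDLText. Because the ensemble is described as independent of the classification method, any learner exposing `update` and `predict` could be swapped in.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy resources; a real system would use proper lexical databases.
SLANG = {"u": "you", "ur": "your", "txt": "text", "2": "to", "pls": "please"}
CONCEPTS = {"win": ["gain"], "free": ["gratis"], "prize": ["award"]}


def original(text):
    """View 1: the raw message, lowercased and split on whitespace."""
    return text.lower().split()


def normalized(text):
    """View 2: slang and abbreviations mapped to standard words."""
    return [SLANG.get(tok, tok) for tok in original(text)]


def expanded(text):
    """View 3: normalized tokens plus related concepts (toy semantic indexing)."""
    tokens = normalized(text)
    return tokens + [c for tok in tokens for c in CONCEPTS.get(tok, [])]


class OnlineNaiveBayes:
    """Incremental multinomial naive Bayes. A stand-in learner, NOT MDLText."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, tokens, label):
        # One labeled sample at a time: only counters change, no retraining.
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = (sum(counts.values()) + len(self.vocab)) or 1
            # Log prior plus Laplace-smoothed log likelihood of each token.
            score = math.log(n / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best, best_score = label, score
        return best


class HybridEnsemble:
    """One learner per text view; predictions combined by majority vote."""

    def __init__(self, views):
        self.views = views
        self.models = [OnlineNaiveBayes() for _ in views]

    def update(self, text, label):
        for view, model in zip(self.views, self.models):
            model.update(view(text), label)

    def predict(self, text):
        votes = Counter(m.predict(v(text)) for v, m in zip(self.views, self.models))
        return votes.most_common(1)[0][0]


# Simulated message stream: the model is updated as each labeled message arrives.
ensemble = HybridEnsemble([original, normalized, expanded])
stream = [
    ("win a free prize now", "spam"),
    ("are you coming to lunch", "ham"),
    ("free txt 2 win cash", "spam"),
    ("see you at home tonight", "ham"),
]
for text, label in stream:
    ensemble.update(text, label)

print(ensemble.predict("txt 2 win ur free prize"))
```

Each `update` call touches only the counters for the tokens in one message, which mirrors the incremental, linear-cost property the abstract attributes to the proposed classifier, although the scoring rule here is naive Bayes rather than description length.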

Citations

Machine learning

Data Mining Practical Machine Learning Tools and Techniques

Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

Towards automatic filtering of fake reviews

Towards automatically filtering fake news in Portuguese

Related Papers (5)

Text normalization and semantic indexing to enhance Instant Messaging and SMS spam filtering

SMS spam filtering

End-to-end Learning for Short Text Expansion

Improving short text classification using public search engines

Language Detection For Short Text Messages In Social Media.