Author

Thamar Solorio

Bio: Thamar Solorio is an academic researcher from the University of Houston. The author has contributed to research in topics such as named-entity recognition and task (project management). The author has an h-index of 30 and has co-authored 175 publications receiving 3,308 citations. Previous affiliations of Thamar Solorio include the University of Alabama at Birmingham and the National University of Colombia.


Papers
Proceedings ArticleDOI
08 Oct 2010
TL;DR: This paper explores the possibility of utilizing confidence-weighted classification combined with content-based phishing URL detection to produce a dynamic and extensible system for detection of present and emerging types of phishing domains.
Abstract: Phishing is a form of cybercrime where spammed emails and fraudulent websites entice victims to provide sensitive information to the phishers. The acquired sensitive information is subsequently used to steal identities or gain access to money. This paper explores the possibility of utilizing confidence-weighted classification combined with content-based phishing URL detection to produce a dynamic and extensible system for detection of present and emerging types of phishing domains. Our system is capable of detecting emerging threats as they appear and can subsequently provide increased protection against zero-hour threats, unlike traditional blacklisting techniques, which function reactively.
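The content-based URL detection described in this abstract works on lexical properties of the URL string itself. A minimal sketch of that kind of feature extraction is below; the specific feature names are illustrative rather than taken from the paper, and the confidence-weighted classifier that would consume these features is omitted:

```python
# Hypothetical sketch: lexical features from a URL, of the kind used in
# content-based phishing URL detection. Feature choices here are
# illustrative, not the paper's actual feature set.
from urllib.parse import urlparse

def url_features(url):
    """Extract simple lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),                          # long URLs are suspicious
        "num_dots": host.count("."),                     # many subdomains
        "num_hyphens": host.count("-"),                  # brand-spoofing hosts
        "has_ip": host.replace(".", "").isdigit(),       # raw-IP host
        "has_at": "@" in url,                            # userinfo trick
        "path_depth": parsed.path.count("/"),            # deep, obfuscated paths
    }

feats = url_features("http://paypal.secure-login.example.com/verify/account")
```

An online learner (the paper's confidence-weighted classifier, or any other incremental model) could be updated on such feature vectors as new phishing domains are observed, which is what makes the approach adaptive compared with static blacklists.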

196 citations

Proceedings ArticleDOI
01 Oct 2014
TL;DR: The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.
Abstract: We present an overview of the first shared task on language identification on code-switched data. The shared task included code-switched data from four language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA), Mandarin-English (MAN-EN), Nepali-English (NEP-EN), and Spanish-English (SPA-EN). A total of seven teams participated in the task and submitted 42 system runs. The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs. In contrast, the language pairs with the highest F-measure were SPA-EN and NEP-EN. The task made evident that language identification in code-switched data is still far from solved and warrants further research.
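Token-level language identification means assigning a language label to every token in a mixed-language sentence. A toy sketch below uses tiny hand-made wordlists (real shared-task systems used supervised classifiers over character n-gram and contextual features); it also shows why closely related pairs like MSA-DA are hard, since shared vocabulary falls into the ambiguous case:

```python
# Hypothetical toy sketch of token-level language identification on
# code-switched Spanish-English text. The wordlists are illustrative;
# actual shared-task systems were trained classifiers.
SPA = {"no", "quiero", "ir", "a", "la", "escuela"}
ENG = {"but", "i", "have", "to", "go", "today"}

def label_tokens(tokens):
    labels = []
    for tok in tokens:
        t = tok.lower()
        if t in SPA and t not in ENG:
            labels.append("spa")
        elif t in ENG and t not in SPA:
            labels.append("eng")
        else:
            # words shared by both languages cannot be resolved without
            # context -- the situation that makes related pairs harder
            labels.append("ambiguous")
    return labels

labels = label_tokens("No quiero ir but I have to go".split())
```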

174 citations

Proceedings ArticleDOI
01 Jan 2015
TL;DR: It is demonstrated that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features.
Abstract: Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
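The core idea of splitting character n-grams into subgroups can be sketched as follows. The two categories below (punctuation-bearing and word-boundary/affix-like) are a simplification of the paper's finer-grained taxonomy, chosen only to illustrate the mechanism:

```python
# Hypothetical sketch of categorizing character n-grams into subgroups,
# in the spirit of the paper's analysis. The category definitions here
# are simplified and illustrative.
import string

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def categorize(ngram):
    if any(c in string.punctuation for c in ngram):
        return "punct"           # punctuation-bearing n-gram
    if ngram.startswith(" ") or ngram.endswith(" "):
        return "affix"           # touches a word boundary: prefix/suffix-like
    return "word-internal"       # thematic/content-bearing middle of a word
```

Training an attribution classifier on only the "punct" and "affix" groups, versus all n-grams, is the kind of comparison the paper's finding rests on.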

165 citations

Proceedings ArticleDOI
01 Apr 2017
TL;DR: A model to perform authorship attribution of tweets using Convolutional Neural Networks over character n-grams and a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification are presented.
Abstract: We present a model to perform authorship attribution of tweets using Convolutional Neural Networks (CNNs) over character n-grams. We also present a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification. The experimental evaluation shows that text CNNs perform competitively and are able to outperform previous methods.
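The core operation of such a model is a 1-D convolution sliding over character embeddings. A self-contained NumPy sketch of that operation is below; the embedding values, filter weights, and dimensions are random stand-ins for what the trained network would learn, and the real model adds a softmax classification layer on top:

```python
# Hypothetical NumPy sketch of a character-level 1-D convolution with
# ReLU and max-over-time pooling, the building block of a text CNN.
# All weights are random placeholders, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
emb = rng.normal(size=(len(vocab), 8))    # one 8-dim embedding per character
filters = rng.normal(size=(16, 3, 8))     # 16 filters of width 3

def conv_features(text):
    x = emb[[vocab[c] for c in text]]                               # (L, 8)
    windows = np.stack([x[i:i + 3] for i in range(len(text) - 2)])  # (L-2, 3, 8)
    conv = np.einsum("lwd,fwd->lf", windows, filters)               # (L-2, 16)
    return np.maximum(conv, 0).max(axis=0)  # ReLU, then max-pool over time

vec = conv_features("the quick brown fox")  # fixed-size 16-dim text vector
```

The fixed-size pooled vector is what a final linear layer would map to author labels; interpretability methods like the one in the paper trace which input character windows most influenced that prediction.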

158 citations

Posted Content
TL;DR: In this paper, a novel model for multimodal learning based on gated neural networks is presented, which is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities.
Abstract: This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU learns to decide how modalities influence the activation of the unit using multiplicative gates. It was evaluated on a multilabel scenario for genre classification of movies using the plot and the poster. The GMU improved the macro F-score of single-modality approaches and outperformed other fusion strategies, including mixture-of-experts models. Along with this work, we release the MM-IMDb dataset, which, to the best of our knowledge, is the largest publicly available multimodal dataset for movie genre prediction.
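The gating mechanism described above can be sketched for two modalities: each modality gets its own nonlinear projection, and a learned sigmoid gate decides, per hidden unit, how much each modality contributes. The NumPy sketch below uses random weights as stand-ins for learned parameters and assumes the two-modality form of the unit:

```python
# Minimal NumPy sketch of a two-modality Gated Multimodal Unit.
# Weights are random stand-ins for what the network would learn.
import numpy as np

rng = np.random.default_rng(0)
d_v, d_t, d_h = 10, 20, 8          # visual dim, textual dim, hidden dim
W_v = rng.normal(size=(d_h, d_v))
W_t = rng.normal(size=(d_h, d_t))
W_z = rng.normal(size=(d_h, d_v + d_t))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gmu(x_v, x_t):
    h_v = np.tanh(W_v @ x_v)                        # visual representation
    h_t = np.tanh(W_t @ x_t)                        # textual representation
    z = sigmoid(W_z @ np.concatenate([x_v, x_t]))   # per-unit modality gate
    return z * h_v + (1.0 - z) * h_t                # gated combination

h = gmu(rng.normal(size=d_v), rng.normal(size=d_t))
```

Because the gate z is computed from both inputs, the unit can learn, per example, to lean on the poster for some genres and the plot for others, which is what distinguishes it from fixed concatenation or averaging.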

153 citations


Cited by
Journal ArticleDOI
TL;DR: This work first classifies deep multimodal learning architectures and then discusses methods to fuse learned multimodal representations in deep-learning architectures.
Abstract: The success of deep learning has been a catalyst to solving increasingly complex machine-learning problems, which often involve multiple data modalities. We review recent advances in deep multimodal learning and highlight the state of the art, as well as gaps and challenges in this active research field. We first classify deep multimodal learning architectures and then discuss methods to fuse learned multimodal representations in deep-learning architectures. We highlight two areas of research, regularization strategies and methods that learn or optimize multimodal fusion structures, as exciting areas for future work.

529 citations