Tokenization (data security)

About: Tokenization (data security) is a research topic. Over its lifetime, 980 publications have been published within this topic, receiving 16,484 citations. The topic is also known as: tokenisation.


Open access | Proceedings Article | DOI: 10.3115/1219840.1219911
Nizar Habash, Owen Rambow
25 Jun 2005
Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

486 Citations

Open access | Proceedings Article | DOI: 10.18653/V1/2020.ACL-DEMOS.14
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, +1 more
16 Mar 2020
Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

424 Citations

Open access | Proceedings Article | DOI: 10.18653/V1/S17-2126
01 Aug 2017
Abstract: In this paper we present two deep-learning systems that competed at SemEval-2017 Task 4 “Sentiment Analysis in Twitter”. We participated in all subtasks for English tweets, involving message-level and topic-based sentiment polarity classification and quantification. We use Long Short-Term Memory (LSTM) networks augmented with two kinds of attention mechanisms, on top of word embeddings pre-trained on a big collection of Twitter messages. Also, we present a text processing tool suitable for social network messages, which performs tokenization, word normalization, segmentation and spell correction. Moreover, our approach uses no hand-crafted features or sentiment lexicons. We ranked 1st (tie) in Subtask A, and achieved very competitive results in the rest of the Subtasks. Both the word embeddings and our text processing tool are available to the research community.

Topics: Sentiment analysis (62%), SemEval (56%), Lexical analysis (52%)

351 Citations

Open access | Proceedings Article | DOI: 10.3115/1219840.1219892
Shubin Zhao, Ralph Grishman
25 Jun 2005
Abstract: Entity relation detection is a form of information extraction that finds predefined relations between pairs of entities in text. This paper describes a relation detection approach that combines clues from different levels of syntactic processing using kernel methods. Information from three different levels of processing is considered: tokenization, sentence parsing and deep dependency analysis. Each source of information is represented by kernel functions. Then composite kernels are developed to integrate and extend individual kernels so that processing errors occurring at one level can be overcome by information from other levels. We present an evaluation of these methods on the 2004 ACE relation detection task, using Support Vector Machines, and show that each level of syntactic processing contributes useful information for this task. When evaluated on the official test data, our approach produced very competitive ACE value scores. We also compare the SVM with KNN on different kernels.

Topics: Tree kernel (63%), Information extraction (58%), Kernel method (58%)

344 Citations

Journal Article | DOI: 10.1023/B:INRT.0000009441.78971.BE
Paul McNamee, James Mayfield
Abstract: The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.

341 Citations
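
Example: the overlapping character n-gram tokenization named in the McNamee and Mayfield abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `char_ngrams`, the lowercasing and whitespace collapsing, the choice to let n-grams span word boundaries, and the handling of inputs shorter than n are all assumptions made for illustration; only the sliding-window idea and the default of 4 come from the abstract.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return overlapping character n-grams, sliding one character at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Assumed fallback: text shorter than n is kept as a single token.
    if len(normalized) < n:
        return [normalized]
    # Each window starts one character after the previous one, so
    # consecutive n-grams overlap in n - 1 characters.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("Token text"):
        print(repr(gram))
```

A string of length m yields m - n + 1 n-grams, which is one way to see why the abstract mentions increased storage and time requirements compared with word-based indexing.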

Top Attributes

Topic's top 5 most impactful authors

Ulf Mattsson

13 papers, 234 citations

Nizar Habash

8 papers, 653 citations

Naoaki Okazaki

4 papers, 7 citations

Yigal Rozenberg

4 papers, 24 citations

Thamar Solorio

3 papers, 12 citations

Network Information
Related Topics (5)
Sentiment analysis

22.1K papers, 460.8K citations

88% related
Language model

17.5K papers, 545K citations

88% related

6K papers, 230K citations

88% related
Question answering

14K papers, 375.4K citations

87% related
Machine translation

22.1K papers, 574.4K citations

86% related