Proceedings ArticleDOI

Deep Learning Based Unsupervised POS Tagging for Sanskrit

21 Dec 2018

TL;DR: A deep learning based, unsupervised approach that assigns POS tags to words in input text, trained on the untagged Sanskrit corpus prepared by JNU and using the JNU tagged corpus for tag assignment and for measuring model accuracy.

Abstract: In this paper, we present a deep learning based approach to assign POS tags to words in a piece of text given as input. We propose an unsupervised approach owing to the lack of a large annotated Sanskrit corpus and use the untagged Sanskrit corpus prepared by JNU for this purpose. The only tagged corpus for Sanskrit, also created by JNU, contains 115,000 words, which is not sufficient for supervised deep learning approaches; we use this tagged corpus for tag assignment and for determining model accuracy. We explore various methods through which each Sanskrit word can be represented as a point in a multi-dimensional vector space whose position captures its meaning and associated semantic information. We also explore other data sources to improve the performance and robustness of the vector representations. We then apply autoencoder based approaches for dimensionality reduction to compress these rich representations into encodings suitable for clustering in the vector space. We experiment with different dimensions of these compressed representations and present the one that offered the best clustering performance. To model the sequence and preserve semantic information, we feed these embeddings to a bidirectional LSTM autoencoder. We assign a POS tag to each of the obtained clusters and produce our result by testing the model on the tagged corpus.
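The pipeline described in the abstract (embedding, autoencoder compression, clustering, and cluster-to-tag assignment) can be illustrated with a short sketch. The snippet below is a minimal illustration only, not the authors' code: file names, layer sizes, and the number of clusters are assumptions, and the bidirectional LSTM sequence-modelling step is omitted.

```python
# Minimal sketch (not the paper's code): compress word embeddings with a small
# autoencoder, cluster the compressed codes, and map each cluster to a POS tag by
# majority vote on a small tagged corpus. File names and sizes are assumptions.
import numpy as np
from collections import Counter, defaultdict
from sklearn.cluster import KMeans
from tensorflow.keras import layers, models

embeddings = np.load("sanskrit_word_vectors.npy")            # (vocab_size, 300), hypothetical
words = open("vocab.txt", encoding="utf-8").read().split()   # one entry per embedding row

# Autoencoder for dimensionality reduction (300 -> 32 -> 300; sizes illustrative).
inp = layers.Input(shape=(embeddings.shape[1],))
code = layers.Dense(32, activation="relu")(inp)
out = layers.Dense(embeddings.shape[1], activation="linear")(code)
autoencoder = models.Model(inp, out)
encoder = models.Model(inp, code)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, verbose=0)

# Cluster the compressed representations; one cluster per induced POS category.
codes = encoder.predict(embeddings, verbose=0)
clusters = KMeans(n_clusters=30, random_state=0).fit_predict(codes)
word2cluster = dict(zip(words, clusters))

# Assign a tag to each cluster by majority vote over the small tagged corpus.
votes = defaultdict(Counter)
for line in open("jnu_tagged.tsv", encoding="utf-8"):         # "word<TAB>tag" per line, assumed
    parts = line.strip().split("\t")
    if len(parts) == 2 and parts[0] in word2cluster:
        votes[word2cluster[parts[0]]][parts[1]] += 1
cluster2tag = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

def predict_tag(word):
    """Tag an in-vocabulary word with its cluster's majority label."""
    return cluster2tag.get(word2cluster.get(word), "UNK")
```

Any in-vocabulary word is then tagged with the majority label of the cluster its compressed embedding falls into.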



Citations
Book ChapterDOI
01 Jan 2020
TL;DR: Here, 328 Sanskrit words are tested through four morphological analyzers, namely Samsaadhanii, the morphological analyzers by JNU and TDIL (both of which are available online), and the locally developed and installed Sanguj morphological analyzer.
Abstract: In linguistics, morphology is the study of words, word formation, and their analysis and generation. A morphological analyzer is a tool for understanding the grammatical characteristics and part-of-speech information of a word's constituents. It is useful in many NLP applications such as syntactic parsing, spell checking, information retrieval, and machine translation. Here, 328 Sanskrit words are tested through four morphological analyzers, namely Samsaadhanii, the morphological analyzers by JNU and TDIL (both of which are available online), and the locally developed and installed Sanguj morphological analyzer. There is a negligible divergence in the reflected results.

1 citation

Proceedings ArticleDOI
09 Oct 2020
TL;DR: A model is advocated that employs a deep learning method to train an LSTM (Long Short Term Memory) neural network over a massive data set to perform the necessary categorisation, using context-based representations of the data obtained through Word2Vec along with the TensorFlow and Keras packages.
Abstract: Language is the most fundamental and natural means of communication, and grammar plays a critical role in the quality of a language. Humans acquire grammatical knowledge gradually, through rules and constraints of meaning that allow us to comprehend and interact with one another. Transferring such awareness to a computer, so that it can interpret contextual evidence, classify it into a proper syntactic form, and thereby validate that the information is well formed, is a sophisticated task and an important one at present. The paper addresses this issue and presents a grammar verification mechanism for the Dravidian language Kannada. The intricacy of the language poses a problem: a rule-based approach is the easier route and can identify detected flaws competently, but it takes a linguistic specialist to compile hundreds of parallel rules, which are difficult to maintain. Here, a model is advocated that employs a deep learning method to train an LSTM (Long Short Term Memory) neural network over a massive data set to perform the necessary categorisation, using context-based representations of the data obtained through Word2Vec along with the TensorFlow and Keras packages. The proposed system is able to perform Grammatical Error Detection (GED) effectively.
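As a rough illustration of the approach summarized above (Word2Vec features feeding an LSTM classifier built with TensorFlow/Keras), a minimal sketch follows. It is not the cited system: the corpus files, vector size, sequence length, and network sizes are assumptions, and the gensim 4.x API is used for Word2Vec.

```python
# Rough sketch of an LSTM grammar-error classifier of the kind described above,
# not the cited system: Word2Vec sentence features feed a Keras LSTM that labels
# a sentence as grammatical or not. Paths, sizes, and labels are assumptions.
import numpy as np
from gensim.models import Word2Vec          # gensim 4.x API
from tensorflow.keras import layers, models

sentences = [line.split() for line in open("kannada_sentences.txt", encoding="utf-8")]
labels = np.loadtxt("labels.txt")           # 1 = grammatical, 0 = erroneous (assumed format)

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

MAX_LEN, DIM = 30, 100
def encode(sent):
    """Pad or truncate a sentence to MAX_LEN word vectors."""
    vecs = [w2v.wv[w] for w in sent[:MAX_LEN] if w in w2v.wv]
    vecs += [np.zeros(DIM, dtype="float32")] * (MAX_LEN - len(vecs))
    return np.stack(vecs)

X = np.stack([encode(s) for s in sentences])

inp = layers.Input(shape=(MAX_LEN, DIM))
hidden = layers.LSTM(64)(inp)
out = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, batch_size=32, validation_split=0.1)
```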

Cites background from "Deep Learning Based Unsupervised PO..."

  • ...And delegate the POS tag to the received cluster and refer the model to the label lexicon to yield the result[4]....


  • ...Insufficient flagged corpus for languages such as Sanskrit, Kannada and other resource-poor languages is a hindrance to all of these modern supervised learning algorithms, which typically involve rich data sets [4]....


  • ...has been shown to be beneficial in the NLP application[4]....


Journal ArticleDOI
TL;DR: In this paper, a deep neural network model was proposed to improve the accuracy of parts-of-speech tagging in low-resource languages, such as Assamese and English.
Abstract: Over the years, many different algorithms have been proposed to improve the accuracy of automatic parts-of-speech tagging. High tagging accuracy is very important for any NLP application. Powerful models like the Hidden Markov Model (HMM) used for this purpose require a huge amount of training data and are also less accurate at detecting unknown (untrained) words. Most of the world's languages lack enough resources in computable form to train such models, and NLP applications for these languages also encounter many unknown words during execution, which results in low accuracy. Improving accuracy for such low-resource languages is an open problem. In this paper, one stochastic method and a deep learning model are proposed to improve accuracy for such languages. The proposed language-independent methods improve unknown-word accuracy and overall accuracy with a small amount of training data. First, character bigrams and trigrams that already occur in the training samples are used to calculate the maximum likelihood for tagging unknown words with the Viterbi algorithm and HMM. With training datasets below the size of 10K, an accuracy improvement of 12% to 14% has been achieved. Next, a deep neural network model is proposed to work with very little training data; it combines word-level, character-level, character-bigram-level, and character-trigram-level representations to perform parts-of-speech tagging. The model improves the overall accuracy of the tagger as well as the accuracy on unknown words. Results for English and the low-resource Indian language Assamese are discussed in detail. Performance is better than many state-of-the-art techniques for low-resource languages, and the method is generic and can be used with any language with very little training data.
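The unknown-word idea described above (backing off to character bigram and trigram statistics when a word is missing from the HMM's emission table) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the data format and the deliberately crude additive smoothing are assumptions, and the full Viterbi decoder is omitted.

```python
# Illustrative reconstruction (not the paper's code) of a character n-gram back-off:
# collect per-tag character bigram/trigram counts from the tagged training data, then
# score an unseen word under each tag from its n-grams.
import math
from collections import Counter, defaultdict

def char_ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def train_char_backoff(tagged_words):
    """tagged_words: iterable of (word, tag) pairs from the training corpus."""
    ngram_counts = defaultdict(Counter)     # tag -> Counter of character n-grams
    tag_totals = Counter()                  # tag -> total n-gram count
    for word, tag in tagged_words:
        for n in (2, 3):
            for g in char_ngrams(word, n):
                ngram_counts[tag][g] += 1
                tag_totals[tag] += 1
    return ngram_counts, tag_totals

def unknown_emission_logprob(word, tag, ngram_counts, tag_totals, alpha=1e-3):
    """Approximate log P(word | tag) for a word absent from the emission table."""
    logp = 0.0
    for n in (2, 3):
        for g in char_ngrams(word, n):
            logp += math.log((ngram_counts[tag][g] + alpha) / (tag_totals[tag] + alpha))
    return logp
```

During Viterbi decoding, such a log-probability would simply stand in for the missing emission score of an out-of-vocabulary word.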

References
Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

23,982 citations
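The two ingredients highlighted above, phrase detection and skip-gram training with negative sampling and frequent-word subsampling, are available off the shelf in gensim; a hedged sketch follows. The corpus path and all hyperparameters are assumptions, and the gensim 4.x API is used.

```python
# Hedged sketch, not the original implementation: gensim's Phrases model merges frequent
# collocations (e.g. "air canada" -> "air_canada"), and Word2Vec with sg=1 and negative>0
# trains a skip-gram model with negative sampling and frequent-word subsampling.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [line.lower().split() for line in open("corpus.txt", encoding="utf-8")]

bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
phrased = [bigram[s] for s in sentences]

model = Word2Vec(
    phrased,
    vector_size=300,
    window=5,
    sg=1,           # skip-gram rather than CBOW
    negative=10,    # negative sampling instead of hierarchical softmax
    sample=1e-5,    # aggressive subsampling of frequent words
    workers=4,
)

# Nearest neighbours of the most frequent token, as a quick sanity check.
print(model.wv.most_similar(model.wv.index_to_key[0], topn=5))
```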


"Deep Learning Based Unsupervised PO..." refers methods in this paper

  • ...The skipgram approach as in [10] is based on a log bilinear model which is trained to predict an unordered set of context words given a center word of a training context window....


Journal ArticleDOI
TL;DR: This paper proposes a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams; a word's vector is the sum of its n-gram representations, which allows models to be trained on large corpora quickly and word representations to be computed for words that did not appear in the training data.
Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

6,288 citations
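gensim's FastText class implements the character n-gram model summarized above, so a brief sketch can show how vectors remain available for words unseen in training. The corpus path, n-gram range, and dimensions below are assumptions (gensim 4.x API).

```python
# Brief sketch (assumed file names): FastText learns vectors for character n-grams,
# so a representation can still be composed for a word that never occurred in the
# training corpus.
from gensim.models import FastText

sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]
model = FastText(sentences, vector_size=100, window=5, min_count=2,
                 sg=1, min_n=3, max_n=6)      # 3- to 6-character n-grams, skip-gram

oov_vector = model.wv["completelyunseenword"]  # built from its character n-grams
print(oov_vector.shape)                        # a 100-dimensional vector is still available
```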

Proceedings ArticleDOI
27 Mar 1995
TL;DR: The authors present an algorithm for tagging words whose part-of-speech properties are unknown; it categorizes word tokens in context instead of word types and is evaluated on the Brown Corpus.
Abstract: This paper presents an algorithm for tagging words whose part-of-speech properties are unknown. Unlike previous work, the algorithm categorizes word tokens in context instead of word types. The algorithm is evaluated on the Brown Corpus.

247 citations
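A toy version of the token-in-context idea above can be sketched by describing each token through its left and right neighbours (restricted to frequent words) and clustering the resulting context vectors. This is only a simplified illustration of the general approach, not the cited algorithm; the feature-word count, cluster count, and corpus path are assumptions.

```python
# Simplified illustration (not the cited algorithm): each token gets a context vector
# built from indicator features for its left and right neighbours among the 250 most
# frequent words, and the token vectors are clustered.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

tokens = open("corpus.txt", encoding="utf-8").read().split()
feature_words = [w for w, _ in Counter(tokens).most_common(250)]
col = {w: i for i, w in enumerate(feature_words)}

def context_vector(i):
    v = np.zeros(2 * len(feature_words), dtype="float32")
    if i > 0 and tokens[i - 1] in col:
        v[col[tokens[i - 1]]] = 1.0                        # left neighbour
    if i + 1 < len(tokens) and tokens[i + 1] in col:
        v[len(feature_words) + col[tokens[i + 1]]] = 1.0   # right neighbour
    return v

X = np.stack([context_vector(i) for i in range(len(tokens))])
token_clusters = KMeans(n_clusters=20, random_state=0).fit_predict(X)
```

Because tokens rather than types are clustered, the same word form can land in different clusters depending on its context, which is the point of the cited work.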

Proceedings ArticleDOI
01 Jan 2015
TL;DR: This paper shows that word embeddings can also add value to the problem of unsupervised POS induction by replacing multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings, observing consistent improvements in eight languages.
Abstract: Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on “downstream” POS induction results.

66 citations
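A loose analogue of the idea above is to model each induced POS class as a Gaussian over word embeddings. The sketch below fits a Gaussian mixture to pretrained vectors as a simplified stand-in for the cited paper's sequence models; the file names, number of components, and covariance type are assumptions.

```python
# Loose analogue only: fit one Gaussian per latent class over pretrained word
# embeddings with a Gaussian mixture, then inspect the induced classes.
import numpy as np
from sklearn.mixture import GaussianMixture

vectors = np.load("word_vectors.npy")                        # (vocab_size, dim), hypothetical
words = open("vocab.txt", encoding="utf-8").read().split()

gmm = GaussianMixture(n_components=12, covariance_type="diag", random_state=0)
classes = gmm.fit_predict(vectors)                           # one induced class per word type

for k in range(12):
    members = [w for w, c in zip(words, classes) if c == k][:10]
    print(k, members)                                        # sample members of each class
```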

Posted Content
TL;DR: Competitive results are shown with instantiations of the framework for unsupervised learning of structured predictors with overlapping, global features, and training the proposed model is shown to be substantially more efficient than comparable feature-rich baselines.
Abstract: We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is (re)generated, conditional on the latent structure, using models for which maximum likelihood estimation has a closed-form. Our autoencoder formulation enables efficient learning without making unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. We show competitive results with instantiations of the model for two canonical NLP tasks: part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.

62 citations


"Deep Learning Based Unsupervised PO..." refers methods in this paper

  • ...In a paper [3], a Conditional Random Field Autoencoder was used to predict the latent representation of the input....
