
Analysis of Sanskrit Text: Parsing and Semantic Relations


HAL Id: inria-00203459
https://hal.inria.fr/inria-00203459
Submitted on 10 Jan 2008
Analysis of Sanskrit text : parsing and semantic relations
Pawan Goyal, Vipul Arora, Laxmidhar Behera
To cite this version:
Pawan Goyal, Vipul Arora, Laxmidhar Behera. Analysis of Sanskrit text: parsing and semantic relations. First International Sanskrit Computational Linguistics Symposium, INRIA Paris-Rocquencourt, Oct 2007, Rocquencourt, France. ⟨inria-00203459⟩

ANALYSIS OF SANSKRIT TEXT: PARSING AND SEMANTIC
RELATIONS
Pawan Goyal
Electrical Engineering,
IIT Kanpur,
208016, UP,
India
pawangee@iitk.ac.in
Vipul Arora
Electrical Engineering,
IIT Kanpur,
208016, UP,
India
vipular@iitk.ac.in
Laxmidhar Behera
Electrical Engineering,
IIT Kanpur,
208016, UP,
India
lbehera@iitk.ac.in
Abstract
In this paper, we present our work towards building a dependency parser for the Sanskrit language that uses deterministic finite automata (DFA) for morphological analysis and the 'utsarga apavaada' approach for relation analysis. A computational grammar based on the framework of Panini is being developed. A linguistic generalization for the Verbal and Nominal databases has been made, and declensions are given the form of DFA. The Verbal database for all the classes of verbs has been completed for this part. Given a Sanskrit text, the parser identifies the root words and gives the dependency relations based on semantic constraints. The proposed Sanskrit parser is able to create semantic nets for many classes of Sanskrit paragraphs. The parser takes care of both external and internal sandhi in the Sanskrit words.
1 INTRODUCTION
Parsing is the "de-linearization" of linguistic input; that is, the use of grammatical rules and other knowledge sources to determine the functions of words in the input sentence. Getting an efficient and unambiguous parse of natural languages has been a subject of wide interest in the field of artificial intelligence over the past 50 years. Instead of providing a substantial amount of information manually, there has been a shift towards using machine learning algorithms in every possible NLP task. Among the most important elements in this toolkit are state machines, formal rule systems, logic, as well as probability theory and other machine learning tools. These models, in turn, lend themselves to a small number of algorithms from well-known computational paradigms. Among the most important of these are state space search algorithms (Bonet, 2001) and dynamic programming algorithms (Ferro, 1998). The need for unambiguous representation has led to a great effort in stochastic parsing (Ivanov, 2000).
Most of the research work has been done for English sentences, but to transmit ideas with great precision and mathematical rigor, we need a language that incorporates the features of artificial intelligence. Briggs (Briggs, 1985) demonstrated in his article the salient features of the Sanskrit language that can make it serve as an artificial language. Although computational processing of the Sanskrit language has been reported in the literature (Huet, 2005) with some computational toolkits (Huet, 2002), and there is work going on towards developing a mathematical model and dependency grammar of Sanskrit (Huet, 2006), the proposed Sanskrit parser is being developed for using the Sanskrit language as Indian networking language (INL). The utility of advanced techniques such as stochastic parsing and machine learning in designing a Sanskrit parser needs to be verified.
We have used deterministic finite automata for morphological analysis. We have identified the basic linguistic framework which shall facilitate the effective emergence of Sanskrit as INL. To achieve this goal, a computational grammar has been developed for the processing of the Sanskrit language. Sanskrit has a rich system of inflectional endings (vibhakti). The computational grammar described here takes the concepts of vibhakti and karaka relations from the Paninian framework and uses them to get an efficient parse for Sanskrit text. The grammar is written in the 'utsarga apavaada' approach, i.e. rules are arranged in several layers, each layer forming the exceptions to the previous one. We are working towards encoding Paninian grammar to get a robust analysis of Sanskrit sentences. The Paninian framework has been successfully applied to Indian languages for dependency grammars (Sangal, 1993), where constraint-based parsing is used and the mapping between karaka and vibhakti is via a TAM (tense, aspect, modality) table. We have made rules from Panini's grammar for the mapping. Also, finite state automata are used for the analysis instead of finite state transducers.
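The layered 'utsarga apavaada' arrangement can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the `Rule` type, the `resolve` function and the two sample rules are hypothetical and are not taken from the authors' rule base; the relation codes 'o' and 'g' are borrowed from Table 4. The general rules sit in the first layer, and each later layer holds exceptions that are consulted first.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Rule:
    name: str
    applies: Callable[[dict], bool]  # predicate over a word's analysed features
    result: str                      # relation code assigned when the rule fires


def resolve(layers: Sequence[Sequence[Rule]], word: dict) -> Optional[str]:
    """Return the result of the most specific applicable rule.

    layers[0] holds the general rules (utsarga); every later layer holds
    exceptions (apavaada) to the layers before it, so layers are searched
    from the last one back to the first.
    """
    for layer in reversed(layers):
        for rule in layer:
            if rule.applies(word):
                return rule.result
    return None


# Hypothetical rules, for illustration only.
layers = [
    # General rule: a word in case 2 (accusative) is the object.
    [Rule("case 2 -> object", lambda w: w.get("case") == 2, "o")],
    # Exception: with a verb of motion, case 2 marks the destination.
    [Rule("case 2 + motion -> destination",
          lambda w: w.get("case") == 2 and w.get("verb_of_motion", False),
          "g")],
]

plain = resolve(layers, {"case": 2})
motion = resolve(layers, {"case": 2, "verb_of_motion": True})
```

Searching the exception layers first means a general rule never has to enumerate its own exceptions, which is the property the layered arrangement is meant to provide.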
The problem is that the Paninian grammar is generative, and it is not straightforward to invert the grammar to get a Sanskrit analyzer; i.e., it is difficult to rely just on Panini's sutras to build the analyzer. There will be a lot of ambiguities (due to options given in the Panini sutras, as well as a single word having multiple analyses). We therefore need a hybrid scheme that uses some statistical methods for the analysis of a sentence. A probabilistic approach is currently not integrated within the parser since we do not have a Sanskrit corpus to work with, but we hope that in the very near future we will be able to apply statistical methods.
The paper is arranged as follows. Section 2 explains in a nutshell the computational processing of any Sanskrit corpus. We have codified the Nominal and Verb forms in Sanskrit in a form that is directly computable. Our algorithm for processing these texts and preparing Sanskrit lexicon databases is presented in Section 3. The complete parser is described in Section 4, where we discuss how we carry out morphological analysis and hence relation analysis. Results are enumerated in Section 5. Discussion, conclusions and future work follow in Section 6.
2 A STANDARD METHOD FOR
ANALYZING SANSKRIT TEXT
The basic framework for analyzing the Sanskrit corpus is discussed in this section. For every word in a given sentence, the computer is supposed to identify the word in the following structure: <Word> <Base> <Form> <Relation>.
The structure contains the root word (<Base>), its form (<attributes of word>) and its relation with the verb/action or subject of that sentence. This analysis is done so as to completely disambiguate the meaning of the word in the context.
2.1 <Word>
Given a sentence, the parser identifies a single word and processes it using the guidelines laid out in this section. If it is a compound word, then the compound word with […] has to be undone. For example: […] = […] + […].
2.2 <Base>
The base is the original, uninflected form of the word. Finite verb forms, other simple words and compound words are each indicated differently.
For simple words: the computer activates the DFA on the ISCII code (ISCII, 1999) of the Sanskrit text.
For compound words: the computer shows the nesting of internal and external […] using nested parentheses. Undo […] changes between the component words.
2.3 <Form>
The <Form> of a word contains the information regarding declensions for nominals and state for verbs.
For undeclined words, just write u in this column.
For nouns, write first m, f or n to indicate the gender, followed by a number for the case (1 through 7, or 8 for vocative), and s, d or p to indicate singular, dual or plural.
For adjectives and pronouns, write first a, followed by the indications, as for nouns, of gender (skipping this for pronouns unmarked for gender), case and number.
For verbs, in one column indicate the class ([…]) and voice. Show the class by a number from 1 to 11. Follow this (in the same column) by '1' for parasmaipada, '2' for ātmanepada and '3' for ubhayapada. For finite verb forms, give the root. Then (in the same column) show the tense as given in Table 3. Then show the inflection in the same column, if there is one. For finite forms, show the person and number with the codes given in Table 2. For participles, show the case and number as for nouns.
Table 1: Codes for <Form>
pa/   passive
ca/   causative
de/   desiderative
fr/   frequentative
Table 2: Codes for Finite Forms, showing the Person and the Number
1     […]
2     […]
3     […]
s     singular
d     dual
p     plural
Table 3: Codes for Finite verb Forms, showing the Tense
pr    present
if    imperfect
iv    imperative
op    optative
ao    aorist
pe    perfect
fu    future
f2    second future
be    benedictive
co    conditional
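The nominal part of the <Form> convention above can be read back mechanically. The following Python sketch is illustrative only: the function name `decode_form`, the `NominalForm` type and the assumption that the codes are written one after another without separators (for example `m1s`) are ours, since the text does not fix the exact spelling of a complete code.

```python
import re
from typing import NamedTuple, Optional

GENDERS = {"m": "masculine", "f": "feminine", "n": "neuter"}
NUMBERS = {"s": "singular", "d": "dual", "p": "plural"}


class NominalForm(NamedTuple):
    kind: str                 # "undeclined", "noun" or "adjective/pronoun"
    gender: Optional[str]
    case: Optional[int]       # 1 through 7, or 8 for vocative
    number: Optional[str]


# Optional 'a' (adjective/pronoun), optional gender, case digit, number letter.
_FORM = re.compile(r"^(a?)([mfn]?)([1-8])([sdp])$")


def decode_form(code: str) -> NominalForm:
    """Decode a nominal <Form> code, assuming unseparated concatenation."""
    if code == "u":
        return NominalForm("undeclined", None, None, None)
    match = _FORM.match(code)
    if match is None:
        raise ValueError(f"not a nominal form code: {code!r}")
    is_adj, gender, case, number = match.groups()
    if not is_adj and not gender:
        # Only pronouns unmarked for gender may skip the gender letter.
        raise ValueError(f"noun code without gender: {code!r}")
    kind = "adjective/pronoun" if is_adj else "noun"
    return NominalForm(kind, GENDERS.get(gender), int(case), NUMBERS[number])


noun = decode_form("m1s")      # masculine, case 1, singular
pronoun = decode_form("a6p")   # gender skipped, case 6, plural
```

Verb codes are left out of the sketch because they combine class, pada, tense and person/number in a single column whose exact layout is not spelled out here.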
2.4 <Relation>
The relation between the different words in a sentence is worked out using the information obtained from the analysis done using the guidelines laid out in the previous subsections. First write down a period in this column, followed by a number indicating the order of the word in the sentence. The words in each sentence should be numbered sequentially, even when a sentence ends before the end of a text or extends over more than one text. Then, in the same column, indicate the kind of connection the word has to the sentence, using the codes given in Table 4. Then, in the same column, give the number of the other word in the sentence to which this word is connected as modifier or otherwise. The relation set given in Table 4 is not exhaustive. All six karakas are defined in relation to the verb.
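As a small illustration of the <Relation> column, the Python sketch below splits one entry into its three parts. The spelling of an entry as a single unseparated string such as `.2o3` (period, word order, Table 4 code, number of the related word) is our assumption, as are the names `Relation` and `parse_relation`; the related-word number is treated as optional so that a main verb or a vocative can stand alone.

```python
import re
from typing import NamedTuple, Optional

# Relation codes as listed in Table 4.
RELATION_CODES = {
    "v": "main verb", "vs": "subordinate verb", "s": "subject",
    "o": "object", "g": "destination (gati)", "a": "adjective",
    "n": "noun in apposition", "d": "predicate nominative",
    "m": "other modifier", "p": "preposition", "c": "conjunction",
    "u": "vocative", "q": "quoted sentence or phrase",
    "r": "definition of a word or phrase",
}


class Relation(NamedTuple):
    order: int              # position of the word in the sentence
    code: str               # one of the RELATION_CODES keys
    target: Optional[int]   # position of the word it connects to, if any


# 'vs' is listed before the single letters so that it is not read as 'v'.
_RELATION = re.compile(r"^\.(\d+)(vs|[vsogandmpcuqr])(\d*)$")


def parse_relation(entry: str) -> Relation:
    """Parse one <Relation> entry, assuming unseparated concatenation."""
    match = _RELATION.match(entry)
    if match is None:
        raise ValueError(f"not a relation entry: {entry!r}")
    order, code, target = match.groups()
    return Relation(int(order), code, int(target) if target else None)


obj = parse_relation(".2o3")    # word 2 is the object of word 3
verb = parse_relation(".3v")    # word 3 is the main verb
```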
3 ALGORITHM FOR SANSKRIT
RULEBASE
In the sections to follow, we explain two of the procedures/algorithms that we have developed for the computational analysis of Sanskrit. Combined with these algorithms, we have arrived at the skeletal base upon which many different modules for Sanskrit linguistic analysis, such as relations, […] and […], can be worked out.
Table 4: Codes for <Relation>
v     main verb
vs    subordinate verb
s     subject (of the sentence or a subordinate clause)
o     object (of a verb or preposition)
g     destination (gati) of a verb of motion
a     adjective
n     noun modifying another in apposition
d     predicate nominative
m     other modifier
p     preposition
c     conjunction
u     vocative, with no syntactic connection
q     quoted sentence or phrase
r     definition of a word or phrase (in a commentary)
3.1 Sanskrit Rule Database
Every natural language must have a representation which is directly computable. To achieve this, we have encoded the grammatical rules and designed the syntactic structure for both the nominal and verbal words in Sanskrit. Let us illustrate this structure for both nouns and verbs with an example each.
Noun:- Any noun has one of three genders: Masculine, Feminine and Neuter. Likewise, a noun has three numbers: Singular, Dual and Plural. Again, there are eight cases in each number: Nominative, Accusative, Instrumental, Dative, Ablative, Genitive, Locative and Vocative. Interestingly, these express nearly all the relations between words in a sentence.
In the Sanskrit language, every noun is inflected following a general rule based on its ending letter, such as […]. For example, […] is in class […], which ends with […] (a). Such classifications are given in Table 5. Each of these has different inflections depending upon which gender it corresponds to. Thus […] has different masculine and neuter declensions, […] has masculine and feminine declensions, and […] has masculine, feminine and neuter declensions. We have then encoded each of the declensions in ISCII code, so that it can be processed easily by the computer using the algorithm that we have developed for the linguistic analysis of any word.
Table 5: Attributes of the declension for nouns
Attribute   Symbol   Values
Class                1 to 25 ([…])
Case        η        1 to 8 ([…])
Gender      ζ        1 to 3 ([…])
Number      […]      1 to 3 ([…])
Let us illustrate this structure for the noun with an example. For […], the masculine, nominative, singular declension is encoded in the following syntax: (163 {1, 1_η, 1_ζ, 1}), where 163 is the ISCII code of the declension (Table 6). The four 1's in the curly brackets represent Class, Case (η), Gender (ζ) and Number respectively (Table 5).
Table 6: Noun example (Masculine, Singular ([…]))
Case         Ending   ISCII Code
Nominative   […]      163
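To make the encoding concrete, here is a minimal Python sketch of a deterministic automaton over ISCII codes whose accepting states carry the attribute tuple. It is a sketch under stated assumptions, not the authors' implementation: the class and method names are hypothetical, reading the word from its last code backwards is our choice, and the stem codes 200 and 204 in the usage lines are placeholders. Only the ending 163 with attributes {1, 1, 1, 1} comes from the example above.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class NounAttrs(NamedTuple):
    noun_class: int
    case: int
    gender: int
    number: int


class EndingDFA:
    """Automaton over ISCII codes, read from the end of the word backwards."""

    def __init__(self) -> None:
        self.transitions: List[Dict[int, int]] = [{}]    # state 0 is the start
        self.accepting: Dict[int, List[NounAttrs]] = {}  # state -> analyses

    def add_ending(self, codes: Sequence[int], attrs: NounAttrs) -> None:
        """Register an ending (in normal reading order) with its attributes."""
        state = 0
        for code in reversed(codes):
            nxt = self.transitions[state].get(code)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][code] = nxt
            state = nxt
        self.accepting.setdefault(state, []).append(attrs)

    def analyse(self, word: Sequence[int]) -> List[Tuple[int, NounAttrs]]:
        """Return (ending length, attributes) for every ending the word has."""
        results: List[Tuple[int, NounAttrs]] = []
        state = 0
        for length, code in enumerate(reversed(word), start=1):
            nxt = self.transitions[state].get(code)
            if nxt is None:
                break
            state = nxt
            for attrs in self.accepting.get(state, []):
                results.append((length, attrs))
        return results


dfa = EndingDFA()
dfa.add_ending([163], NounAttrs(1, 1, 1, 1))   # the example entry from the text
analyses = dfa.analyse([200, 204, 163])        # 200, 204: placeholder stem codes
```

Each accepting state keeps a list of attribute tuples because one ending may correspond to several case and number combinations; choosing between them is left to the later relation analysis.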
Pronouns:- According to Paninian grammar and Kale (Kale), Sanskrit has 35 pronouns, which are: […].
We have classified each of these pronouns into 9 classes: Personal, Demonstrative, Relative, Indefinitive, Correlative, Reciprocal and Possessive. Each of these pronouns has different inflectional forms arising from the different declensions of the masculine and feminine forms. We have codified the pronouns in a form similar to that of nouns.
Adjectives:- Adjectives are dealt with in the same manner as nouns. The repetition of the linguistic morphology is avoided.
Verbs:- A verb in a Sanskrit sentence expresses an action that is enhanced by a set of "auxiliaries", these auxiliaries being the nominals that have been discussed previously.
The meaning of the verb is said to be both vyapara (action, activity, cause) and phala (fruit, result, effect). Syntactically, its meaning is invariably linked with the meaning of the verb "to do". In our analysis of verbs, we have found that they are classified into 11 classes ([…], Table 7). While coding the endings, each class is subdivided according to 'it' knowledge ([…]), and each of these is again sub-classified into 3 sub-classes, parasmaipada, ātmanepada and ubhayapada, which we have denoted as pada. Each verb sub-class again has 10 lakaaras, which are used to express the tense of the action. Again, depending upon the form of the sentence, a division of form into […], […] and […] has been made. This classification is referred to as voice. This structure is shown in Table 7.
Table 7: Attributes of the declension for verbs
Attribute   Symbol   Values
Class                1 to 11 ([…])
it          γ        1 to 3 ([…])
pada        η        1 to 3 ([…])
Tense       ζ        1 to 10 ([…])
Voice       λ        1 to 3 ([…])
Person      […]      1 to 3 ([…])
Number      δ        1 to 3 ([…])
Let us express the structure via an example for […], […], Present Tense, First person,
References
Introduction to Automata Theory, Languages, and Computation (book)
Planning as heuristic search (journal article)
Recognition of visual activities and interactions by stochastic parsing (journal article)
Parsing Free Word Order Languages in the Paninian Framework (proceedings article)
A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger (journal article)