
Parallelizing natural language techniques for knowledge extraction from cloud service level agreements

TL;DR: This paper significantly automated the process of extracting, managing and monitoring cloud SLAs using natural language processing techniques and Semantic Web technologies, and describes a prototype system that uses a Hadoop cluster to extract knowledge from unstructured legal text documents.
Abstract: To efficiently utilize their cloud based services, consumers have to continuously monitor and manage the Service Level Agreements (SLA) that define the service performance measures. Currently this is still a time and labor intensive process since the SLAs are primarily stored as text documents. We have significantly automated the process of extracting, managing and monitoring cloud SLAs using natural language processing techniques and Semantic Web technologies. In this paper we describe our prototype system that uses a Hadoop cluster to extract knowledge from unstructured legal text documents. For this prototype we have considered publicly available SLA/terms of service documents of various cloud providers. We use established natural language processing techniques in parallel to speed up cloud legal knowledge base creation. Our system considerably speeds up knowledge base creation and can also be used in other domains that have unstructured data.

Summary (1 min read)

Introduction

  • A knowledge base can be used to store complex information extracted from other structured and unstructured sources.
  • The authors are creating a 'Legal Knowledge Base' that will store facts extracted from various legal documents.
  • The authors begin by creating a knowledge base from various Service Level Agreement (SLA) documents.

III. PARALLEL KNOWLEDGE EXTRACTION SYSTEM

  • The authors describe a system that extracts, in parallel, subject-predicate-object triples from cloud SLA documents.
  • The authors used a configurable Apache Hadoop [9] cluster of 2 to 4 nodes with the Natural Language Toolkit (NLTK) [10], Stanford PoS Tagger [11] and CMU Link Parser [12] installed.
  • The authors' system is divided into two subsystems, 'Extractor' and 'Assessor'.

B. Assessor

  • The authors assess the quality of their extractions using a simple data type comparison with the Cloud SLA Ontology [16].
  • The authors verified the quality for a few documents manually and found that their data type comparison gave us good results.

C. Knowledge Base Creation

  • The main aim of their system is to create a knowledge base for various documents like cloud SLAs, legal documents, contracts, agreements etc.
  • The authors aim to store the knowledge base as an RDF [1] graph and provide a SPARQL[2] endpoint for the user to query.

IV. PERFORMANCE GAIN

  • To evaluate their parallel system, the authors compare the time required to create the legal knowledge base from different numbers of identical documents against a single-threaded system.
  • As per their experiments, the single-threaded system performs better when the number of documents is small.
  • As the number of documents increases, the Hadoop-based parallel system outperforms the single-threaded system.
  • For the experiment the authors use 4 identically configured machines, each with 8 gigabytes of RAM and a quad-core 3.2 GHz processor.


Parallelizing Natural Language Techniques for Knowledge Extraction from Cloud
Service Level Agreements.
Sudip Mittal, Karuna P. Joshi, Claudia Pearce and Anupam Joshi
University of Maryland, Baltimore County
Baltimore, MD 21250, USA
Email: {smittal1,kjoshi1,cpearce,joshi}@umbc.edu
Abstract—To efficiently utilize their cloud based services,
consumers have to continuously monitor and manage the
Service Level Agreements (SLA) that define the service
performance measures. Currently this is still a time and labor
intensive process since the SLAs are primarily stored as text
documents. We have significantly automated the process of
extracting, managing and monitoring cloud SLAs using natural
language processing techniques and Semantic Web
technologies. In this paper we describe our prototype system that uses
a Hadoop cluster to extract knowledge from unstructured legal
text documents. For this prototype we have considered publicly
available SLA/terms of service documents of various cloud
providers. We use established natural language processing
techniques in parallel to speed up cloud legal knowledge base
creation. Our system considerably speeds up knowledge base
creation and can also be used in other domains that have
unstructured data.
Keywords-Knowledge Extraction, Distributed Systems, Data
Mining.
I. INTRODUCTION
A knowledge base can be used to store complex information extracted from structured and unstructured sources. We are creating a 'Legal Knowledge Base' that will store facts extracted from various legal documents. Currently, legal documents like Service Level Agreements (SLAs), contracts, compliance and regulatory policies, privacy policies, etc. are managed as plain text files meant principally for human consumption. Creating this legal knowledge base will make these documents machine understandable. Once we create our Legal Knowledge Base, we can reason over it to find answers to specific legal questions. Creating a Legal Knowledge Base is the first step toward building a legal question-answering system.
We begin by creating a knowledge base from various Service Level Agreement (SLA) documents. We extract various SLA metrics and term definitions and store them as subject-predicate-object triples in the popular RDF [1] graph format, which a user can query through a SPARQL [2] endpoint. However, this system needs to be highly scalable to deal with the large number of potential documents that contain information relevant to a query.
Our Legal Knowledge Base can help consumers monitor,
contrast and analyze SLA documents for different cloud
service providers. We envision a system which will maintain
knowledge about various legal terms and clauses contained
in SLAs, compliance and regulatory policies, contracts, privacy documents, etc. In this paper we describe the techniques
that we have developed to extract knowledge from various
cloud SLA documents using a Hadoop-based system. Section
II covers the related work in this area. In sections III & IV
we describe the prototype system that we have created as
a proof of our concept. We end with the conclusions and
future work.
II. RELATED WORK
Researchers have applied Natural Language Processing (NLP) techniques to extract information from text documents. Rusu et al. [3] suggest an approach to extract subject-predicate-object triplets: they generate parse trees from English sentences and extract triplets from those parse trees. Etzioni et al. [4] developed a system to automate the process of extracting large collections of facts from the Web in an unsupervised, domain-independent, and scalable manner, using Pattern Learning to address this challenge. Various textual information extraction and retrieval systems have been proposed in [5], [6]. Another important natural language technique used for information extraction from unstructured text is 'Noun Phrase Extraction'. Rusu et al. [3] show how to create triplets by considering 'Noun Phrases' obtained by using various part-of-speech taggers. Similar techniques have also been suggested in [4]. Niu et al. [7], [8] suggest various machine learning and language-based methods for knowledge base creation.
III. PARALLEL KNOWLEDGE EXTRACTION SYSTEM
In this section we describe our system to extract, in parallel, subject-predicate-object triples from cloud SLA documents. We used a configurable Apache Hadoop [9] cluster of 2 to 4 nodes with the Natural Language Toolkit (NLTK) [10], the Stanford PoS Tagger [11] and the CMU Link Parser [12] installed. Our system is divided into two subsystems, 'Extractor' and 'Assessor' (Figure 1).
A. Extractor
Extractor takes an SLA document as input and then extracts all SLA metrics and term definitions found in that document.

Figure 1. Architecture Diagram. (Figure components: Documents, Mapper, Tagging, Individual phrases, Reducer, Intermittent RDF, Assessor, Knowledge Base, Generated Result.)
It is the core module of our system and is based on domain-specific English patterns and rules. Our extraction technique is based on generally followed rules for writing legal documents [13]. It first splits a document into individual sentences and then uses natural language techniques like 'Pattern Learning' and 'Noun Phrase Extraction' to extract knowledge.
Pattern Learning involves learning a small set of extraction patterns which are then used to filter out term definitions from unstructured SLA documents. A similar technique was used by Etzioni et al. [4] to extract city names from unstructured data and by Sipos et al. [14] to extract triples from text. The rules used in our system are listed in Table I.
In order to create triples we use the technique of 'Noun Phrase Extraction' [3], [15], [4]. In this technique we look at the 'Noun Phrase' part of the sentence found in the treebank structure generated by the Stanford PoS Tagger [11] and the CMU Link Parser [12]. We then use these noun phrases to create triplets.
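As a rough illustration of this noun-phrase step (not the exact code used in our system), the sketch below extracts noun phrases with NLTK's built-in tagger and a regular-expression chunker standing in for the Stanford PoS Tagger and CMU Link Parser; the chunk grammar is an illustrative assumption.

    # Illustrative sketch only: NLTK's default tagger and a regex chunker stand in
    # for the Stanford PoS Tagger and CMU Link Parser used in our pipeline.
    # Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' models.
    import nltk

    def noun_phrases(sentence):
        """Return the noun phrases found in a single sentence."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # NP = optional determiner, any adjectives, one or more nouns
        chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
        tree = chunker.parse(tagged)
        return [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "NP"]

    print(noun_phrases("A Service Credit is a dollar credit that we may "
                       "credit back to an eligible account"))

For the example sentence this yields phrases such as "A Service Credit", "a dollar credit" and "an eligible account", which are then candidates for the subject and object positions of a triple.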
Table I. Pattern-based rules for our Extractor

Pattern              Keyword
X is defined Y       'is defined'
X means Y            'means'
X is calculated Y    'is calculated'
X is Y               'is'

Constraint: X is a quoted, bold, underlined or italicised text.
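A minimal sketch of how the Table I rules can be applied to a sentence is shown below. The regular expressions and the handling of the quoted/bold/underlined/italicised constraint are simplifying assumptions made for illustration, not the Extractor's actual implementation.

    # Illustrative sketch of the Table I rules as regular expressions. The real
    # Extractor also enforces the constraint that X is quoted, bold, underlined or
    # italicised; plain-text input only lets us approximate the quoted case here.
    import re

    PATTERNS = [
        re.compile(r'^"?(?P<X>[A-Z][\w ]+?)"?\s+is defined\s+(?P<Y>.+)$'),
        re.compile(r'^"?(?P<X>[A-Z][\w ]+?)"?\s+means\s+(?P<Y>.+)$'),
        re.compile(r'^"?(?P<X>[A-Z][\w ]+?)"?\s+is calculated\s+(?P<Y>.+)$'),
        re.compile(r'^"?(?P<X>[A-Z][\w ]+?)"?\s+is\s+(?P<Y>.+)$'),
    ]

    def match_definition(sentence):
        """Return (term, definition) if the sentence matches one of the rules."""
        for pattern in PATTERNS:
            m = pattern.match(sentence.strip())
            if m:
                return m.group("X").strip(), m.group("Y").strip()
        return None

    print(match_definition("A Service Credit is a dollar credit, calculated as set "
                           "forth below, that we may credit back to an eligible account"))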
In our implementation on a Hadoop cluster, the map function emits individual sentences and the reduce function creates RDF statements by matching sentences to patterns and parsing treebank structures. An example iteration can be found in Figure 2.
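The following is a simplified, Hadoop Streaming-style sketch of this map/reduce step rather than our production code. It assumes the match_definition helper from the pattern sketch above, an input format of one "doc_id<TAB>text" record per line, and a placeholder urn:sla: naming scheme.

    # Simplified Hadoop Streaming-style sketch of the map and reduce steps.
    # Assumes match_definition() from the pattern sketch above; the urn:sla:
    # identifiers are placeholders, not the vocabulary used in our system.
    # The mapper needs NLTK's 'punkt' sentence model.
    import sys
    import nltk

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        """Emit 'doc_id<TAB>sentence' for every sentence of every input document."""
        for line in stdin:
            doc_id, _, text = line.rstrip("\n").partition("\t")
            for sentence in nltk.sent_tokenize(text):
                stdout.write("%s\t%s\n" % (doc_id, " ".join(sentence.split())))

    def reducer(stdin=sys.stdin, stdout=sys.stdout):
        """Turn matching sentences into simple N-Triples-style statements."""
        for line in stdin:
            doc_id, _, sentence = line.rstrip("\n").partition("\t")
            hit = match_definition(sentence)   # pattern rules from Table I
            if hit:
                term, definition = hit
                stdout.write('<urn:sla:%s#%s> <urn:sla:definition> "%s" .\n'
                             % (doc_id, term.replace(" ", "_"), definition))

Under Hadoop Streaming these two functions would typically live in separate scripts passed to the streaming jar's -mapper and -reducer options, with the SLA documents staged in HDFS as the job input.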
B. Assessor
We assess the quality of our extractions using a simple
data type comparison with the Cloud SLA Ontology [16].
We verified the quality for a few documents manually and
found that our data type comparison gave us good results.
In the future we can develop an independent machine learning module to assess the system's output.
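A minimal sketch of this kind of data-type check is given below; the expected types per term are hypothetical stand-ins for what the Cloud SLA Ontology [16] actually specifies.

    # Illustrative data-type check. The EXPECTED_TYPE mapping is a placeholder;
    # in our system the expected types come from the Cloud SLA Ontology [16].
    import re

    EXPECTED_TYPE = {
        "Monthly Uptime Percentage": "percentage",
        "Service Credit": "text",
    }

    CHECKS = {
        "percentage": lambda value: re.search(r"\d+(\.\d+)?\s*%", value) is not None,
        "text":       lambda value: len(value.split()) > 1,
    }

    def assess(term, value):
        """Accept an extraction only if its value matches the expected data type."""
        expected = EXPECTED_TYPE.get(term)
        return expected in CHECKS and CHECKS[expected](value)

    print(assess("Monthly Uptime Percentage", "less than 99.95%"))  # True
    print(assess("Monthly Uptime Percentage", "a dollar credit"))   # False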
C. Knowledge Base Creation
The main aim of our system is to create a knowledge base for various documents like cloud SLAs, legal documents, contracts, agreements, etc. We aim to store the knowledge base as an RDF [1] graph and provide a SPARQL [2] endpoint for the user to query.
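As a sketch of what this looks like in practice, the snippet below loads one extracted triple into an RDF graph with rdflib and runs a SPARQL query over it. The example.org namespace and the sla:definition predicate are placeholders, not the terms defined in the Cloud SLA Ontology [16].

    # Sketch of the knowledge-base step with rdflib. The namespace and predicate
    # are illustrative placeholders, not the Cloud SLA Ontology vocabulary.
    from rdflib import Graph, Literal, Namespace

    SLA = Namespace("http://example.org/sla#")

    g = Graph()
    g.add((SLA.ServiceCredit, SLA.definition,
           Literal("a dollar credit that may be credited back to an eligible account")))

    results = g.query("""
        PREFIX sla: <http://example.org/sla#>
        SELECT ?term ?def WHERE { ?term sla:definition ?def . }
    """)
    for term, definition in results:
        print(term, definition)

Exposing such a graph through a public SPARQL endpoint can then be done by loading it into any standard triple store that accepts RDF input.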
IV. PERFORMANCE GAIN
To evaluate our parallel system, we compare the time required to create our legal knowledge base from different numbers of identical documents against a single-threaded system. In our experiments, the single-threaded system performs better when the number of documents is small. However, as the number of documents increases, the Hadoop-based parallel system outperforms the single-threaded system. For our experiment we use 4 identically configured machines, each with 8 gigabytes of RAM and a quad-core 3.2 GHz processor.
Figure 3. Time required to create knowledge base with different number
of documents for the single threaded, 2 node cluster and 4 node cluster.
V. CONCLUSIONS & FUTURE WORK
In this paper, we describe our prototype system, which extracts knowledge from legal documents in parallel and creates a knowledge base from them. We envision a system which will have knowledge about various legal documents, contracts, agreements, etc. and will be able to automatically suggest a service based on one's needs. We achieve a considerable speedup by parallelizing our system. We believe that parallel knowledge base population is faster and can also be used in other domains to extract knowledge. In the future we would like to extend our system to include various other legal documents and agreements.

Figure 2. An example iteration where a sentence is converted to a tree structure and then to an RDF statement using Pattern Matching. (In the figure, the sentence "A Service Credit is a dollar credit, calculated as set forth below, that we may credit back to an eligible account" is emitted by the Mapper, parsed with the Link Parser, and matched to a pattern, yielding the assignment X = "Service Credit", Y = "a dollar credit, calculated as set forth below, that we may credit back to an eligible account" and the extraction definition(Service Credit, a dollar credit, calculated as set forth below, that we may credit back to an eligible account), which the Reducer adds to the Knowledge Graph.)
REFERENCES
[1] "Resource Description Framework (RDF)." [Online]. Available: http://www.w3.org/RDF/
[2] "SPARQL Protocol and RDF Query Language 1.1 Overview." [Online]. Available: http://www.w3.org/TR/sparql11-overview/
[3] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, and D. Mladenic, "Triplet extraction from sentences," in Proceedings of the 10th International Multiconference Information Society-IS, 2007, pp. 8–12.
[4] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: An experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91–134, 2005.
[5] F. Ciravegna, "(LP)2, an adaptive algorithm for information extraction from web-related texts," in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining. Citeseer, 2001.
[6] P. Cimiano, S. Staab, and J. Tane, "Automatic acquisition of taxonomies from text: FCA meets NLP," in Proceedings of the International Workshop & Tutorial on Adaptive Text Extraction and Mining held in conjunction with the 14th European Conference on Machine Learning and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2003.
[7] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik, "DeepDive: Web-scale knowledge-base construction using statistical learning and inference," 2012.
[8] F. Niu, C. Zhang, C. Ré, and J. Shavlik, "Elementary: Large-scale knowledge-base construction via machine learning and statistical inference," International Journal on Semantic Web and Information Systems (IJSWIS), vol. 8, no. 3, pp. 42–73, 2012.
[9] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly Media, Inc., 2009.
[10] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[11] "The Stanford Parser: A statistical parser." [Online]. Available: http://nlp.stanford.edu/software/lex-parser.shtml
[12] "Carnegie Mellon University Link Grammar." [Online]. Available: http://www.link.cs.cmu.edu/link/
[13] "Drafting legal documents." [Online]. Available: http://www.archives.gov/federal-register/write/legal-docs/definitions.html
[14] R. Sipoš, D. Mladenić, M. Grobelnik, and J. Brank, "Modeling common real-word relations using triples extracted from n-grams," in The Semantic Web. Springer, 2009, pp. 16–30.
[15] K. Barker and N. Cornacchia, "Using noun phrase heads to extract document keyphrases," in Advances in Artificial Intelligence. Springer, 2000, pp. 40–52.
[16] K. Joshi and T. Finin, "Ontology for cloud services SLA." [Online]. Available: http://ebiquity.umbc.edu/resource/html/id/344/Ontology-for-Cloud-Services-SLA-Service-Level-Agreement
Citations
Journal ArticleDOI
TL;DR: A new trust framework, called Context-Aware Multifaceted Trust Framework (CAMFT), is proposed to assist in evaluating trust in cloud service providers and is flexible and context aware: it considers trust factors, users and services.

31 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A semantically rich ontology is developed to describe the privacy policy documents and a database of several policy documents is built as instances of this ontology based on deontic logic which can be used to automate management of data privacy.
Abstract: Ensuring privacy of Big Data managed on the cloud is critical to ensure consumer confidence. Cloud providers publish privacy policy documents outlining the steps they take to ensure data and consumer privacy. These documents are available as large text documents that require manual effort and time to track and manage. We have developed a semantically rich ontology to describe the privacy policy documents and built a database of several policy documents as instances of this ontology. We next extracted rules from these policy documents based on deontic logic which can be used to automate management of data privacy. In this paper we describe our ontology in detail along with the results of our analysis of privacy policies of prominent cloud services.

27 citations


Cites methods from "Parallelizing natural language tech..."

  • ...Referring to the NIST guidelines on cloud privacy [3] and PII information [1], we have identified the key components of a privacy notice that are defined as object properties in the main Privacy Policy class....


Proceedings ArticleDOI
01 Dec 2017
TL;DR: This paper has developed a novel framework to automatically track details about how a user's PII is stored, used and shared by the provider, and integrated its data privacy ontology with the properties of blockchain to develop an automated access-control and audit mechanism that enforces users' data privacy policies when sharing their data across third parties.
Abstract: With the advent of numerous online content providers, utilities and applications, each with their own specific version of privacy policies and its associated overhead, it is becoming increasingly difficult for concerned users to manage and track the confidential information that they share with the providers. We have developed a novel framework to automatically track details about how a user's PII is stored, used and shared by the provider. We have integrated our data privacy ontology with the properties of blockchain, to develop an automated access-control and audit mechanism that enforces users' data privacy policies when sharing their data across third parties. We have also validated this framework by implementing a working system LinkShare. In this paper, we describe our framework in detail along with the LinkShare system. Our approach can be adopted by big data users to automatically apply their privacy policy on data operations and track the flow of that data across various stakeholders.

20 citations

Proceedings ArticleDOI
18 Sep 2016
TL;DR: ALDA, a legal cognitive assistant to analyze digital legal documents is discussed and some of the preliminary results obtained by analyzing legal documents using techniques such as semantic web, text mining and graph analysis are presented.
Abstract: In recent times, there has been an exponential growth in digitization of legal documents such as case records, contracts, terms of services, regulations, privacy documents and compliance guidelines. Courts have been digitizing their archived cases and also making it available for e-discovery. On the other hand, businesses are now maintaining large data sets of legal contracts that they have signed with their employees, customers and contractors. Large public sector organizations are often bound by complex legal legislation and statutes. Hence, there is a need of a cognitive assistant to analyze and reason over these legal rules and help people make decisions. Today the process of monitoring an ever increasing dataset of legal contracts and ensuring regulations and compliance is still very manual and labour intensive. This can prove to be a bottleneck in the smooth functioning of an enterprise. Automating these digital workflows is quite hard because the information is available as text documents but it is not represented in a machine understandable way. With the advancements in cognitive assistance technologies, it is now possible to analyze these digitized legal documents efficiently. In this paper, we discuss ALDA, a legal cognitive assistant to analyze digital legal documents. We also present some of the preliminary results we have obtained by analyzing legal documents using techniques such as semantic web, text mining and graph analysis.

18 citations


Cites background or methods from "Parallelizing natural language tech..."

  • ...We have used this approach for automating Cloud service level agreements (Joshi, Yesha, and Finin 2014; Mittal et al. 2015; 2016; Gupta et al. 2016)....


  • ...Details can be found in (Mittal et al. 2015; 2016)....


References
Book
29 May 2009
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Abstract: Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you: Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced features for writing real-world MapReduce programs Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud Use Pig, a high-level query language for large-scale data processing Take advantage of HBase, Hadoop's database for structured and semi-structured data Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject. "Now you have the opportunity to learn about Hadoop from a master-not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

3,797 citations


"Parallelizing natural language tech..." refers background in this paper

  • ...We end with the conclusions and future work....


Book
12 Jun 2009
TL;DR: This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.
Abstract: This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you'll learn how to write Python programs that work with large collections of unstructured text. You'll access richly annotated datasets using a comprehensive range of linguistic data structures, and you'll understand the main algorithms for analyzing the content and structure of written communication. Packed with examples and exercises, Natural Language Processing with Python will help you: Extract information from unstructured text, either to guess the topic or identify "named entities" Analyze linguistic structure in text, including parsing and semantic analysis Access popular linguistic databases, including WordNet and treebanks Integrate techniques drawn from fields as diverse as linguistics and artificial intelligence This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you're interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages -- or if you're simply curious to have a programmer's perspective on how human language works -- you'll find Natural Language Processing with Python both fascinating and immensely useful.

3,361 citations

Book ChapterDOI
17 Dec 2009
TL;DR: An introduction to RDF and its related vocabulary definition language RDF Schema is provided, and its relationship with the OWL Web Ontology Language is explained.
Abstract: The Resource Description Framework (RDF) is the standard knowledge representation language for the Semantic Web, an evolution of the World Wide Web that aims to provide a well-founded infrastructure for publishing, sharing and querying structured data. This article provides an introduction to RDF and its related vocabulary definition language RDF Schema, and explains its relationship with the OWL Web Ontology Language. Finally, it provides an overview of the historical development of RDF and related languages for Web metadata.

1,255 citations

Journal ArticleDOI
TL;DR: An overview of KnowItAll's novel architecture and design principles is presented, emphasizing its distinctive ability to extract information without any hand-labeled training examples, and three distinct ways to address this challenge are presented and evaluated.

1,201 citations


"Parallelizing natural language tech..." refers background or methods or result in this paper

  • ...Another important natural language technique used for information extraction from unstructured text is ‘Noun Phrase Extraction’....


  • ...Similar techniques have also been suggested in [4]....


  • ...We end with the conclusions and future work....


Book ChapterDOI
TL;DR: The simple noun phrase-based system performs roughly as well as a state-of-the-art, corpus-trained keyphrase extractor; ratings for individual keyphrases do not necessarily correlate with ratings for sets of keyphRases for a document.
Abstract: Automatically extracting keyphrases from documents is a task with many applications in information retrieval and natural language processing. Document retrieval can be biased towards documents containing relevant keyphrases; documents can be classified or categorized based on their keyphrases; automatic text summarization may extract sentences with high keyphrase scores. This paper describes a simple system for choosing noun phrases from a document as keyphrases. A noun phrase is chosen based on its length, its frequency and the frequency of its head noun. Noun phrases are extracted from a text using a base noun phrase skimmer and an off-the-shelf online dictionary. Experiments involving human judges reveal several interesting results: the simple noun phrase-based system performs roughly as well as a state-of-the-art, corpus-trained keyphrase extractor; ratings for individual keyphrases do not necessarily correlate with ratings for sets of keyphrases for a document; agreement among unbiased judges on the keyphrase rating task is poor.

270 citations


"Parallelizing natural language tech..." refers result in this paper

  • ...Similar techniques have also been suggested in [4]....


Frequently Asked Questions (2)
Q1. What have the authors contributed in "Parallelizing natural language techniques for knowledge extraction from cloud service level agreements"?

The authors have significantly automated the process of extracting, managing and monitoring cloud SLAs using natural language processing techniques and Semantic Web technologies. In this paper the authors describe their prototype system that uses a Hadoop cluster to extract knowledge from unstructured legal text documents. For this prototype the authors have considered publicly available SLA/terms of service documents of various cloud providers. 

Q2. What future works have been suggested in "Parallelizing natural language techniques for knowledge extraction from cloud service level agreements"?

In the future the authors would like to extend their system to include various other legal documents and agreements.