
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. III (Nov-Dec. 2014), PP 57-59
www.iosrjournals.org
Data Trawling and Security Strategies
Venkata Karthik Gullapalli¹, Aishwarya Asesh²
¹(School of Computing Science and Engineering, VIT University, India)
²(School of Computing Science and Engineering, VIT University, India)
Abstract: The amount of data in the world keeps increasing, and computers make it easy to save that data. Companies offer data storage through cloud services, and the volume of data held on these servers is growing rapidly. In data mining, the data is stored electronically and the search for patterns is automated, or at least augmented, by computer. As the volume of data increases inexorably, the proportion of it that people understand decreases alarmingly. This paper examines the data leakage problem that arises because services such as Facebook and Google store user data unencrypted on their servers, making it easy for them, or for governments and hackers, to monitor that data.
I. Introduction
Data mining is defined as the process of discovering patterns in data. The process must be automatic or semi-automatic, and the patterns discovered must be meaningful in that they lead to some advantage, usually an economic one. The data is invariably present in substantial quantities. Data mining is about solving problems by analyzing data already present in databases. The World Wide Web is becoming an important medium for sharing information on a wide range of topics; according to most predictions, the majority of human information will be available on the web within five years. Data mining has evolved at the confluence of several disciplines, including database management systems (DBMS), statistics, artificial intelligence (AI), and machine learning (ML) [3]. Given a truly massive amount of data, the challenge in data mining is to unearth hidden relationships among various attributes of the data and between several snapshots of the data over a period of time [7]. These hidden patterns have enormous potential in prediction and personalization [7].
II. Data Leakage Problems
Major organizations still leave users' passwords vulnerable. Password vulnerabilities ought to be a rarity: well-known and easily followed techniques exist for generating, using and storing passwords that should keep both individuals and organizations safe. Yet in 2012 we saw one massive password breach after another at a slew of high-profile organizations. Russian cybercriminals posted nearly 6.5 million LinkedIn passwords on the Internet. Teams of hackers rapidly went to work attacking those passwords and cracked more than 60% within days. Their task was made simpler by the fact that LinkedIn had not "salted" its password database with random data before hashing it. Dating website eHarmony quickly reported that some 1.5 million of its own passwords were uploaded to the web following the same attack that hit LinkedIn. Formspring discovered that the passwords of 420,000 of its users had been compromised and posted online, and instructed all 28 million of the site's members to change their passwords as a precaution. Yahoo Voices admitted that nearly 500,000 of its own email addresses and passwords had been stolen. Multinational technology firm Philips was attacked by the rootbeer gang, which walked away with thousands of names, telephone numbers, addresses and unencrypted passwords. IEEE, the world's largest professional association for the advancement of technology, left a log file of nearly 400 million web requests in a world-readable directory; those requests included the usernames and plain-text passwords of nearly 100,000 unique users.
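The salting referred to above is straightforward to apply. The following minimal sketch, using only Python's standard library, shows one common way to store passwords: a random per-user salt combined with a slow key-derivation function, so that a leaked password database cannot be cracked with precomputed tables. The iteration count and salt length are illustrative assumptions, not a prescription.

```python
# Minimal sketch: salted password hashing with a slow key-derivation function.
# Standard library only; iteration count and salt length are illustrative choices.
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); a unique random salt defeats precomputed rainbow tables."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 600_000)
    return hmac.compare_digest(candidate, stored_digest)

salt, digest = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, digest)
assert not verify_password("wrong guess", salt, digest)
```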
In an attempt to ascertain cloud computing reliability, 11,491 news articles on cloud computing-related outages from 39 news sources between January 2008 and February 2012, effectively covering the first five years of cloud computing, were reviewed [1]. During this period the number of cloud vulnerability incidents rose considerably; for instance, it more than doubled between 2009 and 2011, rising from 33 to 71 incidents. A total of 172 unique cloud computing outage incidents were uncovered, of which 129 (75%) declared their cause(s) while 43 (25%) did not. As cloud computing matures into mainstream computing, transparency in the disclosure of outages is imperative.
The scope for data leakage is very wide and is not limited to email and the web. We are all too familiar with stories of data loss from laptop theft, hacker break-ins, backup tapes being lost or stolen, and so on. How can we defend ourselves against the growing threat of data leakage via messaging, social engineering, malicious hackers, and more? Many manufacturers have products that help reduce electronic data leakage but do not address other vectors. This paper aims to provide a holistic discussion of data leakage and its prevention, and to serve as a starting point for businesses in their fight against it.

On August 31, 2014, a collection of almost 200 private pictures of various celebrities was posted online; these images were believed to have been obtained via a breach of Apple's cloud services suite, iCloud. It has also been reported that a collection of 5 million Gmail addresses and passwords was leaked. Recently, usernames and passwords of Dropbox users appeared online; these credentials were stolen from other services and used in attempts to log in to Dropbox accounts. New and sophisticated techniques that have been developed in the area of data mining (also known as knowledge discovery) can aid in the extraction of useful information from the web [8]. The correct solution would be to encourage the creation and use of decentralized, end-to-end encrypted services that do not store all of a user's data in one place. For the sake of national security and to protect the privacy of its citizens, India should develop its own social media platforms.
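To illustrate the principle behind such end-to-end style protection, the sketch below encrypts data on the client before it is handed to any storage provider, so the provider only ever holds ciphertext. It assumes the third-party Python package cryptography (its Fernet recipe) and deliberately glosses over key management, which is the hard part of any real deployment.

```python
# Minimal sketch: encrypt data on the client before handing it to a storage service,
# so the provider only ever stores ciphertext. Assumes the third-party "cryptography"
# package (pip install cryptography); key management is deliberately simplified.
from cryptography.fernet import Fernet

def encrypt_for_upload(plaintext: bytes, key: bytes) -> bytes:
    """Ciphertext that is safe to hand to an untrusted storage provider."""
    return Fernet(key).encrypt(plaintext)

def decrypt_after_download(ciphertext: bytes, key: bytes) -> bytes:
    """Only the key holder (the user, never the provider) can recover the data."""
    return Fernet(key).decrypt(ciphertext)

key = Fernet.generate_key()   # in practice derived from a user secret and never uploaded
blob = encrypt_for_upload(b"private message", key)
assert decrypt_after_download(blob, key) == b"private message"
```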
III. Data Leakage Prevention (DLP)
The use of data, particularly data about people, for data mining has serious ethical implications, and practitioners of data mining techniques must act responsibly by making themselves aware of the ethical issues that surround their particular application [6]. The explosion of online social networking (OSN) in recent years has caused damage to organizations through leakage of information. People's social networking behavior, whether careless or intentional, provides an opportunity for advanced persistent threat (APT) attackers to deploy social engineering techniques and undetectable zero-day exploits. APT attackers use spear-phishing methods that target victim organizations through social media in order to conduct reconnaissance and steal confidential proprietary information. OSN is among the most challenging channels of information leakage, and the underlying factors that lead employees to leak information through it can be explained through a theoretical lens from information systems research. Because people's social networking behavior turns OSN into an APT attack vector, security education, training and awareness (SETA) is recommended for organizations to combat these threats. Various data mining techniques (induction, compression and approximation) and algorithms have been developed to mine the large volumes of heterogeneous data stored in data warehouses [3].
Data leak prevention (DLP) is a suite of technologies aimed at stemming the loss of sensitive
information that occurs in enterprises across the globe. By focusing on the location, classification and
monitoring of information at rest, in use and in motion, this solution can go far in helping an enterprise get a
handle on what information it has, and in stopping the numerous leaks of information that occur each day. DLP
is not a plug-and-play solution. The successful implementation of this technology requires significant
preparation and diligent ongoing maintenance. Enterprises seeking to integrate and implement DLP should be
prepared for a significant effort that, if done correctly, can greatly reduce risk to the organization. Those
implementing the solution must take a strategic approach that addresses risks, impacts and mitigation steps,
along with appropriate governance and assurance measures.
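As a toy illustration of the "classification and monitoring of information at rest" idea, the sketch below scans text files for patterns that resemble sensitive records and flags them for review. The regular expressions and file selection are simplistic placeholder assumptions; production DLP products combine fingerprinting, dictionaries and contextual rules.

```python
# Toy illustration of DLP-style classification of data at rest: scan text files for
# patterns that resemble sensitive records and flag them for review. The regular
# expressions are simplistic placeholders, not production-grade detectors.
import re
from pathlib import Path

SENSITIVE_PATTERNS = {
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_like_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_file(path: Path) -> dict[str, int]:
    """Count pattern hits per category so files can be triaged for monitoring."""
    text = path.read_text(errors="ignore")
    return {name: len(rx.findall(text)) for name, rx in SENSITIVE_PATTERNS.items()}

if __name__ == "__main__":
    for candidate in Path(".").glob("**/*.txt"):
        hits = classify_file(candidate)
        if any(hits.values()):
            print(f"{candidate}: {hits}")   # flag for review rather than block outright
```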
3.1 Data Anonymization: Removing Personally Identifiable Information from Data Sets
A k-anonymized dataset has the property that each record is indistinguishable from at least k-1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. New methods can be developed that explore the space of possible anonymizations in a way that tames the combinatorics of the problem, along with data-management strategies that reduce reliance on expensive operations such as sorting. A desirable feature of protecting privacy through k-anonymity is its preservation of data integrity. Despite its intuitive appeal, it is possible that non-integrity-preserving approaches to privacy (such as random perturbation) may produce a more informative result in many circumstances. Indeed, it may be interesting to consider combined approaches, such as k-anonymizing only a subset of the potentially identifying columns and randomly perturbing the others. A better understanding of when and how to apply the various privacy-preserving methods deserves further study. Optimal algorithms will be useful in this regard, since they eliminate the possibility that a poor outcome is the result of a highly sub-optimal solution rather than an inherent limitation of the specific technique.
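For concreteness, the following small sketch checks whether a set of records satisfies k-anonymity with respect to a chosen set of quasi-identifier fields, i.e. whether every combination of quasi-identifier values is shared by at least k records. The generalization and suppression steps needed to achieve k-anonymity, which are where the NP-hardness mentioned above lives, are not shown; the field names and data are hypothetical.

```python
# Minimal sketch: verify k-anonymity over chosen quasi-identifier fields, i.e. every
# combination of quasi-identifier values must be shared by at least k records.
# Generalization/suppression to *achieve* k-anonymity is not shown.
from collections import Counter

def is_k_anonymous(records: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "530**", "age_band": "20-29", "diagnosis": "flu"},
    {"zip": "530**", "age_band": "20-29", "diagnosis": "cold"},
    {"zip": "560**", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "560**", "age_band": "30-39", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "age_band"], k=2))   # True: each group has 2 rows
print(is_k_anonymous(records, ["zip", "age_band"], k=3))   # False
```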
3.2 De-Identification and Linking Data Records
Common strategies for de-identifying datasets are deleting or masking personal identifiers, such as name and
social security number, and suppressing or generalizing quasi-identifiers, such as date of birth and zip code.
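A short sketch of the two strategies just named, masking direct identifiers and generalizing quasi-identifiers such as date of birth and zip code, might look as follows; the field names and masking rules are purely illustrative assumptions.

```python
# Short sketch of the two strategies above: mask direct identifiers outright and
# generalize (coarsen) quasi-identifiers. Field names and rules are illustrative.
def de_identify(record: dict) -> dict:
    out = dict(record)
    out["name"] = "REDACTED"                       # delete/mask direct identifiers
    out["ssn"] = "***-**-" + record["ssn"][-4:]    # partial masking
    out["dob"] = record["dob"][:4]                 # generalize date of birth to a year
    out["zip"] = record["zip"][:3] + "**"          # generalize zip code to a region
    return out

print(de_identify({"name": "A. Sample", "ssn": "123-45-6789",
                   "dob": "1990-07-14", "zip": "53045"}))
# {'name': 'REDACTED', 'ssn': '***-**-6789', 'dob': '1990', 'zip': '530**'}
```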
3.3 DLP Implementation Challenges
User resistance to change is the most difficult obstacle and has to be handled with the greatest care. Training workshops and seminars must be held on a regular basis to give users confidence in adopting DLP procedures. The effectiveness of the DLP solution must be closely monitored so that any issues that arise during implementation can be ironed out. Likewise, over-optimism needs to be kept in check, as people tend to get carried away and become over-dependent on the DLP technology. The policy and procedure framework must be properly documented and implemented accordingly.

IV. Ideology and Reasons
This paper motivates future work in this area through a review of the field and related research questions. Specifically, it defines the data leak prevention problem, describes current approaches, and outlines potential research directions in the field. As part of this discussion, the paper explores the idea that while intrusion detection techniques may be applicable to many aspects of the data leak prevention problem, the problem is distinct enough to require its own solutions. A social network is a social structure made up of individuals or organizations, called nodes, which are connected by one or more specific types of interdependency, such as friendship, common interest, financial exchange, or relationships of belief, knowledge or prestige. A cyber threat can be unintentional or intentional, targeted or non-targeted, and it can come from a variety of sources, including foreign nations engaged in espionage and information warfare, criminals, hackers, virus writers, and disgruntled employees and contractors working within an organization. Social networking sites are used not only to communicate and interact with people globally but are also an effective channel for business promotion. Cyber threats in social networking websites therefore deserve investigation and study: by reviewing the history of online social websites, one can classify their types, discuss the cyber threats they face, suggest anti-threat strategies and anticipate future trends of such highly popular websites. Social networking sites spread information faster than any other medium. Over 50% of people learn about breaking news on social media. 65% of traditional media reporters and editors use sites like Facebook and LinkedIn for story research, and 52% use Twitter. Social networking sites are the top news source for 27.8% of Americans, ranking close to newspapers (28.8%) and above radio (18.8%) and other print publications (6%). Twitter and YouTube users reported the July 20, 2012 Aurora, CO theater shooting before news crews could arrive on the scene, and the Red Cross urged witnesses to tell family members they were safe via social media outlets. In the same breath one could argue that social media enables the spread of unreliable and false information: 49.1% of people have heard false news via social media. On September 5, 2012, false rumors of fires, shootouts, and caravans of gunmen in a Mexico City suburb, spread via Twitter and Facebook, caused panic, flooded the local police department with over 3,000 phone calls, and temporarily closed schools.
V. Conclusion
Mining data with high accuracy and low processing time is a demanding task. Research that combines rule induction with association rule mining algorithms has been shown to be beneficial in terms of accuracy and processing time [5]. Further research can examine these aspects of data leakage in India and how the issues can be addressed by applying the right prevention techniques or by correctly implementing data handling methods. A person's private information can be among their most valuable assets. DLP solutions offer a multifaceted capability that can significantly increase a person's ability to manage risks to their key information assets. However, these solutions can be complex and prone to disrupting other processes and organizational culture if improperly or hurriedly implemented. Careful planning and preparation, communication and awareness training are paramount in deploying a successful DLP program, whether among the general population of India or among entrepreneur groups.
References
Journal Papers:
[1] Brijesh Kumar Baradwaj and Saurabh Pal, Mining Educational Data to Analyze Students' Performance, International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 6, 2011.
[2] V. Sangamithra, T. Kalaikumaran and S. Karthik, Data Mining Techniques for Detecting the Crime Hotspot by Using GIS, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), ISSN 2278-1323, Volume 1, Issue 10, December 2012.
[3] Amit Kapoor, Data Mining: Past, Present and Future Scenario, International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), ISSN 2278-6856, Volume 3, Issue 1, January-February 2014.
[4] Naveeta Mehata and Shilpa Dang, Data Mining Techniques for Identifying the Customer Behavior of Investment in Stock Market in India, International Journal of Marketing, Financial Services & Management Research, ISSN 2277-3622, Vol. 1, Issue 11, November 2012.
[5] Kapil Sharma, Sheveta Vashisht, Heena Sharma, Richa Dhiman and Jasreena Kaur Bains, A Hybrid Approach Based on Association Rule Mining and Rule Induction in Data Mining, International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, Volume 3, Issue 1, March 2013.
Books:
[6] Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Publishers, Burlington, MA, USA).
Chapters in Books:
[7] N. R. Srinivasa Raghavan, Data Mining in E-Commerce: A Survey, Sadhana, Vol. 30, Parts 2 & 3, April/June 2005, 275-289.
Theses:
[8] Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri and Kyuseok Shim, Data Mining and the Web: Past, Present and Future, Bell Laboratories.