
Data exfiltration

TL;DR: A review of data exfiltration attack vectors and countermeasures revealed that most of the state of the art is focussed on preventive and detective countermeasures and significant research is required on developing investigative countermeasures that are equally important.
About: This article is published in Journal of Network and Computer Applications.The article was published on 2018-01-01 and is currently open access. It has received 76 citations till now.

Summary (11 min read)


1. Introduction

  • Data theft (formally referred to as data exfiltration) is one of the main motivators for cyber-attacks irrespective of whether carried out by organised crime, commercial competitors, state actors or even “hacktivists”.
  • The attack can either be network-based or physical-based.
  • This report also reveals that among all the data leaks, about 24% occurred in the financial sector, about 15% occurred in the healthcare sector, about 15% occurred in retail and accommodation sector, and about 12% occurred in public sector entities.
  • The authors have systematically analysed the countermeasures in terms of their contributions and limitations.
  • As their aim is to survey both attack vectors and a broad set of countermeasures — preventive, detective and investigative — the authors refer to this topic as “data exfiltration” rather than “data leakage prevention”, which implies a specific focus on preventive measures.

Objective

  • The authors' review differs from the existing reviews in two ways: (1) whilst all the existing reviews report challenges in preventing or mitigating data exfiltration, their review reports the attack vectors used to exfiltrate data.
  • By challenges, the authors mean factors such as the number of leakage channels, the difficulty of managing access rights, encryption, and steganography.
  • Their review does not provide any such details; rather, it provides insight into data exfiltration caused by the malicious activities of a remote attacker.
  • Therefore, the scope of this review is limited to data exfiltration from computers, web servers, databases, virtual machines, and network.
  • The authors have excluded the papers that report data exfiltration from domains such as mobile devices, IoT devices, and printers.

Included papers

  • The authors' review significantly differs from the existing reviews in terms of the included papers.
  • A major reason for such a huge difference in the pool of papers is that the existing reviews (i.e. [2], [3], and [4]) are primarily focused on the insider attacks and industrial countermeasures while their review focuses on external attackers and research-based countermeasures.
  • The review of Brindha and Shaji [5] is focussed on data exfiltration challenges and does not report any countermeasures.

Results

  • The findings from their review do not overlap with the findings from the existing reviews.
  • Similar to [2] and [3], their review also presents a classification of the countermeasures, however, their criteria and the resulting classification are quite different to the classifications presented in [2] and [3].
  • This classification overlaps with their classification to a certain degree.

Our Contributions

  • This paper provides a broad and structured overview on data exfiltration.
  • It also highlights several distinct open issues and challenges that require the immediate attention of the research community.

3. Research Methodology

  • The authors followed a structured process of identifying and selecting the relevant papers from which the relevant data was extracted and analysed to answer the research questions.
  • Table 2 shows the research questions and their respective motivators that stimulated their analysis of the reviewed papers.
  • One such motivator, for example, was to get an overview of the countermeasures designed and incorporated by the research community to fight data exfiltration attacks.

3.1. Data Source and Search Strategy

  • Seven computer science publication databases shown in Table 3 were each queried for four search terms: “data exfiltration”, “data leakage”, “data breach” and “data theft”.
  • The authors derived these search terms from a series of pilot searches, wherein various synonyms for data exfiltration were explored with the aim of finding a set of results which appeared most relevant and neither too broad nor too narrow.
  • In the rest of this survey, the authors use the terms paper, study, and document interchangeably for referring to the papers selected for this survey.
  • Table 3 lists the database sources and their URLs (e.g., ACM: http://portal.acm.org).

3.2. Selection of the Papers

  • After retrieving the documents from these databases, the authors reviewed the title of each document and made a binary decision as to whether the full text would be relevant to the study’s aims (i.e., it appeared to detail either exfiltration attack vectors or countermeasures).
  • After the selection based on the title, the full text of each of the selected papers was reviewed.
  • Some papers were discarded due to the lack of relevance of the full text to their research questions, bringing the pool of papers to its final total of 108 papers.

3.3. Data Extraction and Synthesis

  • After selecting 108 papers, the authors extracted the data using a pre-designed data exfiltration form for answering the research questions.
  • The six steps include familiarizing with the data: the data extracted from the papers and recorded in an Excel sheet were read carefully to gain a deep understanding of the data exfiltration attack vectors, the countermeasures, and the research gaps.
  • After developing this understanding of the extracted data, initial codes were assigned to the key points in the data.
  • The themes were analysed to divide the attack vectors and countermeasures into potential themes at multiple levels.
  • The themes at all levels were reviewed and the required modifications were made.

4. RQ1. Data Exfiltration Attack Vectors

  • This section reports the results of data analysis about data exfiltration attack vectors.
  • For the details on these attack vectors, readers are referred to [14-16].
  • The authors highlight only those attack vectors that are most frequently reported in the 108 included data exfiltration countermeasures.
  • Fig. 3 shows that their initial categories are network and physical.

4.1. Network-based Attack Vectors

  • Network-based attack vectors include those vectors that use existing network infrastructure for stealing data from an organization.
  • The identified network-based attack vectors are shown in Fig. 3 and described in following sub-sections.

4.1.2. Passive monitoring

  • Sniffing wireless broadcast traffic is a well-known but often overlooked data exfiltration vector.
  • With many wireless networks still insufficiently secured [20], and businesses making ever more use of wireless-connected laptops, tablets, smartphones and other devices, the threat of attackers passively listening to an organisation's traffic is very real, and many connected devices leak information [21].
  • Such broadcast interceptions are not limited to typical wireless networks.
  • A notable example is the 2009 discovery that military adversaries of the United States in Iraq were able to access the video feeds of Predator drones simply by listening on the correct channel [22].

4.1.3. Timing channels

  • This method of data exfiltration appears in the literature describing threats but rarely in the literature aiming to detect or prevent exfiltration, yet it is a plausible exfiltration vector for sophisticated attackers.
  • A timing channel is an extremely subtle form of a hidden channel which works by sending innocuous packets to an external recipient at particular times, such that the time delay between packets represents a particular byte value [15].
  • Such a vector is very difficult to detect, as any traffic could potentially be carrying a timing channel, and the communicated information is not embedded in the packets themselves, merely in the delay between them.
  • Examples include channels operating locally via network sockets [23] and even variations in keyboard usage [24].
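The encoding described in [15] can be sketched in a few lines. This is a minimal simulation only, assuming a hypothetical scheme where each secret byte maps to one inter-packet delay (delay = base + value × step); real channels ride on live traffic and must tolerate network jitter.

```python
# Sketch of a timing covert channel: each byte of the secret is encoded as an
# inter-packet delay. Purely illustrative; BASE_MS and STEP_MS are hypothetical.

BASE_MS = 10   # assumed minimum gap between packets (milliseconds)
STEP_MS = 2    # assumed milliseconds per unit of byte value

def encode_delays(secret: bytes) -> list[int]:
    """Map each secret byte to an inter-packet delay in milliseconds."""
    return [BASE_MS + b * STEP_MS for b in secret]

def decode_delays(delays: list[int]) -> bytes:
    """Recover the secret from observed inter-packet delays."""
    return bytes((d - BASE_MS) // STEP_MS for d in delays)

if __name__ == "__main__":
    delays = encode_delays(b"key")
    print(decode_delays(delays))  # b'key'
```

The detection difficulty noted above follows directly: the packets themselves are innocuous, and only the gaps carry information.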

4.1.4. Virtual machine vulnerabilities

  • Modern businesses are increasingly making use of Virtual Machines (VM) hosted by a third party.
  • These threats mostly come from co-residency, where a malicious virtual machine is set up on the same physical machine as a target virtual machine.

4.1.5. Spyware and Malware

  • Spyware is installed on a user's computer to monitor the user's activity and report back to a third party [30].
  • Such software is sometimes used legitimately by software providers, for example to send users relevant updates based on their activities.
  • Spyware includes malware, adware, cookies, web bugs, browser hijackers and key loggers [15].
  • Recently designed malware can scan a user's personal computer for personal information and send it back as an email attachment to all of the user's email contacts.

4.1.6. Phishing

  • Upon visiting the fraudulent website, the user is asked to enter a username, password, bank account number and similar details, which ultimately lands sensitive personal information in the hands of the attacker.
  • Some of the famous types of phishing attacks include Deceptive phishing, DNS-based phishing, and Search Engine phishing [32].

4.1.7. Cross Site Scripting

  • Cross Site Scripting (XSS) is another way of stealing personal information from an authenticated session by injecting a malicious script in an attacked website [33].
  • Once the malicious script executes, it gives an attacker full access to the information held by the trusted website [34].
  • XSS is a popular method among attackers for stealing information, as is evident from the recent OWASP ranking [35], which lists it as the third biggest attack vector for the leakage of personal and sensitive information.
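The mechanism and its standard defence (output encoding) can be shown in miniature. This is a generic sketch, not any surveyed system; the render functions and payload are hypothetical.

```python
# Minimal illustration of why XSS works and how output encoding defeats it:
# a page that interpolates untrusted input verbatim lets attacker script run,
# while HTML-escaping renders the same input inert.
import html

def render_comment_unsafe(comment: str) -> str:
    return f"<p>{comment}</p>"               # vulnerable: raw interpolation

def render_comment_safe(comment: str) -> str:
    return f"<p>{html.escape(comment)}</p>"  # encoded: script cannot execute

payload = "<script>steal(document.cookie)</script>"
assert "<script>" in render_comment_unsafe(payload)
assert "<script>" not in render_comment_safe(payload)
```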

4.2. Physical Attack Vectors

  • Physical attack vectors are those attacks that obtain unauthorized physical access to data and move it to a new physical location.
  • The identified physical attack vectors are shown in Fig. 3 and described in the following sub-sections.

4.2.1. Physical theft

  • It is also possible to first print data and then conceal the printed material while leaving the premises of an organization.
  • Instead of copying or printing sensitive data, an attacker may steal a physical device on which sensitive data is stored [14].
  • This may happen due to weak physical security at an organization's premises or due to the carelessness of an individual employee who leaves a device unattended.

4.2.2. Dumpster diving

  • At times, organizations adopt weak practices for destroying information (both hard copy and soft copy), such as throwing printed documents or CDs into a dustbin.
  • If an organization discards printed documents and CDs in a dustbin, an adversary may well search the dustbin for information of potential value.
  • From the attack-vector summary table: Virtual machine vulnerabilities: establishment of a covert channel for exfiltration of data between two VMs hosted on the same physical machine. Spyware and malware: software used by remote attackers to identify personal information on a computer and send it back to the attacker via some medium such as an email attachment [14, 15, 30, 36]. Phishing: an individual is invited to visit a fraudulent website, and visiting the website in turn leaks the individual's personal information [31, 37]. Physical theft: copying sensitive data to a removable device (CD, DVD, USB, etc.) and taking the device out of the organization [14-16].

5. RQ2. Countermeasures

  • This section reports results of the data analysis about data exfiltration countermeasures.
  • At the abstract level, these countermeasures can be divided into three categories based upon whether a countermeasure is preventing, detecting or investigating data exfiltration.
  • The number and percentage of the selected studies related to each of these three categories are shown in Fig.
  • It can be seen that the research community is primarily focussing on preventive and detective countermeasures, while investigative countermeasures lack sufficient exploration.
  • Data at rest is the data that is stored in a storage device (hard drive or mobile device) and is not currently under any kind of processing.

5.1. Classification of countermeasures

  • Since the number of countermeasures pertaining to each of the three basic categories (preventive, detective and investigative) was quite high, the authors have further categorized the countermeasures based on their thematic analysis as reported in Section 3.3.
  • Fig. 7 classifies the data exfiltration countermeasures; an accompanying chart (not reproduced here) plots the number and percentage of studies per data state (at rest / in transit).

5.1.1. Preventive countermeasures

  • These countermeasures are incorporated in the endpoint devices (such as PCs, Laptops, and Servers) to control access to the data resided on these devices or apply particular security tactics (such as encryption, data classification, and cyber deception) to help secure data against exfiltration attacks.
  • The classification comprises: Preventive countermeasures (data classification; access control: discrete, mandatory, and role-based; encryption; cyber deception; distributed storage; low-level snooping defence); Detective countermeasures (packet inspection: known channel inspection and deep packet inspection, the latter covering steganographic, encrypted, and normal traffic; anomaly-based detection: network-based, host-based, and network + host); and Investigative countermeasures.
  • A figure lists the papers pertaining to each category of preventive countermeasures.

5.1.1.2.1. Mandatory Access Control

  • Access to resources or data is controlled by the access policy defined by an administrator and enforced via the operating system.
  • Authors provide a detailed experimental evaluation of the proposed approach.
  • The reference monitor takes input from access control module about security policies and ensures the enforcement of these security policies in a MapReduce system.
  • At the same time, the architecture focuses on reducing the burden of key management on the client side by storing the security control information for each file in a separate security control file alongside the corresponding encrypted data file.
  • Suzuki et al. [59] develop an operating system, called Salvia, with a focus on preventing data leakage.
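None of the surveyed systems is reproduced here, but the reference-monitor idea common to them can be sketched as a Bell-LaPadula-style label check: an administrator defines a lattice of sensitivity labels, and the monitor (not the data owner) decides every access. Labels and levels below are hypothetical.

```python
# Sketch of a mandatory access control check enforced by a reference monitor.
# The label lattice is defined by an administrator, not by data owners.

LEVELS = {"public": 0, "internal": 1, "secret": 2}

def may_read(subject_label: str, object_label: str) -> bool:
    """'No read up': a subject may read only objects at or below its level."""
    return LEVELS[subject_label] >= LEVELS[object_label]

def may_write(subject_label: str, object_label: str) -> bool:
    """'No write down': prevents copying data to a lower (leakier) level."""
    return LEVELS[subject_label] <= LEVELS[object_label]

assert may_read("secret", "internal")
assert not may_read("public", "secret")
assert not may_write("secret", "public")  # blocks exfiltration to a lower level
```

The "no write down" rule is what makes MAC relevant to exfiltration: even a compromised high-level process cannot legally write into a low-level, outward-facing object.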

5.1.1.2.2. Role-based Access Control

  • Unlike mandatory access control where data objects are labelled, in role-based access control users are assigned a particular role (e.g., developer, tester, accountant) and based on the role of the user, access is granted to various resources [47].
  • The authors claim that the proposed approach introduces no performance or deployment overhead, which seems unrealistic given that introducing three security layers would inevitably affect query processing time.
  • Specified relations between DTE objects then define the access controls across a network.
  • FlowWatcher sits between the user and web application to monitor HTTP requests and responses.
  • The work of Fabian provides some theoretical guidance for the security-aware usage of USB devices in an organization: asset inventories need to be extended to cover the output ports on machines, and secure configurations must be put in place to prevent the use of output ports where not authorised.
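The role-based model described at the top of this subsection reduces to two mappings: users to roles, and roles to permissions. A minimal sketch, with hypothetical role and permission names:

```python
# Sketch of role-based access control: access decisions reference roles,
# never individual users directly [47].

ROLE_PERMISSIONS = {
    "developer":  {"read_code", "write_code"},
    "tester":     {"read_code", "run_tests"},
    "accountant": {"read_ledger"},
}

USER_ROLES = {"alice": {"developer"}, "bob": {"tester", "accountant"}}

def has_permission(user: str, permission: str) -> bool:
    """A user holds a permission if any of their roles grants it."""
    return any(permission in ROLE_PERMISSIONS[r]
               for r in USER_ROLES.get(user, set()))

assert has_permission("alice", "write_code")
assert has_permission("bob", "read_ledger")
assert not has_permission("bob", "write_code")
```

Revoking a user's access is a single dictionary update on USER_ROLES, which is the administrative advantage RBAC offers over per-object labelling.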

5.1.1.2.3. Discrete Access Control

  • In discretionary access control, the onus of regulating access to data objects falls on their respective owners [47].
  • The data owner may grant only read access to one user but another user may have both read and write access.
  • Ko et al. [68] implement a kernel-level access control mechanism to discover and notify an end user about data transmission (both authorized and unauthorized), enabling the end user to take the required action.
  • Parties wishing to gain access to some portion of the sensitive data send an autonomous agent to run locally, where it has access to the data in order to search for the particular information its owner requires.
  • The authors tie the dissemination of files via USB storage to the classification of a file.
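The discretionary model of the opening bullets (owner-managed, per-user rights) can be sketched as an object carrying its own access control list. Class and names are hypothetical, not from any surveyed system.

```python
# Sketch of discretionary access control: each object carries an ACL managed
# by its owner, who may grant different rights to different users [47].

class Document:
    def __init__(self, owner: str):
        self.owner = owner
        self.acl: dict[str, set[str]] = {owner: {"read", "write"}}

    def grant(self, grantor: str, user: str, rights: set[str]) -> None:
        if grantor != self.owner:
            raise PermissionError("only the owner may change the ACL")
        self.acl.setdefault(user, set()).update(rights)

    def allowed(self, user: str, right: str) -> bool:
        return right in self.acl.get(user, set())

doc = Document(owner="alice")
doc.grant("alice", "bob", {"read"})           # read-only for bob
doc.grant("alice", "carol", {"read", "write"})
assert doc.allowed("bob", "read") and not doc.allowed("bob", "write")
assert doc.allowed("carol", "write")
```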

5.1.2. Detective countermeasures

  • Detective countermeasures aim to detect exfiltration attempts.
  • Unlike preventive countermeasures, which are proactive in nature, detective countermeasures are reactive: they detect data exfiltration attacks and stop them where possible.
  • This figure shows the incorporation of content inspection, host-based anomaly detection, and network-based anomaly detection.
  • If network behaviour deviates from the normal behaviour, transfer of data will be either stopped or security administrator will be alerted.
  • Fig. 11 shows the papers pertaining to each category in detective countermeasures.

5.1.2.1.1. Known channel inspection

  • This is a simple approach where outgoing network traffic is monitored on some known high-risk channel.
  • The ubiquity and simplicity of email as a transfer mechanism, combined with the relative ease with which it can be monitored via a mail proxy, make it a good target for detection systems.
  • The main objective of this work is to help an ordinary user differentiate between phishing and non-phishing emails and so avoid interacting with them.
  • The proposed technique consists of two steps.
  • First, monitoring of the data transmission on the known communication channel; second, computing a relation between data observed on the channel and the confidential data.
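The two-step technique above (observe the channel, then compute a relation to the confidential data) can be sketched with one plausible relation measure: Jaccard similarity over word trigrams between outgoing text and a known confidential document. The measure, threshold, and sample data are all hypothetical, not the surveyed technique's actual metric.

```python
# Sketch of known-channel inspection: score outgoing text (e.g., an email
# body seen at a mail proxy) against known confidential data.

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leak_score(outgoing: str, confidential: str) -> float:
    """Jaccard similarity between the two shingle sets (0.0 to 1.0)."""
    a, b = shingles(outgoing), shingles(confidential)
    return len(a & b) / len(a | b) if a | b else 0.0

CONFIDENTIAL = "q3 revenue fell nine percent against forecast per internal memo"
assert leak_score("fyi q3 revenue fell nine percent against forecast",
                  CONFIDENTIAL) > 0.3          # overlapping: raise an alert
assert leak_score("lunch at noon on friday works for me",
                  CONFIDENTIAL) == 0.0         # unrelated traffic passes
```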

5.1.2.1.2. Deep packet inspection

  • Unlike known channel inspection, deep packet inspection monitors all outgoing traffic for an overlap with sensitive data.
  • Such an approach provides a higher level of detection capability, as it ensures that not a single data packet goes uninspected.
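In its simplest form, deep packet inspection is multi-pattern matching over every outgoing payload. The sketch below uses naive substring search with hypothetical signatures; production systems use efficient multi-pattern matchers (e.g., Aho-Corasick automata) to keep up with line rate.

```python
# Sketch of deep packet inspection: every outgoing payload is scanned against
# a set of sensitive-data signatures, not just traffic on known channels.

SIGNATURES = [b"CONFIDENTIAL", b"SSN:", b"project-roadmap"]  # hypothetical

def inspect_packet(payload: bytes) -> list[bytes]:
    """Return the signatures found in a packet payload, if any."""
    return [sig for sig in SIGNATURES if sig in payload]

assert inspect_packet(b"GET /img.png HTTP/1.1") == []
assert inspect_packet(b"...CONFIDENTIAL draft, SSN: 123-45-6789...") == \
       [b"CONFIDENTIAL", b"SSN:"]
```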

Inspecting steganographic traffic

  • One of the common techniques used by hackers is to hide the sensitive data inside other non-sensitive data so that the installed detective system cannot detect the sensitive data.
  • Modern steganography works by identifying either redundant space within innocuous files or unused fields in common communication protocols (including the ubiquitous TCP/IP) and then encoding the message into these overlooked areas [107].
  • Since the proposed approach does not stop the egress of video and only removes the hidden data from the frames, it is not clear whether the removal of such hidden data causes any damage to the quality of the video.
  • The first is application identification: allowing network administrators to identify when traffic using the same protocol (SSH, HTTP) is devoted to a particular application (webmail, video streaming, or social media).
  • If the carrier data comes under attack in transit, the secret message hidden in the reserved bits alerts the communicating parties that the carrier data is under attack.

Inspecting encrypted traffic

  • Another technique used by attackers to evade detection is to first encrypt the data and then steal it.
  • The reverse-proxy will decrypt and inspect the client request and then engage in a TLS-encrypted communication with the remote server on behalf of the client.
  • The data guard sends the results of the detection process to the policy authority, which checks whether the data to be exported contains any sensitive signatures before sending it to a recipient.
  • The authors do not provide any details on the detection rate achieved by the proposed approach.
  • They discuss three possible approaches to handling encrypted communications within this system: detecting misuse of the encryption protocols, altering protocols to allow packet payload analysis, and finally statistical approaches, which examine packet sizes and time intervals.
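The third (statistical) approach can be sketched without touching any payload: compare a flow's packet-size statistics with a learned profile. The baseline numbers, threshold, and example flows below are hypothetical.

```python
# Sketch of statistical inspection of encrypted traffic: payloads are opaque,
# but packet sizes (and, in fuller systems, time intervals) still carry signal.
from statistics import mean, stdev

def is_suspicious(sizes: list[int], baseline_mean: float,
                  baseline_std: float, k: float = 3.0) -> bool:
    """Flag a flow whose mean packet size sits > k std-devs from baseline."""
    return abs(mean(sizes) - baseline_mean) > k * baseline_std

# Baseline learned from ordinary TLS browsing (hypothetical numbers).
base_mean, base_std = 520.0, 40.0
assert not is_suspicious([500, 530, 510, 540], base_mean, base_std)
assert is_suspicious([1400, 1380, 1420, 1390], base_mean, base_std)  # bulk upload
```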

Inspecting unencrypted traffic

  • A variety of approaches exist that can inspect data that is neither hidden nor encrypted.
  • The proposed approach seems suitable for prevention against malware attacks that gather and send out sensitive information out of the user’s PC via a P2P channel.
  • To build the training dataset, documents (both confidential and non-confidential) are clustered using the K-means algorithm.
  • More rigorous evaluation is required since the paper does not clarify the detection accuracy of the proposed idea or its effects on the overall performance.

5.1.2.2.1. Network-based anomaly detection

  • Network-Based Anomaly (NBA) detection techniques monitor network traffic to determine whether communication flows differ from baseline conditions in terms of traffic volume, source/destination address pairs, diversity of destination addresses, and time of day or the (mis) use of particular network protocols.
  • The difficulty with this category of techniques lies in the determination of the baseline conditions for ‘normal’ user activity.
  • A large number of anomaly-based detection techniques have been proposed, and delineating all of them is challenging.
  • The network traffic under monitoring is compared with both classes (normal and abnormal) to decide which class it belongs to.
  • An Extreme Learning Machine (ELM) is used for the classification of intrusion attempts.
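One of the baseline signals listed above, the diversity of destination addresses, is easy to sketch: learn the set of destinations a host normally contacts, then score new flows by how many go somewhere unseen. Hosts, addresses, and the threshold are hypothetical.

```python
# Sketch of one network-based anomaly signal: destination-address diversity.
# A host suddenly contacting many never-before-seen destinations is flagged.

def baseline_destinations(training_flows: list[str]) -> set[str]:
    """'Normal' destinations observed during a training window."""
    return set(training_flows)

def anomaly_ratio(flows: list[str], known: set[str]) -> float:
    """Fraction of flows going to destinations never seen in training."""
    if not flows:
        return 0.0
    return sum(dst not in known for dst in flows) / len(flows)

known = baseline_destinations(["10.0.0.5", "10.0.0.7", "mail.corp", "crm.corp"])
normal = ["10.0.0.5", "mail.corp", "crm.corp"]
odd = ["198.51.100.9", "203.0.113.4", "10.0.0.5"]
assert anomaly_ratio(normal, known) == 0.0
assert anomaly_ratio(odd, known) > 0.5   # mostly unknown destinations: alert
```

The hard part flagged above, determining the baseline, shows up here as the choice of training window: too short and legitimate new destinations alarm; too long and slow exfiltration gets baked into "normal".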

Semi-supervised mode

  • Data gathered from system level events through runtime monitors is represented in the form of quantitative data flow graph.
  • They evaluate a number of implementations of the Local Outlier Factor (LOF) algorithm, with the greatest reduction in processing time coming from the combination of a kd-tree index of neighbours and the Approximated k-Nearest Neighbours algorithm.
  • The system works in three steps: (1) a parser ensures that traffic using the IEC 60870-5-104 protocol is compatible with the Bro framework; (2) a learning component categorizes packets into whitelists and records timing statistics; (3) a detector compares each packet with the three whitelists and, if it matches none of them, the packet is considered abnormal network traffic and an alarm is raised.
  • The authors test the developed anomaly detection system with a number of attacks such as malware, man-in-the-middle, and spoofing attack.
  • The collected data is analysed to identify the behaviour patterns of data stealers so that organizations can be alerted to those patterns.
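The learn-then-detect whitelist scheme above reduces to set membership once packets are abstracted to tuples. The packet fields and sample values below are simplified and hypothetical, not the actual Bro/IEC 60870-5-104 representation.

```python
# Sketch of whitelist-based anomaly detection: a learning phase records which
# (source, destination, message type) tuples occur in benign traffic; the
# detector then alarms on any packet outside the whitelist.

Packet = tuple[str, str, str]  # (source, destination, message type)

def learn_whitelist(benign_packets: list[Packet]) -> set[Packet]:
    return set(benign_packets)

def detect(packet: Packet, whitelist: set[Packet]) -> bool:
    """Return True (raise alarm) if the packet matches no whitelist entry."""
    return packet not in whitelist

benign = [("plc1", "scada", "measurement"), ("scada", "plc1", "command")]
wl = learn_whitelist(benign)
assert not detect(("plc1", "scada", "measurement"), wl)
assert detect(("plc1", "attacker.example", "measurement"), wl)  # alarm
```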

Unsupervised mode

  • DBMSs have their own known vulnerabilities when it comes to data exfiltration, particularly with regard to their common deployment as part of the web applications.
  • They suggest that a “protective shell” be introduced into the DBMS which learns about the legal and illegal query strings for a given application.
  • It is very likely that the introduction of this new layer affects query performance; the paper neither talks about specific performance issues nor provides any evaluation details.
  • Flood and Keane [155] propose a similar approach in the context of cloud services.
  • In their version, a Finite State Machine is built from observations of training data.

Supervised mode

  • Berlin et al. [144] present a malicious behaviour detection system using Windows audit logs.
  • Logs are collected from enterprise users and from a sandboxed virtual machine.
  • Labelling is done using VirusTotal, which runs around 55 anti-malware engines over the samples to determine whether a sample is malware.
  • The proposed approach detects phishing and malware attacks launched to steal users’ access credentials.
  • Whilst users do not typically emit system calls themselves, their typical use of programs is captured by these patterns; at the same time, a model of system calls, rather than of selected program executions, would capture maliciously injected code that a user would be unaware of.
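The system-call-pattern idea above can be sketched with call n-grams: n-grams seen during normal use form the model, and traces containing unseen n-grams (e.g., from injected code) score as anomalous. The traces and the scoring are illustrative assumptions, not the surveyed system's method.

```python
# Sketch of host-based detection from system-call n-grams.

def ngrams(trace: list[str], n: int = 3) -> set[tuple[str, ...]]:
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def anomaly_score(trace: list[str], model: set, n: int = 3) -> float:
    """Fraction of the trace's n-grams never seen during normal operation."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    return sum(g not in model for g in grams) / len(grams)

normal = ["open", "read", "write", "close", "open", "read", "write", "close"]
model = ngrams(normal)
injected = ["open", "read", "socket", "connect", "sendto", "close"]
assert anomaly_score(normal, model) == 0.0
assert anomaly_score(injected, model) > 0.5   # unseen network-call patterns
```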

NBA

  • In transit: packet arrival time is used for detection; any packet that deviates from the expected arrival time indicates malicious activity. Addresses malware, man-in-the-middle, and spoofing attacks.
  • Wüchner et al. [147], in use: monitors database activity more effectively using density-based outlier detection to detect data exfiltration. Addresses SQL injection and XSS attacks.
  • Yang et al. [153], in use: a technique for collecting news stories and other reports on data stealers from online sources and performing statistical analysis to identify the behaviour patterns of such data stealers, helping organizations guard against those behaviours.

HBA

  • In use: a technique for training a database to protect itself from SQL injection attacks. Addresses SQL injection.
  • Flood & Keane [155], in use/in transit: correlates CPU activity with network activity; any large data transfer that does not correlate with host CPU activity is considered data exfiltration. Addresses SQL injection, XSS, malware, and phishing attacks.
  • Myers et al. [157], in use/in transit: a combination of network and host-based analysis to detect exfiltration. Addresses malware attacks.
  • In use/in transit: a framework for gathering network and host data and analysing it to detect data exfiltration.

5.1.3. Investigative countermeasures

  • After successful data exfiltration, it is usually not possible to reverse its impact.
  • Investigating a data exfiltration incident can help in mitigating the effects of an attack.
  • Information gathered during such investigation can also be useful for other enterprises.
  • Of course, at times it may not be possible for an enterprise to share such sensitive information with the broader community for confidentiality reasons.
  • Fig. 12 shows the papers pertaining to each category in investigative countermeasures.

5.2. Mapping of Attack Vectors to Countermeasures

  • Fig. 13 shows the mapping of the attack vectors identified in Section 4 onto the countermeasures reported in Section 5.
  • This mapping provides a reader with an understanding of which attack vectors are addressed by which countermeasures.
  • It is imperative to mention here that some of the preventive and detective countermeasures adopt an aggressive and proactive approach (instead of reactive) for protecting data.
  • Zhuang et al. [90] propose to distribute data smartly among multiple clouds which would reduce the risk of leaking all data in a single attack.
  • The countermeasures that address attack vectors other than those reported in Section 3 (such as brute-force attack and flooding attack) are also not shown in Fig. 13.

6. RQ3. Open and Future Challenges

  • According to one estimate, the total annual cost of cybercrime is around $400 billion [180].
  • Data exfiltration is the main motivator for these attacks.
  • These systems include Intrusion Detection System, Intrusion Prevention Systems, Security Information and Event Management (SIEM) system, Anti-malware, and Firewalls.
  • Whilst these data exfiltration countermeasures have recently attracted the attention of the research community, there exist several open and challenging issues.

6.1. Performance

  • The in-depth critical analysis of the 108 countermeasures reveals that performance is one of the most critical qualities for systems designed to prevent, detect, or investigate data exfiltration.
  • The major reason for such poor performance is the large size, high speed, and heterogeneous nature of data dealt with by these systems.
  • It is imperative to collect and analyse this big data in real-time and without causing significant delay in any data transmission process.
  • Furthermore, feature selection over security event data is another potential approach for improving performance; it needs to be investigated how feature selection tools and techniques can be developed and incorporated into the defence against data exfiltration attacks.

6.2. Evaluation

  • The authors strongly emphasise that a generic framework needs to be developed for the assessment of data exfiltration countermeasures, one that can guide researchers and practitioners on how to evaluate their systems.
  • These limitations make the datasets unable to reflect the actual strengths and weaknesses of the proposed countermeasure.
  • The authors assert that such an evaluation framework should provide guidance on evaluating a countermeasure against Advanced Persistent Threats (APTs).
  • For improving the standard of evaluation, the authors also encourage close collaboration between academia and industry.

6.3. Automation

  • A lack of automation impacts performance of the overall system in terms of deployment and response time.
  • The involvement of network administrators and investigative experts increases cost and makes the incorporation of such systems quite challenging for enterprises.
  • Similarly, personal users are often reluctant to pay attention to security alerts and approvals during data transmission.
  • The dependency on a dedicated human should be reduced to a minimum to make these systems more acceptable for enterprises and personal use.

6.4. Privacy, Encrypted Traffic, and Accuracy

  • With respect to data exfiltration countermeasures, the three terms (i.e. Privacy, Encrypted Traffic, and Accuracy) are closely related.
  • These countermeasures directly monitor the outgoing network traffic generated by users and scan it for detecting sensitive information.
  • To address the privacy and security concerns, the approach of encrypting data before sending it out is broadly adopted.
  • To address this issue, countermeasures have been developed as reported in their review ([111-115]) that can examine the encrypted traffic to detect data exfiltration.

6.5. Investigative Countermeasures

  • In their review, the authors analysed and categorized data exfiltration countermeasures into preventive, detective, and investigative categories.
  • As is evident from Fig. 2, the research community remains primarily focussed on preventive and detective countermeasures.
  • Whilst the preventive and detective countermeasures are quite crucial for fighting data exfiltration, the authors believe investigative countermeasures are equally important.
  • Furthermore, identifying and prosecuting attackers sends a very strong message to other potential attackers that there are systems in place to track and catch them.

6.6. High Cost

  • Cost is one of the primary concerns both for individuals and especially enterprises while deciding upon the incorporation of a particular system, tool, or technology in their infrastructure.
  • The authors assert that the incorporation of specialized hardware should be discouraged in the design of data exfiltration countermeasures, as deploying and maintaining hardware on a large scale would be largely infeasible for enterprises.
  • Similarly, a thorough investigation is required to explore ways of reducing the cost for storing and maintaining negative data in cyber deception approaches.

7. Limitations

  • There are two reasons for this limitation: (1) there exists a large number of attack vectors, as reported in [14-16], and covering all of them is quite challenging; (2) the authors' review is focussed on countermeasures rather than on attack vectors.
  • The motivation for including attack vectors is to contextualize the discussion on the countermeasures.
  • Similarly, a wide range of literature exists that directly or indirectly addresses data exfiltration in various domains (mobile computing, IoT devices, printers); it may not be possible for a single review like theirs to cover all such literature.
  • A wide range of papers exist on access control or encryption but it was not the intention of this review to include all those studies.
  • Similarly, extending the survey by following citations from included publications could unveil larger bodies of work on exfiltration methods which did not match their queries.

8. Conclusion

  • Data exfiltration is a serious and ongoing issue in the field of information security.
  • Another critical contribution of their review is mapping the applicability of countermeasures to particular data states (1: in use, 2: in transit, and 3: at rest), which gives an insight into which data states are most often attacked and how countermeasures protect data in each of these states.
  • The lack of such capability leads to poor performance and response time.
  • This is pertinent given the increasing concerns over a surveillance society.
  • The authors hope that the insights provided in this paper will provide academic researchers and industry practitioners with new directions and motivations for enhancing research and development efforts to devise, evaluate, and deploy new and innovative countermeasures for securing against data exfiltration attacks.

Citations
Journal ArticleDOI
TL;DR: This survey paper intends to bring all those methods and techniques that could be used to detect different stages of APT attacks, learning methods that need to be applied and where to make the threat detection framework smart and undecipherable for those adapting APT attackers.
Abstract: Threats that have been primarily targeting nation states and their associated entities have expanded the target zone to include the private and corporate sectors. This class of threats, well known as advanced persistent threats (APTs), are those that every nation and well-established organization fears and wants to protect itself against. While nation-sponsored APT attacks will always be marked by their sophistication, APT attacks that have become prominent in corporate sectors do not make it any less challenging for the organizations. The rate at which the attack tools and techniques are evolving is making any existing security measures inadequate. As defenders strive to secure every endpoint and every link within their networks, attackers are finding new ways to penetrate into their target systems. With each day bringing new forms of malware, having new signatures and behavior that is close to normal, a single threat detection system would not suffice. While it requires time and patience to perform APT, solutions that adapt to the changing behavior of APT attacker(s) are required. Several works have been published on detecting an APT attack at one or two of its stages, but very limited research exists in detecting APT as a whole from reconnaissance to cleanup, as such a solution demands complex correlation and fine-grained behavior analysis of users and systems within and across networks. Through this survey paper, we intend to bring all those methods and techniques that could be used to detect different stages of APT attacks, learning methods that need to be applied and where to make your threat detection framework smart and undecipherable for those adapting APT attackers. We also present different case studies of APT attacks, different monitoring methods, and mitigation methods to be employed for fine-grained control of security of a networked system. 
We conclude this paper with different challenges in defending against APT and opportunities for further research, ending with a note on what we learned during our writing of this paper.

200 citations

Journal ArticleDOI
10 Jan 2021-Sensors
TL;DR: In this paper, the authors compared several machine learning (ML) methods such as k-nearest neighbor (KNN), support vector machine (SVM), decision tree (DT), naive Bayes (NB), random forest (RF), artificial neural network (ANN), and logistic regression (LR) for both binary and multi-class classification on Bot-IoT dataset.
Abstract: In recent years, there has been a massive increase in the number of Internet of Things (IoT) devices as well as the data generated by such devices. The participating devices in IoT networks can be problematic due to their resource-constrained nature, and integrating security on these devices is often overlooked. This has resulted in attackers having an increased incentive to target IoT devices. As the number of attacks possible on a network increases, it becomes more difficult for traditional intrusion detection systems (IDS) to cope with these attacks efficiently. In this paper, we highlight several machine learning (ML) methods such as k-nearest neighbour (KNN), support vector machine (SVM), decision tree (DT), naive Bayes (NB), random forest (RF), artificial neural network (ANN), and logistic regression (LR) that can be used in IDS. In this work, ML algorithms are compared for both binary and multi-class classification on the Bot-IoT dataset. Based on several parameters such as accuracy, precision, recall, F1 score, and log loss, we experimentally compared the aforementioned ML algorithms. In the case of an HTTP distributed denial-of-service (DDoS) attack, the accuracy of RF is 99%. Furthermore, other simulation results based on precision, recall, F1 score, and log loss metrics reveal that RF outperforms on all types of attacks in binary classification. However, in multi-class classification, KNN outperforms other ML algorithms with an accuracy of 99%, which is 4% higher than RF.
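The comparison described in this abstract can be sketched with scikit-learn; this is an illustrative snippet, not the cited study's code, and it substitutes synthetic data for the Bot-IoT dataset (the model subset, split, and parameters are arbitrary choices):

```python
# Hedged sketch: comparing classifiers for intrusion detection, in the
# spirit of the cited study. Synthetic data stands in for Bot-IoT.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for a labelled traffic dataset (benign vs. attack).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Report the same kinds of metrics the study compares.
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```

On real traffic data the relative ranking of the classifiers would of course depend on the dataset and features, which is exactly the comparison the cited paper performs.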

67 citations

Journal ArticleDOI
TL;DR: A Multivocal Literature Review that has systematically selected and reviewed both academic and grey (blogs, web pages, white papers) literature on different aspects of security orchestration published from January 2007 until July 2017 is reported.
Abstract: Organizations use diverse types of security solutions to prevent cyber-attacks. Multiple vendors provide security solutions developed using heterogeneous technologies and paradigms. Hence, it is challenging, if not impossible, to easily make security solutions work in an integrated fashion. Security orchestration aims at smoothly integrating multivendor security tools that can effectively and efficiently interoperate to support the security staff of a Security Operation Centre (SOC). Given the increasing role and importance of security orchestration, there has been an increasing amount of literature on different aspects of security orchestration solutions. However, there has been no effort to systematically review and analyze the reported solutions. We report a Multivocal Literature Review that has systematically selected and reviewed both academic and grey (blogs, web pages, white papers) literature on different aspects of security orchestration published from January 2007 until July 2017. The review has enabled us to provide a working definition of security orchestration and classify the main functionalities of security orchestration into three main areas—unification, orchestration, and automation. We have also identified the core components of a security orchestration platform and categorized the drivers of security orchestration based on technical and socio-technical aspects. We also provide a taxonomy of security orchestration based on the execution environment, automation strategy, deployment type, mode of task, and resource type. This review has helped us to reveal several areas of further research and development in security orchestration.

50 citations

Journal ArticleDOI
TL;DR: This study proposes to use the bio-inspired method of particle swarm optimization (PSO), which automatically selects the exclusive features that contain the novel android debug bridge (ADB), to enhance the machine learning prediction that detects unknown root exploits.
Abstract: The increasing demand for Android mobile devices and blockchain has motivated malware creators to develop mobile malware to compromise the blockchain. Although the blockchain is secure, attackers have managed to gain access to the blockchain as legal users, thereby compromising important and crucial information. Examples of mobile malware include root exploit, botnets, and Trojans, and root exploit is one of the most dangerous malware. It compromises the operating system kernel in order to gain root privileges which are then used by attackers to bypass the security mechanisms, to gain complete control of the operating system, to install other possible types of malware to the devices, and finally, to steal victims' private keys linked to the blockchain. For the purpose of maximizing the security of the blockchain-based medical data management (BMDM), it is crucial to investigate the novel features and approaches contained in root exploit malware. This study proposes to use the bio-inspired method of particle swarm optimization (PSO), which automatically selects the exclusive features that contain the novel android debug bridge (ADB). This study also adopts boosting (adaboost, realadaboost, logitboost, and multiboost) to enhance the machine learning prediction that detects unknown root exploit, and scrutinizes three categories of features, including (1) system command, (2) directory path and (3) code-based. The evaluation gathered from this study suggests a marked accuracy value of 93% with Logitboost in the simulation. Logitboost also helped to predict all the root exploit samples in our developed system, the root exploit detection system (RODS).

48 citations


Cites background from "Data exfiltration"

  • ...the common types of malware available such as root exploit, botnet, spyware, worm, and Trojan, the most dangerous is root exploit, also known as rootkit [24, 25]....

Journal ArticleDOI
TL;DR: Not only can confidential terms be accurately detected, but sophisticated rephrased confidential contents are also detected during the experiments, while redundant and noise terms are removed.
Abstract: Early data leakage protection methods for smart mobile devices usually focus on confidential terms and their context, which truly prevent some kinds of data leakage events. However, with the high dimensionality and redundancy of text data, it is difficult to detect the documents which contain confidential contents accurately. Our approach updates cluster graph structure based on CBDLP (Data Leakage Protection Based on Context) model by computing the importance of confidential terms and the terms within the range of their context. By applying CBDLP with pruning procedure which has been validated, we further remove the redundancy terms and noise terms. Actually, not only can confidential terms be accurately detected but also the sophisticated rephrased confidential contents are detected during the experiments.

40 citations

References
Journal ArticleDOI
TL;DR: Thematic analysis is a poorly demarcated, rarely acknowledged, yet widely used qualitative analytic method within psychology that offers an accessible and theoretically flexible approach to analysing qualitative data.
Abstract: Thematic analysis is a poorly demarcated, rarely acknowledged, yet widely used qualitative analytic method within psychology. In this paper, we argue that it offers an accessible and theoretically flexible approach to analysing qualitative data. We outline what thematic analysis is, locating it in relation to other qualitative analytic methods that search for themes or patterns, and in relation to different epistemological and ontological positions. We then provide clear guidelines to those wanting to start thematic analysis, or conduct it in a more deliberate and rigorous way, and consider potential pitfalls in conducting thematic analysis. Finally, we outline the disadvantages and advantages of thematic analysis. We conclude by advocating thematic analysis as a useful and flexible method for qualitative research in and beyond psychology.

103,789 citations


"Data exfiltration" refers methods in this paper

  • ...The extracted attack vectors and countermeasures from the primary studies were analysed using qualitative analysis technique, namely thematic analysis [13]....

  • ...We categorize the included attack vectors based on the guidelines of thematic analysis [13]....

  • ...We followed the six-step process developed by Braun and Clarke [13] to produce the results presented in Section 4, 5, and 6....

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
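The map/reduce contract described in this abstract can be illustrated with a minimal single-process word count; the distribution, partitioning, and fault handling that the paper describes are omitted here:

```python
# Illustrative sketch of the MapReduce programming model (single-process
# word count); a real deployment distributes these phases over a cluster.
from collections import defaultdict

def map_fn(doc):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values sharing the same key.
    return word, sum(counts)

def mapreduce(docs):
    intermediate = defaultdict(list)
    for doc in docs:                      # map phase
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v)           # reduce phase
                for k, v in intermediate.items())

print(mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user-visible surface is exactly the two functions; everything else (here a dict of lists, in the real system shuffling across machines) belongs to the runtime.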

20,309 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: A new learning algorithm called ELM is proposed for feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs which tends to provide good generalization performance at extremely fast learning speed.
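The ELM idea — randomly chosen hidden nodes with analytically determined output weights — can be sketched in a few lines of NumPy; this is an illustrative toy, with the hidden-layer size, activation, and task chosen arbitrarily:

```python
# Minimal sketch of an Extreme Learning Machine for a single-hidden-layer
# feedforward network: input weights are random and never trained; output
# weights are solved analytically by least squares.
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=50):
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden-layer outputs
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy regression: learn y = sin(x) on a 1-D grid.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()
W, b, beta = elm_fit(X, y)
mse = np.mean((elm_predict(X, W, b, beta) - y) ** 2)
print(f"training MSE: {mse:.4f}")
```

Because the only "training" is one linear solve, fitting is extremely fast — the property the cited paper emphasizes.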

10,217 citations


"Data exfiltration" refers methods in this paper

  • ...[125] present a network intrusion detection approach that is based on Extreme Learning Machine (ELM) [126, 127]....

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
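As an example of a basic statistical technique in the survey's sense, the following sketch flags points that lie far from the mean in units of standard deviation; the data and threshold are invented for illustration:

```python
# Basic statistical anomaly detection: a point is anomalous if it lies
# more than `threshold` standard deviations from the mean.
import statistics

def zscore_anomalies(values, threshold=3.0):
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# One reading (55.0) departs sharply from an otherwise stable series.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
print(zscore_anomalies(readings, threshold=2.0))
# [55.0]
```

The key assumption the survey highlights for such techniques — that normal data cluster around a statistical summary — is exactly what makes this simple rule work here and fail on multimodal data.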

9,627 citations


"Data exfiltration" refers background in this paper

  • ...Such unexpected pattern or behaviour is referred in different ways such as anomaly, exception, surprise, outliers, aberrations, and peculiarities [7, 123]....

  • ...For example, there are a number of reviews on network anomaly detection [7-11]....

Frequently Asked Questions (13)
Q1. What have the authors contributed in "Data exfiltration: a review of external attack vectors and countermeasures" ?

One of the main targets of cyber-attacks is data exfiltration, which is the leakage of sensitive or private data to an unauthorized entity. This paper is aimed at identifying and critically analysing data exfiltration attack vectors and countermeasures for reporting the state of the art and determining gaps for future research. The authors have followed a structured process for selecting 108 papers from seven publication databases. This review has revealed that (a) most of the state of the art is focussed on preventive and detective countermeasures, and significant research is required on developing investigative countermeasures that are equally important; (b) several data exfiltration countermeasures are not able to respond in real-time, which indicates that research efforts need to be invested to enable them to respond in real-time; (c) a number of data exfiltration countermeasures do not take privacy and ethical concerns into consideration, which may become an obstacle to their full adoption; (d) existing research is primarily focussed on protecting data in the 'in use' state, therefore, future research needs to be directed towards securing data in the 'at rest' and 'in transit' states; (e) there is no standard or framework for the evaluation of data exfiltration countermeasures. Furthermore, the authors have explored the applicability of various countermeasures for different states of data (i.e., in use, in transit, or at rest).

Fig. 14. Future Research Challenges in Defence against Data Exfiltration 

Perhaps the most direct method of data exfiltration for a remote attacker is manipulating a public-facing server into disclosing non-public information, such as through the well-known category of SQL injection attacks. 
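The SQL injection category mentioned above can be illustrated with a small, self-contained sqlite3 example (the table and values are hypothetical): concatenating attacker input into the query text discloses every row, while a parameterized query does not:

```python
# Hypothetical illustration of SQL-injection-based data exfiltration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute(
    "INSERT INTO users VALUES ('alice', 's3cret'), ('bob', 'hunter2')")

malicious = "alice' OR '1'='1"

# Vulnerable: attacker-controlled input is spliced into the SQL text, so
# the injected OR clause matches (and discloses) every row.
leaked = conn.execute(
    f"SELECT secret FROM users WHERE name = '{malicious}'").fetchall()
print("vulnerable query leaked:", leaked)

# Safe: the '?' placeholder binds the input strictly as a literal value,
# so the crafted string matches no row.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (malicious,)).fetchall()
print("parameterized query returned:", safe)
```

Parameterized queries are the standard preventive countermeasure for this attack vector, precisely because the server never interprets user input as SQL.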

Physical attack vectors include those attacks that get unauthorized and illegal physical access to data and move it to a new physical location. 

It is important to enforce authentication and authorization mechanisms for ensuring that only legitimate users with the required credentials can access the data. 

Intelligent and planned outsourcing of data to several clouds seems a good idea for reducing the risk of data leakage in cloud environments; however, organizations may be reluctant to adopt such an approach due to the extra storage cost and the complexity of data management.

They discuss three possible approaches to handling encrypted communications within this system: detecting misuse of the encryption protocols, altering protocols to allow packet payload analysis, and finally statistical approaches, which examine packet sizes and time intervals. 
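The statistical approach mentioned last — examining packet sizes and time intervals rather than payloads — can be sketched as follows; the flows, baseline, and threshold are invented for illustration:

```python
# Sketch: since encrypted payloads cannot be inspected, flows are compared
# on packet-size and inter-arrival statistics instead.
import statistics

def flow_features(packets):
    """packets: list of (timestamp, size) tuples for one flow."""
    sizes = [size for _, size in packets]
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(packets, packets[1:])]
    return statistics.fmean(sizes), statistics.fmean(gaps)

def is_suspicious(packets, baseline_size=600.0, max_ratio=2.0):
    mean_size, _ = flow_features(packets)
    # Flag flows whose average packet size departs far from the baseline,
    # e.g. sustained large outbound packets suggesting bulk exfiltration.
    return mean_size > baseline_size * max_ratio

normal = [(0.0, 550), (0.1, 620), (0.2, 580)]
bulky = [(0.0, 1400), (0.05, 1500), (0.1, 1450)]
print(is_suspicious(normal), is_suspicious(bulky))
# False True
```

A production detector would learn the baseline per host or protocol and combine several such features, but the principle — classifying encrypted flows by their observable statistics — is the same.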

The proposed approach can help prevent a hacker, who has stolen the credentials (username and password) of a user or website admin using an attack vector such as phishing, spyware, or XSS, from accessing personal information residing at rest in a cloud.

Apart from their wide-scale adoption, the authors also believe that it is quite risky not to adopt the encrypted traffic transmission approach because it leaves open the option of data exfiltration via passive monitoring.

The heavy dependency on human experts and hardware devices makes these countermeasures very expensive for enterprises to incorporate.

Due to the unavailability of the required SGX hardware, even the authors could not evaluate the efficiency of the proposed approach. 

The high processing and storage requirements (an 8-core processor and 32 GB of main memory) may hinder the adoption of the proposed approach for ensuring controlled access in a system.

This labelling is done using VirusTotal, which runs around 55 anti-malware engines over the samples to determine whether or not a sample is malware.