GOTCHA! Network-Based Fraud Detection for Social Security Fraud


Submitted to Management Science
manuscript MS-14-00232
GOTCHA! Network-based Fraud Detection for Social
Security Fraud
Dr. Véronique Van Vlasselaer
Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium,
Veronique.VanVlasselaer@kuleuven.be
Prof. Dr. Tina Eliassi-Rad
Department of Computer Science, Rutgers University, Piscataway, NJ, USA, tina@eliassi.org
Prof. Dr. Leman Akoglu
Department of Computer Science, Stony Brook University, Stony Brook, NY, USA, leman@cs.stonybrook.edu
Prof. Dr. Monique Snoeck
Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium, Monique.Snoeck@kuleuven.be
Prof. Dr. Bart Baesens
Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium, Bart.Baesens@kuleuven.be
School of Management, University of Southampton, Highfield, Southampton, SO17 1BJ, United Kingdom
We study the impact of network information for social security fraud detection. In a social security system,
companies have to pay taxes to the government. This study aims to identify those companies that intentionally
go bankrupt in order to avoid paying their taxes. We link companies to each other through their
shared resources, as some resources are the instigators of fraud. We introduce GOTCHA!, a new approach
to defining and extracting features from a time-weighted network, and to exploiting and integrating
network-based and intrinsic features in fraud detection. The GOTCHA! propagation algorithm diffuses fraud
through the network, labeling the unknown and anticipating future fraud whilst simultaneously decaying
the importance of past fraud. We find that domain-driven network variables have a significant impact on
detecting past and future frauds, and improve the baseline by detecting up to 55% additional fraudsters over
time.
Key words : fraud detection, network analysis, bipartite graphs, fraud propagation, guilt-by-association
History : This paper was first submitted on February 5, 2014.
1. Introduction
Fraud detection is a research domain with a wide variety of applications and requirements,
including credit card fraud (Chan and Stolfo 1998, Quah and Sriganesh 2008, Sánchez et al. 2009),
call record fraud (Fawcett and Provost 1997), money laundering (Gao and Ye 2007, Jensen 1997),
insurance fraud (Dionne et al. 2009, Furlan and Bajec 2008, Phua et al. 2004) and
telecommunications fraud (Hilas and Sahalos 2005, Estévez et al. 2006). The aforementioned
problems generally exhibit the same characteristics, but the solution to each problem is
rather domain-specific (Chandola et al. 2009). Data mining techniques, i.e., finding patterns and
anomalies in large amounts of data, have already proven useful in risk evaluation (Baesens et al.
2003a,b), but fraud is an atypical example and requires built-in domain knowledge.
We introduce GOTCHA!, a new, generic, scalable and integrated approach showing how (social)
network analytics can improve the performance of traditional fraud detection tools in a social
security context. We identify five challenges that accompany fraud: fraud is an uncommon,
well-considered, time-evolving, carefully organized and imperceptibly concealed crime that appears
in many different types and forms. Whereas current research fails to integrate all these dimensions
into one encompassing approach, GOTCHA! is the first to address these challenges
together in one high-performance, time-dependent detection technique.
In short, GOTCHA! contributes to the fraud detection domain by proposing a novel approach
to spreading fraud through (i) a time-weighted network, with features extracted from (ii) a
bipartite graph (see below). We exploit dynamic network-based features derived from the direct
neighborhood, and develop a new propagation algorithm that infers an initial exposure score for
each node using the whole network. The exposure score measures the extent to which a node is
influenced by fraudulent nodes. We integrate both intrinsic and network-based features into one
scalable algorithm. We argue that fraud is a time-dependent phenomenon, and as a consequence
GOTCHA! is designed such that a subject's characteristics and fraud probability can change over
time.
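
The idea of a time-weighted network can be illustrated with a minimal sketch. The text above does not state the decay function, so the exponential form, the rate `gamma`, the function name `time_weight` and the example links are all assumptions made purely for illustration.

```python
import math

def time_weight(years_since_link_ended: float, gamma: float = 0.5) -> float:
    """Weight of a company-resource link: 1.0 for a current link, smaller for older ones.

    The exponential form and the value of gamma are illustrative assumptions,
    not values taken from the paper.
    """
    if years_since_link_ended < 0:
        raise ValueError("age must be non-negative")
    return math.exp(-gamma * years_since_link_ended)

# Hypothetical links: (company, resource, years since the link ended; 0 = still active)
links = [("c1", "r1", 0.0), ("c1", "r2", 2.0), ("c2", "r2", 0.0)]

# Edge weights of the time-weighted bipartite network
weighted = {(company, resource): time_weight(age) for company, resource, age in links}
```

Under this assumption a link that ended two years ago still contributes to the network, but with less weight than a link that is active today.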
We test the validity of our approach on a real data set obtained from the Belgian social security
institution, which registers and monitors every active company in Belgium and keeps track of all
resources and their associations with companies.¹ In a social security system, companies have to
pay employer and employee contributions to the government. Fraud occurs when companies intentionally
go bankrupt in order to avoid paying these taxes. Afterwards, a newly founded or existing company
with (partly) the same structure continues the activities of the former company. We
can compare the structures of companies through their resources.
¹ Due to confidentiality issues, we will not elaborate further upon the exact type of resources, but the reader can
understand shared resources in terms of the same address, equipment, buyers, suppliers, employees, etc.

[Figure 1 image omitted. Panel (a): a key company and side companies 1 to 6. Panel (b): the same companies linked to resources A to K; legend: high-risk, low-risk, observable link, unobservable link.]
Figure 1 (a) Example of a spider construction. Company 1 and 4 are fraudulent. Resources are
transferred towards other companies (solid line). The key company organizes the fraudulent setup, but
its links to other companies are hidden (dashed line). (b) Bipartite graph of the spider construction.
Companies are indirectly connected to each other through the resources.
A spider construction is a fraudulent setup with an active exchange of resources between the
companies, i.e., fraudulent companies do not transfer all of their resources to only one other
company, as this might attract too much attention (see Figure 1a). Rather, they distribute their
resources among many companies. Active companies that inherit resources from fraudulent companies
exhibit a high risk of perpetrating fraud themselves. In particular, we distinguish between
key and side companies. The side companies are the perpetrators of the fraud and have an
observable link to each other through shared resources. The core of a spider construction is the
key company, which is responsible for organizing the fraud, setting up many side companies and
pruning away their profits, so that they go bankrupt. However, the key company has unobservable
links, and therefore we can only detect the side companies. The main goal of GOTCHA! is to
exploit the associations between companies and their resources to infer which companies have a
high risk of committing fraud in the future. We believe that network-based knowledge might strongly
improve the standard approaches, which only use intrinsic variables in the detection models.
In order to assess the added value of our approach, we compare GOTCHA! to three baselines:
(1) an intrinsic model, only including intrinsic features; (2) a unipartite model, linking companies
directly together by means of the resources they shared or transferred among each other; (3) a
bipartite model, which starts from the same network representation as our GOTCHA! model,
integrating both companies and resources (see Figure 1b), but is not time-weighted. Our
results show that an optimal mix of intrinsic and time-weighted network-based attributes
contributes to a higher accuracy and more precise output than the baselines. Moreover, it appears
that many companies with a regular (i.e., non-intentional) bankruptcy are also returned and classified
as high risk. This is a strong indication that the developed approach is also able to find those
companies that committed fraud, but were not caught in the past. As a result, we argue that our
approach is suitable for both future and retrospective fraud detection.
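
The difference between the unipartite and bipartite baselines can be made concrete with a small sketch. It projects a bipartite company-resource edge list onto a unipartite company-company network by linking two companies whenever they share a resource. The edge list, the function name `project_to_unipartite` and the choice of counting shared resources as the link weight are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical company-resource associations (edges of the bipartite graph)
bipartite_edges = [
    ("company_1", "resource_A"),
    ("company_2", "resource_A"),
    ("company_2", "resource_B"),
    ("company_3", "resource_B"),
    ("company_4", "resource_C"),
]

def project_to_unipartite(edges):
    """Link two companies directly when they share at least one resource.

    Returns a dict mapping a sorted company pair to the number of shared resources.
    """
    companies_by_resource = defaultdict(set)
    for company, resource in edges:
        companies_by_resource[resource].add(company)

    shared = defaultdict(int)
    for companies in companies_by_resource.values():
        for a, b in combinations(sorted(companies), 2):
            shared[(a, b)] += 1
    return dict(shared)

unipartite = project_to_unipartite(bipartite_edges)
```

The projection no longer records which resource connects two companies, whereas the bipartite representation keeps every resource as a node of its own.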
This paper is organized as follows: Section 2 motivates GOTCHA!'s fraud detection process and
framework, as well as GOTCHA!'s contributions to existing research. Section 3 focuses on how
network analysis is implemented for fraud detection. This section also discusses GOTCHA!'s
propagation algorithm and how domain-driven networked features are defined and extracted from the
network. Section 4 summarizes the modeling approach. Section 5 contains the results of GOTCHA!
on social security fraud data. Section 6 concludes this paper.
2. Social Security Fraud Detection
2.1. Background
The Belgian Social Security Institution is a federal agency that monitors the tax contributions
of every active company in Belgium. These contributions are used to fund the various branches
of social security, such as family allowance funds, unemployment funds, health insurance, holiday
funds, etc. Companies (or, in general terms, employers) need to pay employer and employee
contributions to the government. Some companies, nevertheless, fail to meet their obligations
and file for bankruptcy. Recently, experts found evidence of fraudulent setups through bankruptcy.
In real data, we observe small “webs of fraud”, the so-called spider constructions. A spider
construction consists of (fraudulent) companies that are closely connected to each other through
shared or transferred resources. Resources include address, equipment, buyers, suppliers, employees,
etc. For example, two companies are associated with each other because they operate at the same
location. The data reveal which resource is associated with which company for which specific
time period. We observe that the profits of companies that belong to a fraudulent setup are
often pruned away by a hidden key company (see Figure 1). Consequently, the company becomes
insolvent and files for bankruptcy, leaving the government with unrecoverable debt claims. We
see, however, that their operational resources move towards other currently legitimate or newly
founded companies, e.g., 80% of the resources of the fraudulent company are re-used by a new
or currently legitimate company. Those companies will continue the activities of the fraudulent
company. The transfer (or sharing) of such resources induces the observable structure of spider
constructions. Companies that inherit (many) resources of fraudulent companies exhibit a high risk
of perpetrating fraud in the future as well. Figure 1b shows how (groups of) resources are exchanged
between various companies, transferring knowledge on how to commit fraud (Levin and
Cross 2004) towards legitimate companies. We must note that resource sharing is nevertheless
a normal activity in the corporate environment, complicating the detection process. Although
the exact procedure of resource sharing is confidential, the reader can think in terms of, e.g., the
transfer or sharing of employees, equipment, buyers/suppliers, and addresses taken over by other
employers. The requirements of fraud experts are threefold: (1) curtailing the growth of existing
spider constructions; (2) preventing the development of new spider constructions; and (3) detecting
uncaught spider constructions, i.e., dense subgraphs in the network with many bankruptcies. In
this work, we focus on requirements (1) and (2). Recall that we do not have information to associate
key companies with their side companies. Therefore, we aim to find suspicious side companies.
[Figure 2 image omitted. x-axis: Timestamp (yearly ticks); y-axes: Active Companies (200000 to 230000) and Fraudulent Companies (0.1% to 0.3%); legend: Active, Fraud.]
Figure 2 Overview of the total number of active companies (blue curve) and fraudulent companies
(red curve). The number of active companies is consistently growing. A similar trend can be noticed in
the number of fraudulent companies.
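
Section 2.1 mentions that a large share of a fraudulent company's resources (e.g., 80%) may be re-used by another company. A minimal sketch of how such a share could be measured follows; the function name `reuse_fraction` and the resource identifiers are hypothetical, and the source does not say that the authors compute the share in this way.

```python
def reuse_fraction(bankrupt_resources, successor_resources):
    """Fraction of a bankrupt company's resources that reappear at another company."""
    bankrupt = set(bankrupt_resources)
    if not bankrupt:
        return 0.0
    return len(bankrupt & set(successor_resources)) / len(bankrupt)

# Hypothetical resource identifiers
old_company = {"r1", "r2", "r3", "r4", "r5"}
new_company = {"r1", "r2", "r3", "r4", "r9"}

fraction = reuse_fraction(old_company, new_company)  # 4 of 5 resources are re-used
```

Because resource sharing is also a normal corporate activity, a high fraction on its own would be an indication of risk rather than proof of fraud.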
2.2. Challenges
A first contribution of this research is the investigation and identification of the underlying
reasons why fraud detection cannot be resolved by applying standard data analytics. We identify
five challenges present in most fraud detection problems, and discuss how each challenge can be
addressed. In general, the main challenges that characterize fraud are as follows:

References
Journal Article

Random Forests

TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Journal Article

SMOTE: Synthetic Minority Over-sampling Technique

TL;DR: In this article, a method of over-sampling the minority class that involves creating synthetic minority class examples is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
Frequently Asked Questions (9)
Q1. What are the contributions in "Gotcha! network-based fraud detection for social security fraud" ?

Their future work will elaborate more on active learning, by updating the model using both correctly and incorrectly classified instances. Another topic for future research is community detection, which may find groups of suspicious companies. Although the authors applied their approach to social security fraud detection, their results suggest that the proposed framework can be employed for the detection of other fraud types where the network can be represented as a higher-order graph (n-partite graph).

improves the intrinsic baseline by detecting 31%, 33% and 33% more fraudulent and high-risk cases for the respective timestamps, resulting in a higher precision and recall. 

The iterative propagation procedure for bipartite graphs can then be written as
$$\vec{\xi} = \alpha \cdot Q_{norm} \cdot \vec{\xi} + (1 - \alpha) \cdot \vec{v} \qquad (5)$$
Note that $Q_{norm}$ is a dynamic matrix, representing both present and past relationships.

The adjacency matrix of an undirected bipartite graph is formally written as $A_{n \times m} = (a_{i,j})$, with $a_{i,j} = 1$ if a link between node $i \in V_1$ and node $j \in V_2$ exists, and $a_{i,j} = 0$ otherwise.
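
The definition of the bipartite adjacency matrix can be sketched in a few lines. The company and resource identifiers and the function name `bipartite_adjacency` are hypothetical; companies play the role of V1 (rows) and resources the role of V2 (columns).

```python
def bipartite_adjacency(companies, resources, edges):
    """Build the n x m 0/1 matrix A with A[i][j] = 1 if company i is linked to resource j."""
    row = {company: i for i, company in enumerate(companies)}
    col = {resource: j for j, resource in enumerate(resources)}
    matrix = [[0] * len(resources) for _ in companies]
    for company, resource in edges:
        matrix[row[company]][col[resource]] = 1
    return matrix

companies = ["c1", "c2", "c3"]   # V1, n = 3
resources = ["r1", "r2"]         # V2, m = 2
edges = [("c1", "r1"), ("c2", "r1"), ("c2", "r2"), ("c3", "r2")]

A = bipartite_adjacency(companies, resources, edges)
```

A time-weighted variant would store a weight between 0 and 1 in each cell instead of a plain 0 or 1.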

corresponding matrix representation of size $n \times n$ of a graph, with $n$ being the total number of vertices and $a_{i,j} = 1$ if a link between node $i$ and $j$ exists, and $a_{i,j} = 0$ otherwise.

The authors repeat the process for 100 iterations in order to make sure that potential changes in the final exposure score are only marginal. (Footnote 5: Based on Page et al. (1998), the authors choose α = 0.85.)
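
The propagation rule in Equation (5), with α = 0.85 and 100 iterations, can be sketched as a personalized PageRank-style iteration over all company and resource nodes. Several details are assumptions of this sketch rather than statements from the excerpts above: that Q is the symmetric block matrix built from the bipartite adjacency matrix, that it is column-normalized, and that the restart vector puts equal mass on known fraudulent companies. The function name `propagate_exposure` and the toy graph are hypothetical.

```python
def propagate_exposure(A, fraud_rows, alpha=0.85, iterations=100):
    """Iterate xi = alpha * Q_norm * xi + (1 - alpha) * v on a bipartite graph.

    A is an n x m company-resource matrix with non-negative entries.
    Assumptions of this sketch: Q = [[0, A], [A^T, 0]] is column-normalized,
    and v spreads the restart mass evenly over the known fraudulent companies.
    """
    n, m = len(A), len(A[0])
    size = n + m
    Q = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            Q[i][n + j] = Q[n + j][i] = float(A[i][j])

    # Column-normalize so that every non-empty column sums to 1
    col_sums = [sum(Q[r][c] for r in range(size)) for c in range(size)]
    Q_norm = [[Q[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(size)]
              for r in range(size)]

    # Restart vector: mass only on known fraudulent companies
    v = [0.0] * size
    for i in fraud_rows:
        v[i] = 1.0 / len(fraud_rows)

    xi = v[:]
    for _ in range(iterations):
        xi = [alpha * sum(Q_norm[r][c] * xi[c] for c in range(size)) + (1 - alpha) * v[r]
              for r in range(size)]
    return xi[:n], xi[n:]  # exposure scores of companies, then of resources

# Toy chain c1 - r1 - c2 - r2 - c3, with c1 known to be fraudulent
A = [[1, 0], [1, 1], [0, 1]]
company_scores, resource_scores = propagate_exposure(A, fraud_rows=[0])
```

In this toy chain the exposure score decreases with the distance from the fraudulent company, which matches the guilt-by-association intuition described in the text.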

Those variables include the degree, triangles and propagated exposure score (see Section 3.4 for details), and can be classified as direct and indirect network variables depending on whether they are derived from the direct neighborhood or take into account the full network structure.