What future works have the authors mentioned in the paper "Gotcha! network-based fraud detection for social security fraud" ?

Their future work will elaborate more on active learning, by updating the model using both correctly and incorrectly classified instances. Another topic for future research is community detection, which may find groups of suspicious companies. Although the authors applied their approach to social security fraud detection, their results suggest that the proposed framework can also be employed to detect other fraud types where the network can be represented as a higher-order graph (n-partite graph).

How does GOTCHA improve the intrinsic baseline?

GOTCHA! improves the intrinsic baseline by detecting 31%, 33% and 33% more fraudulent and high-risk cases for the respective timestamps, resulting in a higher precision and recall.

What is the iterative propagation procedure for bipartite graphs?

The iterative propagation procedure for bipartite graphs can then be written as

$$\vec{\xi} = \alpha \cdot Q_{norm} \cdot \vec{\xi} + (1 - \alpha) \cdot \vec{v} \qquad (5)$$

Note that $Q_{norm}$ is a dynamic matrix, representing both present and past relationships.

What is the adjacency matrix of a bipartite graph?

The adjacency matrix of an undirected bipartite graph is formally written as $A_{n \times m} = (a_{i,j})$, with $a_{i,j} = 1$ if a link between node $i \in V_1$ and node $j \in V_2$ exists, and $a_{i,j} = 0$ otherwise.
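As a minimal sketch of this definition, the snippet below builds the $n \times m$ matrix from a list of (company, resource) links. The function name `bipartite_adjacency` and the toy data are hypothetical and only illustrate the definition; they do not come from the paper.

```python
import numpy as np

def bipartite_adjacency(n_companies, n_resources, links):
    """Return the n x m matrix A with A[i, j] = 1 if company i is linked to resource j."""
    A = np.zeros((n_companies, n_resources), dtype=int)
    for i, j in links:
        A[i, j] = 1
    return A

# Hypothetical toy network: 3 companies (V1) and 2 resources (V2).
links = [(0, 0), (1, 0), (1, 1), (2, 1)]
A = bipartite_adjacency(3, 2, links)
```

Here companies 0 and 1 share resource 0, and companies 1 and 2 share resource 1, so every link runs between the two node sets and never within one.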

How will the future work elaborate on active learning?

Their future work will elaborate more on active learning, by updating the model using both correctly and incorrectly classified instances.

What is the corresponding matrix representation of size of a graph?

The adjacency matrix is the corresponding matrix representation of size $n \times n$ of a graph, with $n$ being the total number of vertices and $a_{i,j} = 1$ if a link between node $i$ and $j$ exists, and $a_{i,j} = 0$ otherwise.

How many iterations of the process are needed to make sure that the final exposure score is?

The authors repeat the process for 100 iterations in order to make sure that potential changes in the final exposure score are only marginal. Based on Page et al. (1998), the authors choose α = 0.85.
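The iteration can be sketched as follows, using the update from equation (5) with α = 0.85 and 100 iterations. This is an illustration under stated assumptions, not the authors' implementation: the function name `propagate_exposure`, the column normalisation of the block matrix, and the static unweighted links are all assumptions, and the time weighting that makes the paper's matrix dynamic is left out.

```python
import numpy as np

def propagate_exposure(A, v, alpha=0.85, n_iter=100):
    """Iterate xi = alpha * Q_norm @ xi + (1 - alpha) * v on a bipartite graph.

    A : (n, m) company-by-resource adjacency matrix (assumed static here).
    v : length n + m restart vector, e.g. mass on known fraudulent companies.
    """
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    # Symmetric (n + m) x (n + m) block matrix over companies and resources.
    Q = np.zeros((n + m, n + m))
    Q[:n, n:] = A
    Q[n:, :n] = A.T
    # Assumed normalisation: each column sums to 1; isolated nodes keep a zero column.
    col_sums = Q.sum(axis=0)
    Q_norm = np.divide(Q, col_sums, out=np.zeros_like(Q), where=col_sums > 0)
    v = np.asarray(v, dtype=float)
    v = v / v.sum()
    xi = v.copy()
    for _ in range(n_iter):
        xi = alpha * Q_norm @ xi + (1 - alpha) * v
    return xi

# Hypothetical example: company 0 is the only known fraudulent node.
A = [[1, 0], [1, 1], [0, 1]]
xi = propagate_exposure(A, v=[1, 0, 0, 0, 0])
```

In this toy network the scores sum to one, and company 0 (the seed) ends up with a higher exposure score than company 2, which is only reachable through two resources and company 1.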

What are the types of variables that can be classified as direct and indirect?

Those variables include the degree, triangles and propagated exposure score (see Section 3.4 for details), and can be classified as direct and indirect network variables depending on whether they are derived from the direct neighborhood or take into account the full network structure.
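To make the two direct variables concrete, the sketch below computes the degree and the number of triangles per node with the standard graph-theoretic formulas. It assumes a simple undirected graph given as a symmetric 0/1 matrix `S` with a zero diagonal; the paper's own, domain-specific definitions are in its Section 3.4 and may differ. The name `degree_and_triangles` is hypothetical.

```python
import numpy as np

def degree_and_triangles(S):
    """Per-node degree and triangle count for a simple undirected graph."""
    S = np.asarray(S, dtype=int)
    degree = S.sum(axis=1)
    # diag(S^3)[i] counts closed walks of length 3 starting at i;
    # every triangle through i is counted twice (once per direction).
    triangles = np.diag(np.linalg.matrix_power(S, 3)) // 2
    return degree, triangles

# Hypothetical graph: nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
S = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
degree, triangles = degree_and_triangles(S)
```

Both quantities depend only on a node's immediate neighborhood, which is what distinguishes them from an indirect variable such as the propagated exposure score.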

(Open Access) GOTCHA! Network-Based Fraud Detection for Social Security Fraud (2017) | Véronique Van Vlasselaer

Submitted to Management Science

manuscript MS-14-00232


GOTCHA! Network-based Fraud Detection for Social Security Fraud

Dr. Véronique Van Vlasselaer

Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium,

Veronique.VanVlasselaer@kuleuven.be

Prof. Dr. Tina Eliassi-Rad

Department of Computer Science, Rutgers University, Piscataway, NJ, USA, tina@eliassi.org

Prof. Dr. Leman Akoglu

Department of Computer Science, Stony Brook University, Stony Brook, NY, USA, leman@cs.stonybrook.edu

Prof. Dr. Monique Snoeck

Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium, Monique.Snoeck@kuleuven.be

Prof. Dr. Bart Baesens

Department of Decision Sciences and Information Management, KU Leuven, Leuven, Belgium, Bart.Baesens@kuleuven.be

School of Management, University of Southampton, Highfield, Southampton, SO17 1BJ, United Kingdom

We study the impact of network information for social security fraud detection. In a social security system, companies have to pay taxes to the government. This study aims to identify those companies that intentionally go bankrupt in order to avoid contributing their taxes. We link companies to each other through their shared resources, as some resources are the instigators of fraud. We introduce GOTCHA!, a new approach on how to define and extract features from a time-weighted network, and how to exploit and integrate network-based and intrinsic features in fraud detection. The GOTCHA! propagation algorithm diffuses fraud through the network, labeling the unknown and anticipating future fraud whilst simultaneously decaying the importance of past fraud. We find that domain-driven network variables have a significant impact on detecting past and future frauds, and improve the baseline by detecting up to 55% additional fraudsters over time.

Key words: fraud detection, network analysis, bipartite graphs, fraud propagation, guilt-by-association

History: This paper was first submitted on February 5, 2014.

1. Introduction

Fraud detection is a research domain with a wide variety of different applications and different requirements, including credit card fraud (Chan and Stolfo 1998, Quah and Sriganesh 2008, Sánchez et al. 2009), call record fraud (Fawcett and Provost 1997), money laundering (Gao and Ye 2007, Jensen 1997), insurance fraud (Dionne et al. 2009, Furlan and Bajec 2008, Phua et al. 2004) and telecommunications fraud (Hilas and Sahalos 2005, Estévez et al. 2006). The aforementioned problems generally exhibit the same characteristics, but the solution to each problem is rather domain-specific (Chandola et al. 2009). Data mining techniques, i.e., finding patterns and anomalies in large amounts of data, have already proven useful in risk evaluation (Baesens et al. 2003a,b), but fraud is an atypical example and requires built-in domain knowledge.

We introduce GOTCHA!, a new, generic, scalable and integrated approach on how (social) network analytics can improve the performance of traditional fraud detection tools in a social security context. We identify five challenges that concur with fraud. That is, fraud is an uncommon, well-considered, time-evolving, carefully organized and imperceptibly concealed crime that appears in many different types and forms. Whereas current research fails to integrate all these dimensions into one encompassing approach, GOTCHA! is the first to address each of these challenges together in one high-performance, time-dependent detection technique.

In short, GOTCHA! contributes to the fraud detection domain by proposing a novel approach on how to spread fraud through a (i) time-weighted network and features extracted from a (ii) bipartite graph (cf. infra). We exploit dynamic network-based features derived from the direct neighborhood, and develop a new propagation algorithm that infers an initial exposure score for each node using the whole network. The exposure score measures the extent to which a node is influenced by fraudulent nodes. We integrate both intrinsic and network-based features into one scalable algorithm. We argue that fraud is a time-dependent phenomenon, and as a consequence GOTCHA! is designed such that a subject's characteristics and fraud probability can change over time.

We test the validity of our approach on a real data set obtained from the Belgian social security institution, which registers and monitors every active company in Belgium and keeps track of all resources, and their associations with companies. In a social security system, companies have to pay employer and employee contributions to the government. Fraud occurs when companies intentionally go bankrupt in order to avoid paying these taxes. A new/existing company with (partly) the same structure is founded afterwards and continues the activities of the former company. We can compare the structures of companies through their resources.

Due to confidentiality issues, we will not elaborate further upon the exact type of resources, but the reader can understand shared resources in terms of the same address, equipment, buyers, suppliers, employees, etc.

Figure 1 (a) Example of a spider construction with a key company and side companies 1 to 6. Company 1 and 4 are fraudulent. Resources are transferred towards other companies (solid line). The key company organizes the fraudulent setup, but its links to other companies are hidden (dashed line). (b) Bipartite graph of the spider construction. Companies are indirectly connected to each other through the resources. (Legend: high-risk and low-risk companies; observable and unobservable links; resources.)

A spider construction is a fraudulent setup with an active exchange of resources between the companies, i.e., fraudulent companies do not transfer all of their resources to only one other company as this might attract too much attention (see Figure 1a). They rather distribute their resources among many companies. Active companies that inherit resources from fraudulent companies exhibit a high risk of perpetrating fraud themselves. In particular, we distinguish between the key and side companies. The side companies are the perpetrators of the fraud and have an observable link to each other through shared resources. The core of a spider construction is the key company, which is responsible for organizing the fraud, setting up many side companies and pruning away their profits, so that they go bankrupt. However, the key company has unobservable links, and therefore we can only detect the side companies. The main goal of GOTCHA! is to exploit the associations between companies and their resources to infer which companies have a high risk of committing fraud in the future. We believe that network-based knowledge might strongly improve the standard approaches, which only use intrinsic variables in the detection models.

In order to assess the added value of our approach, we compare GOTCHA! to three baselines: (1) an intrinsic model, only including intrinsic features; (2) a unipartite model, linking companies directly together by means of the resources they shared or transferred among each other; (3) a bipartite model, which starts from the same network representation as our GOTCHA! model, integrating both companies and resources (see Figure 1b). Yet, the model is not time-weighted. Our results show that an optimal mix between intrinsic and time-weighted network-based attributes contributes to a higher accuracy and more precise output than the baselines. Moreover, it appears that many regular (i.e., non-intentional) bankruptcy companies are also output and classified as high risk. This is a strong indication that the developed approach is also able to find those companies that committed fraud, but were not caught in the past. As a result, we argue that our approach is suitable for both future and retrospective fraud detection.
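The difference between the unipartite and bipartite baselines can be illustrated with a one-mode projection: given a company-by-resource matrix, two companies are linked directly whenever they share at least one resource. This is a generic sketch of that projection, not the construction used in the paper; the function name `project_to_companies` and the toy matrix are hypothetical.

```python
import numpy as np

def project_to_companies(A):
    """Project a company-by-resource matrix onto a company-by-company matrix.

    The entry [i, k] of the result is the number of resources that
    companies i and k have in common.
    """
    A = np.asarray(A, dtype=int)
    shared = A @ A.T
    np.fill_diagonal(shared, 0)  # a company is not linked to itself
    return shared

# Hypothetical bipartite network: 3 companies, 2 resources.
A = [[1, 0],
     [1, 1],
     [0, 1]]
shared = project_to_companies(A)
```

The projection keeps the company-to-company links but discards which resource created each link, whereas the bipartite representation retains the resources as nodes in their own right.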

This paper is organized as follows: Section 2 motivates GOTCHA!'s fraud detection process and framework, as well as GOTCHA!'s contributions to existing research. Section 3 focuses on how network analysis is implemented for fraud detection. This section also discusses GOTCHA!'s propagation algorithm and how domain-driven networked features are defined and extracted from the network. Section 4 summarizes the modeling approach. Section 5 contains the results of GOTCHA! on social security fraud data. Section 6 concludes this paper.

2. Social Security Fraud Detection

2.1. Background

The Belgian Social Security Institution is a federal agency that monitors the tax contributions of every active company in Belgium. These contributions are used to fund the various branches in social security, such as family allowance funds, unemployment funds, health insurance, holiday funds, etc. Companies, or in general terms the employers, need to pay employer and employee contributions to the government. Some companies, nevertheless, fail to redeem their obligations and file for bankruptcy. Recently, experts found evidence of fraudulent setups through bankruptcy.

In real data, we observe small “webs of fraud”, the so-called spider constructions. A spider construction consists of (fraudulent) companies that are closely connected to each other through shared or transferred resources. Resources include address, equipment, buyers, suppliers, employees, etc. For example, two companies are associated with each other because they operate at the same location. The data reveals which resource is associated with which company for which specific time period. We observe that the profits of companies that belong to a fraudulent setup are often pruned away by a hidden key company (see Figure 1). Consequently, the company becomes insolvent and files for bankruptcy, leaving the government with unrecoverable debt claims. We see, however, that their operational resources move towards other currently legitimate or newly founded companies, e.g., 80% of the resources of the fraudulent company are re-used by a new or currently legitimate company. Those companies will continue the activities of the fraudulent company. The transfer (or sharing) of such resources induces the observable structure of spider constructions. Companies that inherit (many) resources of fraudulent companies exhibit a high risk of perpetrating fraud in the future as well. Figure 1b shows how (groups of) resources are exchanged between various companies, transferring fraudulent knowledge on how to commit fraud (Levin and Cross 2004) towards legitimate companies. We must note that resource sharing is nevertheless a normal activity in the corporate environment, complicating the detection process. Although the exact procedure of resource sharing is confidential, the reader can think in terms of e.g., the transfer or sharing of employees, equipment, buyers/suppliers, and addresses taken over by other employers, etc. The requirements of fraud experts are threefold: (1) curtailing the growth of existing spider constructions; (2) preventing the development of new spider constructions; and (3) detecting uncaught spider constructions, i.e., dense subgraphs in the network with many bankruptcies. In this work, we focus on requirements (1) and (2). Recall that we do not have information to associate key companies with their side companies. Therefore, we aim to find suspicious side companies.

Figure 2 Overview of the total number of active companies (blue curve) and fraudulent companies (red curve). The number of active companies is consistently growing. A similar trend can be noticed in the number of fraudulent companies. (Axes: timestamp from Year t−4 to Year t; active companies, 200,000 to 230,000; fraudulent companies, 0.1% to 0.3%.)

2.2. Challenges

A first contribution of this research is the investigation and identification of the underlying reasons why fraud detection cannot be resolved by applying standard data analytics. We identify five challenges present in most fraud detection problems, and discuss how each challenge can be addressed. In general, the main challenges that characterize fraud are as follows:


References

Random Forests

The Elements of Statistical Learning

SMOTE: synthetic minority over-sampling technique

The PageRank Citation Ranking : Bringing Order to the Web

