Credit Card Fraud Detection with Autoencoders


Learn how to detect credit card fraud with autoencoders in this tutorial by Yuxi (Hayden) Liu, an applied
research scientist focused on developing machine learning and deep learning models and systems for given
learning tasks, and Pablo Maldonado, an applied mathematician and data scientist with a taste for software
development since his days of programming BASIC on a Tandy 1000.
Fraud is a multi-billion dollar industry, and credit card fraud is probably the form closest to your daily life.
Fraud begins with the theft of the physical credit card, or of data that could compromise the security of
the account, such as the credit card number, expiration date, and security codes.
Traditionally, fraud detection systems rely on the creation of manually engineered features by
subject matter experts, working either directly with financial institutions or with specialized software vendors.
One of the biggest challenges in fraud detection is the availability of labeled datasets, which are often
hard or even impossible to come by.
The first fraud example comes from a dataset made public on Kaggle
(https://www.kaggle.com/dalpozz/creditcardfraud) by researchers from the Université Libre de Bruxelles
in Belgium. The dataset contains transactions made by credit cards over two days in September 2013 by
European cardholders. There are 492 frauds out of 284,807 transactions. Unlike toy datasets, real-life
datasets are highly unbalanced: in this example, the positive class (frauds) accounts for 0.172% of all
transactions.
The dataset contains only numerical input variables, which are the result of a PCA transformation.
Due to confidentiality issues, the authors cannot provide the original features or more background information
about the data. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that
have not been transformed with PCA are Time and Amount.
The Time feature contains the seconds elapsed between each transaction and the first transaction in
the dataset. The Amount feature is the transaction's amount; this feature can be used for example-dependent,
cost-sensitive learning. The Class feature is the response variable, and it takes the value 1 in case of fraud and 0
otherwise.
Given the class imbalance ratio, the authors recommend measuring the area under the precision-recall
curve (AUPRC) instead of relying on confusion-matrix accuracy. Note that the precision-recall curve is not the
same as the ROC (receiver operating characteristic) curve, although both are commonly used for imbalanced problems.
At this point, you might be thinking: why bother with autoencoders, since this is clearly
a binary classification problem and you already have the labeled data? Sure, you can go the traditional
way and try standard supervised learning algorithms, such as random forests or support vector machines,
but you need to be careful about oversampling the fraud class or undersampling the normal class so that these
methods perform well.
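As an aside, naive undersampling can be as simple as keeping all fraud rows plus an equally sized random
sample of normal rows. The following is a minimal base-R sketch, assuming df is the data frame read in the
Exploratory data analysis section below; it is an illustration only, not the approach used in this tutorial:
# Minimal undersampling sketch (illustration only; assumes df as read below)
fraud_idx  <- which(df$Class == 1)
normal_idx <- sample(which(df$Class == 0), size = length(fraud_idx))
balanced   <- df[c(fraud_idx, normal_idx), ]
table(balanced$Class)  # should now be roughly 50/50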
However, in many real-life instances, you do not have labeled data beforehand, and in complex
fraud scenarios it might be very tricky to get an accurate label.
Before the fraud (or even after it), you may have completely normal activity in your account.
So should you flag all of your transactions as rogue, or only a certain subset? Some people in the
business may argue that since the transactions were committed by a criminal, they are all tainted
somehow, and your whole activity should be flagged, which introduces bias into the model. Instead of relying
on the label, you can treat the problem as an anomaly detection (outlier detection) problem and use autoencoders.
Exploratory data analysis
An often overlooked step is exploratory data analysis. Before jumping straight into fancy deep learning
architectures, step back and look at the data you have.
Start by downloading the dataset from Kaggle: (https://www.kaggle.com/dalpozz/creditcardfraud) and
importing it into R. Note that you can find all the code files related to this article
at https://github.com/PacktPublishing/R-Deep-Learning-Projects/tree/master/Chapter03.
df <- read.csv("./data/creditcard.csv", stringsAsFactors = F)
head(df)
Before moving on, you should do a basic sanity check; a short sketch of these checks follows the list. Some of the things you should look for are:
  • Verifying that there are indeed only two classes (0 for normal transactions, 1 for fraudulent)
  • Verifying that the timestamp corresponds to two days
  • Checking that there are no missing values
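A minimal sketch of these checks, assuming the column names from the Kaggle file (Time, Amount, Class, V1-V28):
# Basic sanity checks (minimal sketch)
table(df$Class)                # only 0 and 1 expected
max(df$Time) / (60 * 60 * 24)  # should be close to 2 (days)
sum(is.na(df))                 # should be 0 (no missing values)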
Once this is done, you can perform two quick checks. One idea would be to see whether there is an obvious pattern
between the time of day and the amount; perhaps fraudulent transactions happen at certain times, when
your system is vulnerable. You should check this first:
library(ggplot2)
library(dplyr)
df %>% ggplot(aes(Time, Amount)) + geom_point() + facet_grid(Class ~ .)
This plots the time variable against the amount, per class, so you can see whether there is any seasonality pattern:
https://www.packtpub.com/graphics/9781788478403/graphics/e299c09e-d9a3-4ea5-89c6-671519e7fbbe.png
Quick inspection for fraud: class 0 corresponds to normal transactions and class 1 to fraudulent transactions.
Nothing jumps out. Interestingly, the amount involved in fraudulent transactions is much lower than
in normal transactions. This suggests that you should filter the transactions and look at them on the
right scale. For this, use dplyr to filter out transactions above 300 and focus on the smaller ones:
df$Class <- as.factor(df$Class)
df %>% filter(Amount < 300) %>% ggplot(aes(Class, Amount)) + geom_violin()
How does the distribution look by class? The following plot will tell you:
https://www.packtpub.com/graphics/9781788478403/graphics/f7373263-fe02-4ebd-b7e6-7f2ff8be8549.png
First insight into the data: amounts in fraudulent transactions are more likely to be around 100 than
amounts in non-fraudulent transactions.
Aha! So you have your first insight into the data! Fraudulent transactions, although much smaller in amount,
are anomalously concentrated around 100. This might be part of the fraudster's strategy: instead of stealing
large amounts at regular times, they hide small amounts more or less uniformly in time.
Sure, this was fun to find out, but it is definitely not a scalable approach and requires domain knowledge
and intuition. It is time to try something more sophisticated.
The autoencoder approach – Keras
OK, time to get into Keras. You should set aside a small fraction of the data to use as a validation or test set,
and develop the model on the remainder. There is no gold standard as to how this should be done.
For this example, you can use a 10% test set and a 90% training set:
# Hold out 10% of the rows as a test set
idxs <- sample(nrow(df), size = 0.1 * nrow(df))
train <- df[-idxs, ]
test <- df[idxs, ]
y_train <- train$Class
y_test <- test$Class
# Remove the Time and Class columns from the features
X_train <- train %>% select(-one_of(c("Time", "Class")))
X_test <- test %>% select(-one_of(c("Time", "Class")))
# Coerce the data frames to matrices to perform the training
X_train <- as.matrix(X_train)
X_test <- as.matrix(X_test)
Notice that the Class and Time columns are excluded. You ignore the label and treat the fraud detection
problem as an unsupervised learning problem, which is why the label column is removed from the training
data.
As for the temporal information, there does not seem to be an obvious time trend. Furthermore, in real-life
fraud detection scenarios, you would typically be more concerned with intrinsic properties of the fraudster,
for instance, the device used, geolocation information, or data from the CRM system, as well as account
properties (balance, average transaction volume, and so on).
For the architecture of the autoencoder, instead of using one intermediate layer as before, you can now
use a stacked autoencoder. A stacked autoencoder is nothing more than several layers of encoders,
followed by layers of decoders.
In this case, you'll use a network with outer encoder and decoder layers of 14 fully connected neurons each,
and three inner layers of 7 neurons (a 14-7-7-7-14 stack). You can experiment with different
architectures and compare results; there is no universally correct architecture for autoencoders,
and the choice relies on experience and on diagnosing your model via validation plots and other metrics.
Your input (and output) dimension is 29 in each case. The code to construct the autoencoder is as follows:
library(keras)
input_dim <- 29
outer_layer_dim <- 14
inner_layer_dim <- 7
input_layer <- layer_input(shape = c(input_dim))
encoder <- layer_dense(units = outer_layer_dim, activation = 'relu')(input_layer)
encoder <- layer_dense(units = inner_layer_dim, activation = 'relu')(encoder)
decoder <- layer_dense(units = inner_layer_dim)(encoder)
decoder <- layer_dense(units = inner_layer_dim)(decoder)
decoder <- layer_dense(units = outer_layer_dim)(decoder)
decoder <- layer_dense(units = input_dim)(decoder)
autoencoder <- keras_model(inputs = input_layer, outputs = decoder)
You can look at your work to check if everything is correct:
autoencoder
Model
_________________________________________________________________________________
Layer (type)                     Output Shape                  Param #
=================================================================================
input_5 (InputLayer)             (None, 29)                    0
_________________________________________________________________________________
dense_17 (Dense)                 (None, 14)                    420
_________________________________________________________________________________
dense_18 (Dense)                 (None, 7)                     105
_________________________________________________________________________________
dense_22 (Dense)                 (None, 7)                     56
_________________________________________________________________________________
dense_23 (Dense)                 (None, 7)                     56
_________________________________________________________________________________
dense_24 (Dense)                 (None, 14)                    112
_________________________________________________________________________________
dense_25 (Dense)                 (None, 29)                    435
=================================================================================
Total params: 1,184
Trainable params: 1,184
Non-trainable params: 0
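Each Param # entry follows the usual dense-layer count of inputs × units + units; for example, the first hidden layer has 29 × 14 + 14 = 420 parameters, and the whole stack sums to 1,184.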
You are now ready to begin your training. First compile the model and then fit it:
autoencoder %>% compile(optimizer='adam',
                       loss='mean_squared_error',
                       metrics=c('accuracy'))
history <- autoencoder %>% fit(
  X_train, X_train,
  epochs = 10, batch_size = 32,
  validation_split = 0.2
)
plot(history)
The results are as follows. You can see that there is an increase in accuracy as the number of epochs increases:
https://www.packtpub.com/graphics/9781788478403/graphics/3ae958d5-9d8f-49ed-b5cf-676ed5af7c63.png
Diagnostic plots for your 14-7-7-7-14 architecture
Once you have the autoencoder ready, use it to reconstruct the test set:
# Reconstruct on the test set
preds <- autoencoder %>% predict(X_test)
preds <- as.data.frame(preds)
Now look for anomalously large reconstruction errors, as before, to flag as unusual. For instance, divide each
point's squared reconstruction error by a threshold of 30 and cap the resulting score at 1, so that any point
whose error is 30 or larger gets the maximum score:
y_preds <- ifelse(rowSums((preds - X_test)**2)/30 < 1, rowSums((preds - X_test)**2)/30, 1)
Again, this threshold is not set in stone; with held-out data from your particular application, you can
fine-tune it and find the most suitable value for your problem.
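One possible way to pick such a cutoff, not prescribed by the original text, is to look at high quantiles of the
reconstruction error on the (mostly normal) training data; the following is a minimal sketch:
# Minimal sketch (an assumption, not the book's method): inspect high quantiles
# of the training reconstruction error to guide the choice of threshold
train_preds <- autoencoder %>% predict(X_train)
train_mse <- rowSums((train_preds - X_train)^2)
quantile(train_mse, probs = c(0.90, 0.99, 0.999))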
Finally, generate the ROC curve to see if your model is performing correctly:
library(ROCR)
pred <- prediction(y_preds, y_test)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
You can see that the results are satisfactory. Your curve looks quite straight, and the reason for that is that
the output of your model is essentially binary (the score saturates at the threshold), as are your original
labels. When a model outputs class probabilities, or a proxy for them, the curve is smoother:
https://www.packtpub.com/graphics/9781788478403/graphics/87e497ec-6678-4287-bfd6-0ebe1ddcfe89.png
ROC curve: it looks quite straight since the outputs of the model are not class probabilities, but essentially binary scores
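Since the dataset's authors recommend the precision-recall curve for such an imbalanced problem, you can also
derive it from the same ROCR prediction object; this is a minimal sketch using standard ROCR measures:
# Precision-recall curve and AUC from the same ROCR prediction object (sketch)
pr_perf <- performance(pred, measure = "prec", x.measure = "rec")
plot(pr_perf)
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc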
Fraud detection with H2O
Now try a slightly different tool that might help you in real-life deployments. It is often useful to try different
tools in the ever-growing data science landscape, if only for sanity-check purposes.
H2O is open source software for big data analytics. The start-up behind it, H2O.ai, boasts
some top researchers in mathematical optimization and statistical learning theory on its advisory board.
It runs in standard environments (Linux/Mac/Windows) as well as on big data systems and cloud computing
environments.
You can run H2O in R, but you need to install the package first:
install.packages("h2o")
Once this is done, you can load the library:
library(h2o)
You will then see a welcome message, among some warnings about objects that are masked from other packages:
    Your next step is to start H2O:
        > h2o.init()
    For H2O package documentation, ask for help:
        > ??h2o
    After starting H2O, you can use the Web UI at http://localhost:54321
    For more information visit http://docs.h2o.ai
So do this and then you'll be ready for work:
h2o.init()
Now you need to read your data into H2O. As the computations work somewhat differently, you cannot
use the vanilla data frame structure from R, so read the file as usual and then coerce it:
df <- read.csv("./data/creditcard.csv", stringsAsFactors = F)
df <- as.h2o(df)
Alternatively, you can read it with the h2o.uploadFile function:
df2 <- h2o.uploadFile("./data/creditcard.csv")
Either way, the resulting structure is no longer a regular R data frame but an H2OFrame (internally represented as an environment).
Now leave aside one portion of the data for training and one for testing, as usual. In h2o, you can use
the h2o.splitFrame function:
splits <- h2o.splitFrame(df, ratios=c(0.8), seed=1)
train <- splits[[1]]
test <- splits[[2]]
Now separate the features from the label, which will be useful in a minute:
label <- "Class"
features <- setdiff(colnames(train), label)
You are ready to start the training of your autoencoder:
autoencoder <- h2o.deeplearning(x = features,
                                training_frame = train,
                                autoencoder = TRUE,
                                seed = 1,
                                hidden = c(10, 2, 10),
                                epochs = 10,
                                activation = "Tanh")
Some comments are in order. The autoencoder parameter is set to TRUE, as you would expect. You use a
slightly different architecture this time (10-2-10), just for illustration purposes; you can see the structure
of the layers in the hidden parameter. You also use a different activation function: in practice, it is sometimes
useful to use bounded activation functions such as tanh instead of ReLU, which can be numerically unstable.
You can generate the reconstructions in a similar way to what you did with Keras:
# Use the predict function as before
preds <- h2o.predict(autoencoder, test)
You’ll get something like this:
> head(preds)
reconstr_Time reconstr_V1 reconstr_V2 reconstr_V3 reconstr_V4 reconstr_V5 reconstr_V6 reconstr_V7
1 380.1466 -0.3041237 0.2373746 1.617792 0.1876353 -0.7355559 0.3570959 -0.1331038
2 1446.0211 -0.2568674 0.2218221 1.581772 0.2254702 -0.6452812 0.4204379 -0.1337738
3 1912.0357 -0.2589679 0.2212748 1.578886 0.2171786 -0.6604871 0.4070894 -0.1352975
4 1134.1723 -0.3319681 0.2431342 1.626862 0.1473913 -0.8192215 0.2911475 -0.1369512
5 1123.6757 -0.3194054 0.2397288 1.619868 0.1612631 -0.7887480 0.3140728 -0.1362253
6 1004.4545 -0.3589335 0.2508191 1.643208 0.1196120 -0.8811920 0.2451117 -0.1380364
From here on, you can proceed as before. However, h2o has a built-in function, h2o.anomaly, that simplifies
part of the work.
Another simplification is that, instead of importing ggplot2 and dplyr separately, you can import the
tidyverse package, which brings these (and other) data manipulation packages into your environment.
You can call h2o.anomaly and do a bit of formatting to turn the row names into a column of their own, as
well as adding the label for the real class:
library(tidyverse)
anomaly <- h2o.anomaly(autoencoder, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))
Now calculate the average reconstruction mean squared error per class:
# Type coercion useful for plotting later
anomaly$Class <- as.factor(anomaly$Class)
mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))
Finally, visualize your test data according to the reconstruction error.
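The original plotting code is not reproduced here; the following is a minimal sketch of one way to draw it,
assuming the anomaly and mean_mse objects built above: a scatter of the reconstruction MSE per test instance,
colored by the true class, with the per-class mean errors as horizontal lines.
# Minimal plotting sketch (assumes the anomaly and mean_mse objects above)
ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = Class)) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  labs(x = "instance", y = "reconstruction MSE")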
You can see that the autoencoder does a fair job. A good proportion of fraud cases have a relatively high
reconstruction error, although it is far from perfect. How could you improve it?
https://www.packtpub.com/graphics/9781788478403/graphics/155a4c55-f220-4dac-abdc-03e5a4d1e2b9.png
Results from your architecture using H2O: the autoencoder does a fair job of flagging the fraud cases, but it
could still be improved.
If you found this article interesting, you can explore R Deep Learning Projects to have a better understanding
of deep learning concepts and techniques and how to use them in a practical setting. This book demonstrates
end-to-end implementations of five real-world projects on popular topics in deep learning such as handwritten
digit recognition, traffic light detection, fraud detection, text generation, and sentiment analysis.
