Analyzing credit card fraud with machine learning

Fall 2021 Capstone Project by Jeffrey Romero


All project files are available for viewing/downloading on my GitHub repository.


Credit card fraud can target any person at any time. However, machine learning can make fraudulent credit card transactions much easier to spot. The goal of this project is to create a machine learning model that predicts whether a credit card transaction is fraudulent. The input data used to train the model was downloaded from Kaggle.


A machine learning model can only predict fraudulent transactions from properly formatted input data. The input data goes through a series of steps before it is used to train a fraud prediction model:

  • Detect and fill in missing values
  • Encode values to make them program-readable
  • Fix the imbalance between fraudulent and non-fraudulent outcomes
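The first two steps in the list above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the toy table, its column names (`amount`, `merchant_type`, `is_fraud`), the helper name `preprocess`, and the choice of median fill and one-hot encoding are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not the Kaggle schema.
raw = pd.DataFrame({
    "amount": [120.0, None, 35.5, 980.0],
    "merchant_type": ["online", "retail", None, "online"],
    "is_fraud": [0, 0, 0, 1],
})

def preprocess(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill missing values, then one-hot encode the non-numeric feature columns."""
    out = df.copy()
    feature_cols = [c for c in out.columns if c != target]
    for col in feature_cols:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: the median is a reasonable fill for numeric gaps.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: missing categories get their own "unknown" bucket.
            out[col] = out[col].fillna("unknown")
    text_cols = [c for c in feature_cols if not pd.api.types.is_numeric_dtype(out[c])]
    return pd.get_dummies(out, columns=text_cols)

clean = preprocess(raw, target="is_fraud")
print(clean)
```

After this step every feature column is numeric or boolean and contains no gaps, which is the form most machine learning libraries expect.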

The project follows the process shown in the figure below to create a machine learning model.

Key Components

Processing input data

For a machine learning model to predict whether a transaction is fraudulent, it needs input data. This data comes as a .csv file containing records related to credit card fraud. The data is then cleaned so the model can make sense of it.

Training a model to predict fraud

Multiple models will be trained with different machine learning algorithms. Each algorithm takes its own technical approach to predicting fraudulent outcomes.
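A rough sketch of training several models and timing each one might look like the following, using scikit-learn. The synthetic data from `make_classification` stands in for the cleaned input, and the two algorithms shown are only examples; the Decision Tree in particular is an assumption, not necessarily one of the models used in this project.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data: roughly 5% of rows are labeled 1 ("fraud").
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative candidates only; swap in whichever algorithms are being compared.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[name] = {"model": model, "train_seconds": elapsed}
    print(f"{name}: trained in {elapsed:.3f}s")
```

Recording the training time next to each fitted model makes it possible to break ties between equally accurate models later on.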

Evaluating the machine learning model

The different machine learning models are compared with each other. The model that most accurately predicts fraudulent transactions will be chosen.
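One way to make this comparison concrete is to measure, for each model, the share of actual fraud cases it flags correctly. The sketch below assumes that definition of "prediction rate" (it is usually called recall); the labels, predictions, and the names `model_a` and `model_b` are made up for illustration.

```python
def fraud_detection_rate(y_true, y_pred):
    """Share of actual fraud cases (label 1) that were also predicted as fraud."""
    preds_for_fraud = [pred for truth, pred in zip(y_true, y_pred) if truth == 1]
    if not preds_for_fraud:
        raise ValueError("y_true contains no fraud cases")
    caught = sum(1 for pred in preds_for_fraud if pred == 1)
    return caught / len(preds_for_fraud)

# Hypothetical true labels and predictions from two made-up models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 0, 1, 0],
    "model_b": [1, 0, 0, 0, 0, 1, 1, 0],
}

scores = {name: fraud_detection_rate(y_true, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Measuring on the fraud class alone matters because a model that labels every transaction as not fraudulent would still score high on overall accuracy when fraud is rare.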

Methods to improve fraud detection

There may be far fewer cases of fraud than regular transactions. This imbalance of data can affect the model's prediction accuracy. To address it, the model can be improved by applying different data-balancing techniques, such as oversampling.
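As a minimal sketch of one such technique, random oversampling duplicates rows from the rarer class until both classes are the same size. The function name `random_oversample` and the toy rows are hypothetical, and the example assumes fraud (label 1) is the rarer class.

```python
import random

def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    if not minority:
        raise ValueError("no rows carry the minority label")
    majority_count = sum(1 for label in labels if label != minority_label)
    extra_needed = max(0, majority_count - len(minority))
    # Sample with replacement from the existing minority rows.
    extra = [rng.choice(minority) for _ in range(extra_needed)]
    return list(rows) + extra, list(labels) + [minority_label] * len(extra)

# Hypothetical training set: 8 regular transactions and 2 fraudulent ones.
rows = [[float(i)] for i in range(10)]
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Resampling like this is normally applied to the training split only, so that the evaluation data keeps the real-world class balance.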

Preview of Results

  • The number of fraudulent transactions is much smaller than the number of non-fraudulent transactions. This imbalance of data can be fixed through techniques such as oversampling, but fixing the imbalance does not necessarily lead to more accurate predictions of fraud.

  • Seven different machine learning models were trained to predict fraud. Three of them reached a correct fraud prediction rate of 92.05%, but the most efficient model is Logistic Regression because it had the shortest training time.

  • The table shows the target column, which contains a 1 for a fraudulent transaction or a 0 for a non-fraudulent transaction. The next five columns are what I have determined to be the biggest indicators of fraud according to the input data.