Internet Financial Fraud Detection Using a Distributed Big Data Approach with Node2vec

Abstract

With the rapid growth of online transactions, financial fraud has become a significant threat. Traditional detection methods struggle to process massive, rapidly evolving datasets. This project proposes a scalable and accurate fraud detection system using a distributed big data framework integrated with Node2vec, a graph-based machine learning technique that captures relational and structural information about transaction networks. The model transforms transaction data into graphs, embeds node features via Node2vec, and classifies transactions using advanced ML models such as XGBoost and Random Forest. The system leverages Apache Spark for distributed processing, ensuring high performance on large datasets.

1. Data Ingestion

This module is responsible for collecting and ingesting financial transaction data, either in real time or in batches, from a variety of sources, including digital wallet (UPI) systems, credit card statements, and bank transaction logs. Apache Kafka and HDFS (Hadoop Distributed File System) are used for fault-tolerant, high-throughput ingestion of streaming and historical data.

Subfolders:

raw_data_samples/ → Contains sample raw data files from different sources (CSV, JSON, etc.).

kafka_producers/ → Scripts to simulate or connect with real-time transaction streams.

ingestion_logs/ → Logging mechanism to trace ingestion events, failures, retries, etc.
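
As an illustration of what a kafka_producers/ script could look like, here is a minimal sketch of a producer that simulates a transaction stream. The topic name, broker address, and event fields are hypothetical, and it assumes the kafka-python client and a local Kafka broker:

```python
# kafka_producers/simulate_transactions.py (illustrative sketch)
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address for local testing.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Synthetic transaction event; real producers would read from UPI/bank feeds.
    event = {
        "txn_id": random.randint(1, 10**9),
        "user_id": f"U{random.randint(1, 5000)}",
        "merchant_id": f"M{random.randint(1, 800)}",
        "amount": round(random.uniform(1, 5000), 2),
        "timestamp": time.time(),
    }
    producer.send("transactions", value=event)  # hypothetical topic name
    time.sleep(0.1)  # ~10 events per second
```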

2. Data Preprocessing

This folder includes all processes related to cleaning the raw transaction data, handling missing values, detecting and correcting anomalies, and creating derived features (feature engineering). Distributed processing is achieved using PySpark, enabling fast operations even on terabytes of data.

Subfolders:

spark_cleaning_scripts/ → PySpark scripts for null removal, type casting, and transformation.

feature_engineering/ → Code to extract useful features such as transaction frequency, merchant category, etc.

preprocessing_notebooks/ → Jupyter/Colab notebooks for EDA and sanity checks.

logs/ → Spark execution logs and performance stats.
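
A minimal sketch of a spark_cleaning_scripts/ job is shown below. The input path and column names (user_id, merchant_id, amount) are assumptions and would need to match the actual dataset schema:

```python
# spark_cleaning_scripts/clean_transactions.py (illustrative sketch)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-preprocessing").getOrCreate()

# Hypothetical input path and schema; adjust to the real dataset.
raw = spark.read.csv("hdfs:///fraud/raw/transactions.csv", header=True, inferSchema=True)

clean = (
    raw.dropna(subset=["user_id", "merchant_id", "amount"])   # remove incomplete rows
       .withColumn("amount", F.col("amount").cast("double"))  # type casting
       .filter(F.col("amount") > 0)                           # drop obviously invalid amounts
)

# Simple derived features: per-user transaction frequency and average amount.
user_features = clean.groupBy("user_id").agg(
    F.count("*").alias("txn_count"),
    F.avg("amount").alias("avg_amount"),
)

clean.write.mode("overwrite").parquet("hdfs:///fraud/clean/transactions")
user_features.write.mode("overwrite").parquet("hdfs:///fraud/features/user_features")
```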

3. Graph Construction

In this phase, transactional data is transformed into a graph structure. Each node represents a user or merchant, and each edge represents a transaction. The graph captures the relational and structural interaction patterns essential for effective fraud detection.

Subfolders:

graph_creation_scripts/ → Python or PySpark code to create graphs using NetworkX or GraphFrames.

edge_list_files/ → CSV or Parquet files listing transaction-based edges with attributes like amount, time, and frequency.

sample_graphs/ → Visualization images and serialized graph objects for exploration.
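
A small sketch of how graph_creation_scripts/ might build the user–merchant graph from a cleaned edge list, assuming the NetworkX route with pandas and hypothetical column and file names:

```python
# graph_creation_scripts/build_graph.py (illustrative sketch)
import pickle

import networkx as nx
import pandas as pd

# Hypothetical edge list exported by the preprocessing stage.
edges = pd.read_parquet("data/edge_list_files/transactions.parquet")

G = nx.Graph()
for row in edges.itertuples(index=False):
    # Nodes are users and merchants; edge attributes carry transaction context.
    G.add_edge(
        f"user_{row.user_id}",
        f"merchant_{row.merchant_id}",
        amount=row.amount,
        timestamp=row.timestamp,
    )

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
with open("data/sample_graphs/transaction_graph.gpickle", "wb") as f:
    pickle.dump(G, f)  # serialized graph object for the embedding step
```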

4. Node2vec Embeddings

This module learns low-dimensional representations (embeddings) for each node (user/merchant) using Node2vec, which preserves graph structure and node behavior in vector form. These embeddings are used for downstream ML tasks.

Subfolders:

node2vec_training/ → Scripts to train Node2vec using different hyperparameters (walk length, dimensions, etc.).

embeddings_output/ → Serialized embedding vectors in .csv/.npy format.

analysis/ → Visualization of embeddings using PCA/t-SNE and clustering analysis.
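
A minimal node2vec_training/ sketch, assuming the community node2vec package (which wraps gensim's Word2Vec) and the pickled graph produced in the graph construction step; hyperparameter values are illustrative:

```python
# node2vec_training/train_embeddings.py (illustrative sketch)
import pickle

from node2vec import Node2Vec  # pip install node2vec

with open("data/sample_graphs/transaction_graph.gpickle", "rb") as f:
    G = pickle.load(f)

# Illustrative hyperparameters; tune walk_length, num_walks, p, and q on the real graph.
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=100, p=1.0, q=0.5, workers=4)
model = node2vec.fit(window=10, min_count=1)

# Persist embeddings for the downstream classifiers (node id followed by its vector).
model.wv.save_word2vec_format("data/embeddings_output/node_embeddings.txt")
```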

5. Fraud Detection Model

This core component trains and evaluates machine learning models on the embedded transaction data to classify fraudulent and non-fraudulent behavior. Models like XGBoost, Random Forest, and Logistic Regression are used and compared.

Subfolders:

train_test_split/ → Scripts for splitting the dataset using stratified sampling or k-fold cross-validation.

model_training/ → ML training pipelines with hyperparameter tuning.

model_evaluation/ → Code for computing metrics like accuracy, precision, recall, F1-score, and AUC.

saved_models/ → Serialized models (e.g., .pkl, .joblib).

comparison_results/ → Performance comparison with/without Node2vec and between models.
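
A compact model_training/ sketch follows. It assumes a feature table in which each transaction row carries the Node2vec embeddings of its user and merchant plus a binary is_fraud label; the file path and column names are hypothetical:

```python
# model_training/train_xgboost.py (illustrative sketch)
import joblib
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # pip install xgboost

# Hypothetical feature table: embedding columns plus an is_fraud label.
data = pd.read_parquet("data/features/transactions_with_embeddings.parquet")
X = data.drop(columns=["is_fraud"])
y = data["is_fraud"]

# Stratified split to preserve the (typically heavy) class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    # Up-weight the minority (fraud) class to counter imbalance.
    scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
joblib.dump(model, "saved_models/xgb_node2vec.joblib")
```

Random Forest and Logistic Regression baselines can be trained with the same split by swapping the classifier class, which is what makes the with/without-Node2vec comparison straightforward.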

6. Evaluation Metrics

Dedicated folder for performance analysis and comparison of model results across different configurations, embedding techniques, or datasets.

Subfolders:

confusion_matrices/ → Graphs and CSVs for confusion matrix outputs.

roc_curves/ → AUC-ROC plots for various models.

precision_recall_plots/ → Precision-recall tradeoff graphs.

model_report_docs/ → PDFs or markdown files summarizing results and observations.
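
A short sketch of how the reported metrics might be computed with scikit-learn, assuming a fitted classifier and the held-out test split from the training step:

```python
# model_evaluation/compute_metrics.py (illustrative sketch)
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate(model, X_test, y_test):
    """Return the metrics used in the comparison tables for one fitted model."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
    }
```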

7. Deployment (optional but recommended)

This module enables real-time or batch predictions via APIs. It also supports visual dashboards using Streamlit or Dash to show fraud patterns, alerts, and trends.

Subfolders:

api_service/ → Flask or FastAPI backend to serve ML model predictions.

streamlit_dashboard/ → Python scripts for interactive dashboards.

deployment_config/ → Dockerfiles, requirements.txt, deployment scripts.

frontend_static/ → Optional HTML/CSS/JS if you integrate any static pages.
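
An api_service/ sketch of a prediction endpoint, assuming the FastAPI route and the serialized model from the training step; the request schema (a precomputed feature vector per transaction) is a simplifying assumption:

```python
# api_service/app.py (illustrative sketch)
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Fraud Detection API")
model = joblib.load("saved_models/xgb_node2vec.joblib")  # trained classifier

class Transaction(BaseModel):
    # Hypothetical payload: precomputed embedding/feature values for one transaction.
    features: list[float]

@app.post("/predict")
def predict(txn: Transaction):
    X = np.array(txn.features).reshape(1, -1)
    prob = float(model.predict_proba(X)[0, 1])
    return {"fraud_probability": prob, "is_fraud": prob > 0.5}
```

During development this can be served with, for example, uvicorn api_service.app:app --reload.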

8. Dataset

Contains datasets used for training, testing, and validating the models. Also includes any synthetic data generation code.

Subfolders:

kaggle_dataset/ → Downloaded and preprocessed Kaggle fraud dataset files.

ieee_dataset/ → IEEE-CIS data files and schema documentation.

synthetic_generator/ → Scripts to generate synthetic transaction data with labeled fraud.

schema_docs/ → YAML or JSON descriptions of all dataset fields and types.
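
A tiny synthetic_generator/ sketch that produces labeled transactions for testing the pipeline end to end; the field names, fraud rate, and amount distributions are purely illustrative:

```python
# synthetic_generator/make_data.py (illustrative sketch)
import random

import pandas as pd

def generate(n_rows: int = 10_000, fraud_rate: float = 0.02) -> pd.DataFrame:
    rows = []
    for i in range(n_rows):
        is_fraud = random.random() < fraud_rate
        rows.append({
            "txn_id": i,
            "user_id": f"U{random.randint(1, 2000)}",
            "merchant_id": f"M{random.randint(1, 300)}",
            # Toy assumption: fraudulent transactions skew toward larger amounts.
            "amount": round(random.uniform(500, 5000) if is_fraud else random.uniform(1, 800), 2),
            "is_fraud": int(is_fraud),
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    generate().to_csv("dataset/synthetic_transactions.csv", index=False)
```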

9. Documentation

Project documentation, papers, design specs, meeting notes, and future roadmap documents go here.

Subfolders:

project_proposal/ → Initial proposal, problem statement, and objectives.

tech_stack_docs/ → Explanations of tools used (Kafka, Spark, Node2vec, etc.).

research_papers/ → PDF papers or links related to graph-based fraud detection.

readme_files/ → Main README.md for GitHub and other supporting documentation.

10. Utilities

Miscellaneous helper scripts and utilities used across modules.

Subfolders:

config_files/ → YAML/JSON files for environment, paths, and model parameters.

logging_utils/ → Custom logging functions.

data_validation/ → Scripts for schema and integrity checks.

helper_functions/ → Utility functions (e.g., date formatting, metrics calculation).
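
A small sketch of a shared config loader that other modules could import, assuming PyYAML and an illustrative config_files/config.yaml layout:

```python
# helper_functions/load_config.py (illustrative sketch)
import yaml  # pip install pyyaml

def load_config(path: str = "config_files/config.yaml") -> dict:
    """Load environment paths and model parameters shared across modules."""
    with open(path) as f:
        return yaml.safe_load(f)

# Example config.yaml (illustrative keys only):
#   hdfs_raw_path: hdfs:///fraud/raw/
#   kafka_topic: transactions
#   node2vec:
#     dimensions: 64
#     walk_length: 30
```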

11. Results and Reports

This section stores final outcomes, reports, and presentations to be used for project submission or demonstration.

Subfolders:

final_results/ → Summary tables of model metrics and comparisons.

plots_and_charts/ → All final figures used in the report or dashboard.

presentation_slides/ → PPT or PDF for academic/industrial review.

final_report/ → The full report document, including abstract, methodology, results, and conclusion.

Summary:

This structure allows:

Parallel development (e.g., model building and API design simultaneously).

Scalability (easy to replace models, datasets, or visualization tools).

Reusability (Node2vec embeddings can be reused in other ML tasks).

Maintainability (clean logs, modularity, and documentation).