Team Members
@manaswini_k_f89ce497b2a92
@chandan_rohith
@phani_jayanth
@puneeth-0307
Introduction
With the rise of digital payments and online banking, financial transactions have increased massively. While this improves efficiency, it also makes fraud harder to detect.
Traditional auditing methods like manual checking and sampling are slow and not suitable for handling large volumes of data. Modern solutions use machine learning, but many rely on labelled data, are expensive to maintain, and are difficult for small and medium businesses to adopt. They also lack clear explanations for why a transaction is flagged.
This creates a need for a smarter, cost-effective auditing system that can work with large datasets while still allowing human validation.
Our Solution
To solve this, we built AuditHawk, an ML-powered auditing system designed to simplify fraud detection.
AuditHawk uses unsupervised learning to identify unusual patterns without needing labelled data. Instead of expensive real-time systems, it processes data in batches, making it more affordable and practical for SMEs.
The system detects anomalies and presents them through a simple dashboard, while still keeping humans in the loop to review flagged transactions. This helps reduce false positives and improves reliability.
We also use MongoDB to store and manage large volumes of audit data efficiently, thanks to its flexible and scalable structure.
Tech-Stack
Frontend: Vanilla JavaScript, HTML5, Tailwind CSS (served through a Flask proxy server)
Backend: Python / Django (utilizing Graphene for a strict GraphQL API)
Database: MongoDB (The absolute backbone of our data pipeline)
ML: Python-based ensemble models (Local Outlier Factor, Autoencoders, and Graph Topology)
Why We Chose MongoDB
When we first started building this, we knew that dealing with massive, unpredictable CSV files from different banks would be a nightmare in a rigid SQL database. We needed NoSQL flexibility to ingest chaotic data.
Every time an auditor uploads a massive CSV, MongoDB acts as the primary brain for data ingestion and cleanup. It stores the raw transactions securely, crunches the basic math to find obvious fraud patterns, and safely passes a refined, memory-safe list to our Python machine learning models. Using MongoDB, we solved our two biggest hurdles: server crashes and exploding cloud storage costs.
Key-Features
1. Using the MongoDB Aggregation Pipeline
Most fraud detection systems focus on big, obvious anomalies. But in reality, a lot of fraud happens through small, repeated transactions—like someone quietly taking ₹4–₹5 again and again.
We designed AuditHawk to catch exactly that.
Initially, our Python backend struggled because it tried to load millions of records into memory, which caused crashes. So we moved this processing to MongoDB’s aggregation pipeline, using stages like $match and $group.
This allowed the database itself to process millions of transactions efficiently, without overloading our server.
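As a minimal sketch, here is what such a pre-filter can look like (the field names `amount` and `account_id` and the thresholds are illustrative assumptions, not our exact schema):

```python
# Sketch of the "salami slicing" pre-filter as a MongoDB aggregation
# pipeline. Field names and thresholds are illustrative assumptions.

def salami_pipeline(max_amount=10, min_repeats=50):
    """Stages that surface accounts making many repeated small debits."""
    return [
        # Keep only small transactions; this runs inside MongoDB,
        # so millions of rows never reach Python's memory.
        {"$match": {"amount": {"$gt": 0, "$lte": max_amount}}},
        # Bucket by account and count the repeats.
        {"$group": {
            "_id": "$account_id",
            "count": {"$sum": 1},
            "total": {"$sum": "$amount"},
        }},
        # Only accounts with suspiciously many small debits survive.
        {"$match": {"count": {"$gte": min_repeats}}},
        {"$sort": {"count": -1}},
    ]

# On a live deployment this executes entirely in the database:
#   suspicious = db.transactions.aggregate(salami_pipeline(), allowDiskUse=True)
```

Because every stage runs server-side, Python only ever sees the small list of suspicious accounts, not the raw millions of rows.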
2. Explainable AI (XAI)
Instead of just flagging a transaction as suspicious, AuditHawk explains why.
For every flagged transaction, the system generates a simple, human-readable explanation. It highlights the risk factors and unusual patterns so auditors know exactly what to look at.
This makes the system more transparent and reduces the need for guesswork.
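At its simplest, this layer just templates the detector outputs into plain English. A rough sketch (the function and field names here are hypothetical, not our exact data model):

```python
def explain_flag(txn, score, reasons):
    """Template detector outputs into a plain-English explanation.
    `txn`, `score`, and the (factor, detail) pairs in `reasons`
    are illustrative names, not AuditHawk's exact data model."""
    lines = [f"Transaction {txn['id']} flagged (risk score {score:.2f}):"]
    for factor, detail in reasons:
        lines.append(f"  - {factor}: {detail}")
    return "\n".join(lines)

message = explain_flag(
    {"id": "TXN-1042"},
    0.91,
    [("Unusual amount", "40x below this account's median debit"),
     ("High frequency", "62 similar debits within 24 hours")],
)
```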
3. Automated Data Management (Using MongoDB TTL)
Storing large amounts of raw audit data can quickly become expensive.
To handle this, we used MongoDB’s TTL (Time-To-Live) indexes. Every uploaded transaction is automatically timestamped, and after a fixed period (like 90 days), unnecessary raw data is deleted in the background.
This keeps the database clean and efficient—without needing manual scripts or maintenance.
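Setting this up is essentially a one-liner in PyMongo. A sketch, assuming the timestamp field is called `created_at`:

```python
def create_ttl_index(collection, days=90):
    """Ask MongoDB to expire raw transactions automatically: a background
    task deletes any document whose `created_at` timestamp is older than
    the retention window. Flagged transactions are kept in a separate
    collection with no TTL index, so they are retained."""
    collection.create_index("created_at", expireAfterSeconds=days * 24 * 3600)

# On a live deployment:
#   create_ttl_index(db.raw_transactions, days=90)
```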
4. Graph-Based Fraud Detection
Sometimes a single transaction looks completely normal. But when you look at the bigger picture, patterns start to appear, like multiple users sending money to the same account.
To catch this kind of coordinated fraud, we used Python’s NetworkX library to map relationships between users, accounts, and merchants. Instead of looking at transactions in isolation, AuditHawk analyses how they are connected.
This helps us identify suspicious patterns like circular payments or hidden fraud networks: things that are almost impossible to spot in a regular spreadsheet.
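A simplified sketch of the idea with NetworkX (the `(sender, receiver, amount)` tuple format and the thresholds are assumptions for illustration):

```python
import networkx as nx

def find_sink_accounts(transfers, min_inflows=3):
    """Flag accounts that receive money from unusually many distinct
    senders (possible mule or "sinkhole" accounts). `transfers` is a
    list of (sender, receiver, amount) tuples -- an assumed format."""
    G = nx.DiGraph()
    for sender, receiver, amount in transfers:
        G.add_edge(sender, receiver, amount=amount)
    return [node for node in G.nodes if G.in_degree(node) >= min_inflows]

def find_circular_payments(transfers):
    """Detect circular money flow (A -> B -> C -> A)."""
    G = nx.DiGraph([(s, r) for s, r, _ in transfers])
    return list(nx.simple_cycles(G))
```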
Work-Flow
Step 1: Secure Upload
The auditor uploads a CSV file through the web dashboard and sets a risk threshold. The file is sent securely to the backend using a token-based request.
The frontend reads the file into a text string, packages it with a secure JWT token, and fires it off to our Django server via a GraphQL mutation.
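For illustration, here is roughly what that request looks like when built with Python's standard library; the mutation name, field names, and endpoint URL are assumptions, not the exact schema:

```python
import json
import urllib.request

def build_upload_request(csv_text, threshold, jwt_token,
                         url="http://localhost:8000/graphql"):
    """Package a CSV string into a GraphQL mutation request carrying a
    JWT. Mutation, field names, and URL are illustrative assumptions."""
    query = """
    mutation UploadCsv($csv: String!, $threshold: Float!) {
      uploadCsv(csvText: $csv, riskThreshold: $threshold) { ok }
    }"""
    body = json.dumps({
        "query": query,
        "variables": {"csv": csv_text, "threshold": threshold},
    })
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {jwt_token}",  # token-based auth
        },
    )
```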
Step 2: Store the Data
The backend processes the file and adds a timestamp to each transaction. This stamp is critical for our TTL data governance later.
The entire dataset is then stored in MongoDB for further analysis.
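The timestamping step is tiny but important. A sketch, assuming the field is named `created_at`:

```python
from datetime import datetime, timezone

def prepare_documents(rows):
    """Attach an ingestion timestamp to each parsed CSV row before the
    bulk insert. `created_at` is the assumed name of the field that the
    TTL index later uses for automatic expiry."""
    now = datetime.now(timezone.utc)
    return [{**row, "created_at": now} for row in rows]

# On a live deployment:
#   db.raw_transactions.insert_many(prepare_documents(parsed_rows))
```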
Step 3: Fast Pre-Filtering (MongoDB)
Before running any AI models, we use MongoDB’s aggregation pipeline to scan the data and quickly identify suspicious patterns like repeated small transactions (salami slicing).
This step reduces the dataset size and ensures faster, memory-efficient processing.
Step 4: AI Analysis
The filtered data is passed to our machine learning models, which detect anomalies and calculate risk scores. The system also generates clear explanations for each flagged transaction.
Step 5: Results Dashboard
Finally, the results are sent back to the dashboard, where the auditor can review flagged transactions, understand the reasons behind them, and take action.
Step 6: Automated Cleanup
The process doesn’t stop after analysis. To avoid storing unnecessary data, AuditHawk automatically cleans up old records.
Every transaction is timestamped when it’s stored in MongoDB, and after 90 days, the database automatically removes unused raw data using TTL (Time-To-Live) indexes.
This keeps storage costs low and ensures the system stays efficient—while important flagged transactions are safely retained for future reference.
ML Workflow
Phase 1: Ingestion & pre-processing
The auditor uploads a CSV file through a GraphQL endpoint. Each transaction is timestamped and stored in MongoDB.
Before using ML, we run MongoDB’s aggregation pipeline to quickly detect patterns like repeated small transactions (salami attacks). This reduces the data size and avoids overloading Python.
Phase 2: Feature Engineering
The data is prepared for analysis by converting text fields into numbers and extracting time-based features. Transactions identified earlier are also flagged for quick reference.
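A condensed sketch of this step with pandas (the column names `amount`, `timestamp`, and `merchant` are illustrative):

```python
import pandas as pd

def engineer_features(df):
    """Turn raw transaction columns into model-ready numbers.
    Column names (amount, timestamp, merchant) are illustrative."""
    out = pd.DataFrame()
    out["amount"] = df["amount"].astype(float)
    ts = pd.to_datetime(df["timestamp"])
    out["hour"] = ts.dt.hour                                # time-of-day
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)  # Sat/Sun flag
    # Encode the text field as integer category codes.
    out["merchant_code"] = df["merchant"].astype("category").cat.codes
    return out
```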
Phase 3: ML Analysis
Since fraud data is rarely labelled, AuditHawk uses multiple unsupervised models to learn what “normal” looks like and detect deviations.
• Local Outlier Factor (LOF): Finds transactions that don’t match normal behaviour.
• Autoencoders: Attempts to reconstruct transactions—if it fails, the transaction is likely abnormal.
• Graph Topology: Analyses relationships between accounts to detect suspicious networks or circular money flow.
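As one concrete example from the ensemble, here is how the LOF stage can be wired up with scikit-learn (the hyperparameters are illustrative, not our tuned values):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_outlier_mask(features, contamination=0.01):
    """Fit LOF on the feature matrix and return a boolean anomaly mask.
    LOF compares each point's local density with that of its neighbours;
    points sitting in much sparser regions are labelled outliers.
    Hyperparameters here are illustrative, not tuned values."""
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = lof.fit_predict(features)   # -1 = outlier, 1 = inlier
    return labels == -1
```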
Phase 4: Risk Scoring & Explainable AI
The system combines outputs into a risk score and generates simple explanations for each flagged transaction. Only suspicious data is stored back in MongoDB.
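A toy sketch of the scoring step; the weights and threshold below are placeholders, not the values used in AuditHawk:

```python
def combine_risk(lof_flag, recon_error, graph_flag,
                 weights=(0.4, 0.4, 0.2), threshold=0.7):
    """Blend the three detector outputs into a single 0-1 risk score.
    The weights and threshold are placeholders, not tuned values."""
    recon_norm = min(max(recon_error, 0.0), 1.0)  # clip reconstruction error
    score = (weights[0] * float(lof_flag)
             + weights[1] * recon_norm
             + weights[2] * float(graph_flag))
    return score, score >= threshold
```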
Phase 5: Human Review
Auditors review the results, take action, and mark trusted transactions. This helps the system improve over time and reduce false positives.

Machine Learning workflow used in AuditHawk
Challenges we Faced
Challenge 1: Server Crashes with Large Data
The Problem:
In the early stages, we used Python (Pandas) to detect salami attacks. But when we tested with a large dataset (around 5 million rows), the entire file was loaded into memory at once. This caused our server to run out of memory and crash.
The Fix:
We realized Python shouldn’t handle heavy data processing. So, we shifted this logic to MongoDB using its aggregation pipeline.
This allowed the database to process large datasets efficiently without loading everything into memory, completely solving the crash issue.
Challenge 2: Database Growing Too Fast
The Problem:
As more CSV files were uploaded, we realized our database was growing very quickly. Storing millions of normal (non-fraud) transactions long-term would lead to high cloud storage costs and unnecessary data buildup.
The Fix:
To solve this, we used MongoDB’s TTL (Time-To-Live) indexes.
Each transaction is automatically timestamped, and after 90 days, unnecessary raw data is deleted in the background. Only important flagged transactions are retained for future reference.
Conclusion
AuditHawk shows how auditing can move beyond slow, manual processes into a smarter and more efficient system. By combining unsupervised machine learning, scalable data handling, and a human-in-the-loop approach, we were able to build a solution that is both practical and reliable.
Instead of relying on expensive, complex setups, AuditHawk focuses on efficiency: processing large datasets, detecting hidden fraud patterns, and keeping costs under control using tools like MongoDB.
While there is still room for improvement, this project highlights how the right mix of AI and system design can make auditing faster, clearer, and more accessible, especially for smaller organizations.
Try It Yourself
Chandan-Rohith / AuditHawk: Automated Financial Fraud Detection System for SMEs.
The Tech Stack
Frontend: Vanilla JavaScript, HTML5, Tailwind CSS, Flask (Proxy Server).
Backend API: Django, GraphQL (Graphene).
Database: MongoDB (using PyMongo with atomic, session-based transactions).
Machine Learning: PyTorch/TensorFlow (Autoencoders), Scikit-Learn (Local Outlier Factor), NetworkX (Graph Theory).
Client-Side Libraries: html2canvas & jsPDF (for compliance reporting).
The Main Components
- The Ingestion Engine (CSV Parser): A secure data pipeline that ingests raw, unstructured corporate financial data, cleanses the data types, and prepares it for multi-dimensional feature extraction.
- The Independent ML Decision Matrix (The Brains): An unsupervised machine learning orchestrator that routes data through four distinct, parallel mathematical engines:
  - Deep Learning Autoencoder: Learns the behavioral "embedding" of every employee to catch Account Takeovers (ATOs).
  - Local Outlier Factor (LOF): Calculates multi-dimensional spatial density to catch mathematically manufactured clusters of transactions.
  - NetworkX Graph Topology: Maps the flow of money as nodes and edges to detect Shell Companies and Sinkholes.
  - Temporal & Velocity Rules Engine: Tracks…
Demo
Special Mention
We would like to extend our heartfelt gratitude to our mentor, @chanda_rajkumar, for his invaluable guidance and continuous support throughout the development of AuditHawk. His technical insights, critical feedback, and encouragement helped us shape this project.
This article was originally published by DEV Community and written by Chinnabathini Chandan Rohith.