description: "A complete technical breakdown of how Vigil AI detects AI-generated voices and deepfake video calls using MFCC analysis, pitch jitter, MediaPipe landmarks, Fourier transforms, and optical flow — with all the code decisions explained."
tags: python, machinelearning, ai, security
canonical_url: https://vigilai.online
I Built a Real-Time Deepfake Detector in Python — 9 Signal Layers, Full Architecture, Free to Use
Deepfake fraud crossed $860 million in losses in 2024. Every major solution on the market costs thousands of dollars per month. So I built one that's free — and open about exactly how it works.
This is a complete technical breakdown of Vigil AI — the architecture, the signal selection decisions, the code, the tradeoffs, and everything I learned building a real-time audio and video deepfake detector as a solo developer.
Live demo: www.vigilai.online
Why These 9 Signals? The Decision-Making Process
Most deepfake detectors are black boxes. They train a neural network, it outputs a probability, done. That is fine for accuracy but terrible for:
- Explainability — banks and enterprises need to know why a call was flagged
- False positive debugging — when someone legitimate gets flagged, you need to know which signal triggered
- Incremental improvement — you can improve one signal at a time without retraining everything
- Running on CPU — neural-network inference typically needs a GPU to stay real-time; weighted rule signals run on any ₹400/month VPS
Here is how I selected each signal.
Audio Detection Architecture
Signal 1 & 2: MFCC Variance and MFCC Delta Variance
Why MFCC? Mel-Frequency Cepstral Coefficients are the standard representation of audio for speech processing. They capture the shape of the vocal tract's frequency response — essentially a fingerprint of how sound is being produced.
The key insight: AI voice synthesis models are trained to reproduce the mean spectral characteristics of a voice. They are very good at this. What they consistently fail to reproduce is the variance — the natural randomness and micro-variation in how a real human produces speech moment to moment.
```python
import librosa
import numpy as np

def extract_mfcc_signals(y, sr):
    # Extract 40 MFCC coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    mfcc_var = float(np.var(mfcc))
    # Delta = rate of change of MFCCs
    # AI voices change too smoothly — delta variance is too low
    mfcc_delta = librosa.feature.delta(mfcc)
    delta_var = float(np.var(mfcc_delta))
    return mfcc_var, delta_var

# Thresholds determined empirically
# mfcc_var < 2800 → suspicious
# delta_var < 80 → suspicious
```
I give MFCC Variance a weight of 3 because it is the single most reliable signal across the test data I have run. An AI voice almost always fails this check.
Signal 3: Pitch Jitter — The Most Powerful Signal
This took me the longest to get right and has the highest detection accuracy of anything I have built.
The physics: Human pitch (fundamental frequency, F0) is controlled by the tension and mass of the vocal folds, which are in turn controlled by tiny muscles with their own mechanical variability. This creates micro-variations in pitch — called jitter — that are always present in real speech.
AI voice synthesis models smooth out these variations. They produce a pitch contour that follows the learned prosody patterns but without the micro-level noise that organic vocal production creates.
```python
def extract_pitch_jitter(y, sr):
    # pyin is more accurate than yin for pitch tracking
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        sr=sr,
        fmin=librosa.note_to_hz('C2'),  # ~65 Hz — lowest human voice
        fmax=librosa.note_to_hz('C7'),  # ~2093 Hz — highest human voice
    )
    # Only measure jitter on voiced frames
    voiced_f0 = f0[voiced_flag] if voiced_flag is not None else np.array([])
    if len(voiced_f0) > 10:
        # np.diff gives the frame-to-frame pitch change
        # real voices: std of these changes is > 0.003
        # AI voices: std is near-zero — too smooth
        pitch_jitter = float(np.std(np.diff(voiced_f0[~np.isnan(voiced_f0)])))
    else:
        pitch_jitter = 0.0  # No voiced frames = suspicious
    return pitch_jitter

# Threshold: pitch_jitter < 0.003 → suspicious
# Weight: 3x (highest weight in the system)
```
I discovered this signal by accident. I was listening to a flagged audio sample that passed all other checks and noticed it sounded "robotic" in a way I could not articulate. I plotted the F0 contour and it looked like a sine wave — perfectly smooth. Real speech looks like a noisy mountain range.
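To make that "sine wave vs. noisy mountain range" difference concrete, here is a small numpy sketch using synthetic F0 contours rather than real pyin output (the values and noise levels are illustrative only, and the absolute numbers are not comparable to the 0.003 threshold above, which applies to normalized pyin output):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2, 200)

# Synthetic "AI" contour: a smooth prosody curve around 120 Hz
smooth_f0 = 120 + 10 * np.sin(2 * np.pi * 0.5 * t)

# Synthetic "human" contour: the same prosody plus frame-level jitter
jittery_f0 = smooth_f0 + rng.normal(0, 0.5, size=t.shape)

def jitter(f0):
    # Same metric as extract_pitch_jitter: std of frame-to-frame change
    return float(np.std(np.diff(f0)))

print(jitter(smooth_f0))   # small: the contour is too smooth
print(jitter(jittery_f0))  # several times larger: organic micro-variation
```

The metric ignores the overall prosody curve (which both contours share) and responds only to the frame-to-frame noise that synthesis models smooth away.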
Signal 4: Harmonic Ratio
```python
def extract_harmonic_ratio(y):
    # Separate harmonic and percussive components
    harmonic, percussive = librosa.effects.hpss(y)
    # AI voices are over-harmonic — they are too "clean"
    # Real speech has significant percussive content from
    # consonants, breath noise, lip sounds, mouth clicks
    harmonic_ratio = float(
        np.mean(np.abs(harmonic)) / (np.mean(np.abs(percussive)) + 1e-8)
    )
    return harmonic_ratio

# Threshold: harmonic_ratio > 6.0 → suspicious (too clean)
# Weight: 2x
```
Signals 5–9: Supporting Signals
```python
def extract_supporting_signals(y, sr):
    # Zero Crossing Rate — how often the waveform crosses zero
    # AI voices have unnatural ZCR patterns
    zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)) * 1000)

    # Spectral Centroid — "center of mass" of the spectrum
    # AI voices have an unnaturally stable centroid
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    centroid_std = float(np.std(centroid))

    # Chroma Variation — organic pitch change patterns
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_var = float(np.var(chroma))

    # RMS Energy — AI voices have unnaturally uniform volume
    rms = librosa.feature.rms(y=y)
    rms_var = float(np.var(rms) * 1e6)

    # Spectral Flatness — AI voices are overly tonal
    flatness = float(np.mean(librosa.feature.spectral_flatness(y=y)))

    return zcr, centroid_std, chroma_var, rms_var, flatness
```
The Weighted Scoring System
This is the core design decision that separates Vigil AI from naive threshold-based detection:
```python
WEIGHTS = {
    "MFCC Variance": 3,
    "Pitch Jitter": 3,
    "MFCC Delta Var": 2,
    "RMS Consistency": 2,
    "Harmonic Ratio": 2,
    "Zero Crossing Rate": 1,
    "Spectral Centroid": 1,
    "Chroma Variation": 1,
    "Spectral Flatness": 1,
}

MAX_WEIGHT = sum(WEIGHTS.values())  # = 17

def compute_verdict(signals_flagged):
    weighted_score = sum(
        WEIGHTS[name] for name, _, _, flagged in signals_flagged if flagged
    )
    confidence = weighted_score / MAX_WEIGHT
    # Trigger if weighted suspicious score exceeds 35% of maximum
    is_fake = weighted_score >= int(MAX_WEIGHT * 0.35)
    return is_fake, confidence

# Example: only MFCC Variance and Pitch Jitter flagged
# weighted_score = 3 + 3 = 6
# confidence = 6/17 ≈ 0.35
# is_fake = True (6 >= int(17 * 0.35) = 5)
# This correctly catches a case where only the two strongest signals fire
```
The threshold of 35% was determined empirically. At 30% there are too many false positives. At 40% some real deepfakes slip through.
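The effect of moving that threshold can be checked directly against the scoring arithmetic. This small sketch reuses the 17-point maximum from the weighting table above (the example scores are hypothetical calls, not measured data):

```python
MAX_WEIGHT = 17  # sum of the nine signal weights above

def is_fake(weighted_score, threshold):
    # Same trigger rule as compute_verdict
    return weighted_score >= int(MAX_WEIGHT * threshold)

# A call where MFCC Variance (3) and Harmonic Ratio (2) fired: score = 5
print(is_fake(5, 0.35))  # True  — cutoff is int(17 * 0.35) = 5
print(is_fake(5, 0.40))  # False — cutoff rises to int(17 * 0.40) = 6

# A call where both 3-point signals fired: score = 6
print(is_fake(6, 0.35))  # True
```

The borderline cases, where a strong signal plus one supporting signal fire, are exactly the ones a 40% threshold starts missing.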
Video Detection Architecture
Blink Detection with MediaPipe
```python
import mediapipe as mp
import cv2
import numpy as np

def analyze_blink_pattern(frames):
    mp_face = mp.solutions.face_mesh
    face_mesh = mp_face.FaceMesh(
        max_num_faces=1,
        refine_landmarks=True,
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5,
    )
    blink_scores = []
    for frame in frames:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = face_mesh.process(rgb)
        if result.multi_face_landmarks:
            lm = result.multi_face_landmarks[0].landmark
            # Eye Aspect Ratio (EAR) using MediaPipe landmark indices
            # Upper eyelid: landmark 159, lower eyelid: landmark 145
            upper = lm[159].y
            lower = lm[145].y
            # Normalize by face height to handle different distances
            face_height = abs(lm[10].y - lm[152].y) + 1e-6
            ear = abs(upper - lower) / face_height
            blink_scores.append(ear)
    face_mesh.close()
    if not blink_scores:
        return True, 0.15  # No face detected — treat as suspicious
    max_ear = max(blink_scores)
    # A blink occurs when EAR drops below ~78% of the maximum open-eye EAR
    # Real humans: at least one blink per 30 frames at normal FPS
    blinked = any(score < max_ear * 0.78 for score in blink_scores)
    avg_ear = float(np.mean(blink_scores))
    return blinked, avg_ear
```
Why this works: Deepfake face replacement models are trained primarily on open-eye frames (blinking frames are rare and often discarded during training data curation). The result is that deepfake faces either do not blink at all, or blink at unnaturally uniform intervals.
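The code above only checks whether any blink occurred; the second failure mode, blinking at unnaturally uniform intervals, needs one more statistic. Here is a sketch of how that could be measured; the `blink_interval_cv` helper and the interpretation cutoffs are my own illustration, not part of Vigil AI:

```python
import numpy as np

def blink_interval_cv(blink_frames):
    """Coefficient of variation of the gaps between blink frame indices.

    Real blinking is irregular (high CV). A deepfake that blinks on a
    fixed schedule produces near-identical gaps (CV near zero).
    """
    gaps = np.diff(np.asarray(blink_frames, dtype=float))
    if len(gaps) < 2:
        return None  # Not enough blinks to judge regularity
    return float(np.std(gaps) / (np.mean(gaps) + 1e-8))

# Metronome blinking, every 90 frames exactly: suspiciously uniform
print(blink_interval_cv([0, 90, 180, 270]))       # 0.0
# Irregular, human-style blinking: CV well above zero
print(blink_interval_cv([0, 55, 180, 210, 340]))
```

The blink frame indices would come from the same EAR dip test used in `analyze_blink_pattern`, recording the frame number whenever the EAR drops below the open-eye baseline.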
Skin Hue Variance — The Signal I Am Most Proud Of
This signal came from a paper I found about GAN-generated image detection. The core insight: GANs produce skin with unnaturally uniform color distribution. Real human skin has subsurface scattering, shadow gradients, slight warmth variations, and imperfections that create high variance in the hue and saturation channels.
```python
def analyze_skin_hue_variance(frame, face_landmarks):
    h, w = frame.shape[:2]
    lm = face_landmarks.landmark
    # Extract face bounding box from landmarks
    x1 = int(lm[234].x * w)  # Left cheek
    y1 = int(lm[10].y * h)   # Top of face
    x2 = int(lm[454].x * w)  # Right cheek
    y2 = int(lm[152].y * h)  # Chin
    # Clamp to frame bounds
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    if x2 <= x1 or y2 <= y1:
        return 200.0  # Cannot crop — return neutral value
    face_crop = frame[y1:y2, x1:x2]
    # Convert to HSV — hue and saturation variance are key
    hsv = cv2.cvtColor(face_crop, cv2.COLOR_BGR2HSV)
    hue_var = float(np.var(hsv[:, :, 0]))         # Hue channel variance
    saturation_var = float(np.var(hsv[:, :, 1]))  # Saturation channel variance
    # Combined skin variance score
    skin_var = hue_var + saturation_var
    return skin_var

# Threshold: skin_var < 80 → suspicious (AI skin = too uniform)
# Weight: 2x
```
In my testing, this correctly flags deepfakes that pass the blink test — particularly high-quality GAN-generated faces that have learned to blink.
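The uniformity gap is easy to see even without OpenCV. This sketch fabricates two stand-ins for the hue channel of a face crop (values on OpenCV's 0-179 hue scale); the noise levels are made up for illustration and only demonstrate the variance principle, not real GAN statistics:

```python
import numpy as np

rng = np.random.default_rng(42)

# "GAN-like" skin: a flat hue with almost no texture
gan_like_hue = np.full((64, 64), 12.0) + rng.normal(0, 0.5, (64, 64))
# "Real-like" skin: the same base hue with organic variation
real_like_hue = np.full((64, 64), 12.0) + rng.normal(0, 8.0, (64, 64))

print(float(np.var(gan_like_hue)))   # near zero: too uniform
print(float(np.var(real_like_hue)))  # far higher: organic variation
```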
Optical Flow for Motion Consistency
```python
def analyze_optical_flow(frames):
    flow_variances = []
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Farneback optical flow — dense, captures all motion
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None,
                pyr_scale=0.5,  # Pyramid scale
                levels=3,       # Pyramid levels
                winsize=15,     # Window size
                iterations=3,
                poly_n=5,
                poly_sigma=1.2,
                flags=0,
            )
            # Magnitude of motion at each pixel
            magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            # Real video has consistent motion variance
            # Deepfake video often has a frozen background or erratic motion
            flow_variances.append(float(np.var(magnitude)))
        prev_gray = gray.copy()
    if not flow_variances:
        return 0.0, False
    flow_mean = float(np.mean(flow_variances))
    flow_std = float(np.std(flow_variances))
    # Flag if variance is extremely high (erratic) or extremely low (frozen)
    is_suspicious = (
        flow_std > flow_mean * 2.5  # Erratic motion
        or flow_mean < 0.001        # Completely frozen background
    )
    return flow_mean, is_suspicious
```
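The flagging rule at the end is independent of OpenCV, so it can be sanity-checked on synthetic per-frame variance lists (the numbers below are invented for illustration):

```python
import numpy as np

def flag_flow(flow_variances):
    # Same rule as analyze_optical_flow: flag erratic or frozen motion
    flow_mean = float(np.mean(flow_variances))
    flow_std = float(np.std(flow_variances))
    return flow_std > flow_mean * 2.5 or flow_mean < 0.001

print(flag_flow([0.8, 1.1, 0.9, 1.0]))      # False: steady, organic motion
print(flag_flow([0.0002, 0.0001, 0.0003]))  # True: frozen background
print(flag_flow([0.01] * 7 + [100.0]))      # True: one erratic spike
```

Note the erratic branch only fires when the spread is large relative to the mean, so a single extreme spike among otherwise still frames is what trips it.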
2D Fourier Transform for Pixel Artifacts
```python
def analyze_fft_energy(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # 2D FFT
    dft = np.fft.fft2(gray)
    # Shift zero-frequency component to center
    dft_shifted = np.fft.fftshift(dft)
    # Log magnitude spectrum
    magnitude = 20 * np.log(np.abs(dft_shifted) + 1)
    energy = float(np.mean(magnitude))
    # Real camera images have high-frequency components from sensor noise
    # AI-generated images lack these — FFT energy is too low or too high
    is_suspicious = energy < 135 or energy > 182
    return energy, is_suspicious
```
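The sensor-noise effect can be demonstrated with pure numpy. The synthetic images below are stand-ins (a clean gradient versus the same gradient with camera-like noise), and the empirical 135/182 thresholds do not apply to them; the point is only that high-frequency noise raises the mean log-magnitude statistic:

```python
import numpy as np

rng = np.random.default_rng(7)

def fft_energy(gray):
    # Same statistic as analyze_fft_energy, minus the tuned thresholds
    magnitude = 20 * np.log(np.abs(np.fft.fftshift(np.fft.fft2(gray))) + 1)
    return float(np.mean(magnitude))

# A smooth, "AI-ish" image: a clean gradient with no sensor noise
smooth = np.tile(np.linspace(0, 255, 128), (128, 1))
# The same gradient with camera-like sensor noise added
noisy = smooth + rng.normal(0, 3.0, smooth.shape)

print(fft_energy(smooth) < fft_energy(noisy))  # True: noise adds HF energy
```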
The Real-Time Architecture I Am Building
The current Streamlit prototype is great for testing but unsuitable for production. Here is the FastAPI architecture I am building:
```python
# main.py — FastAPI REST API
from fastapi import FastAPI, File, UploadFile, HTTPException, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import APIKeyHeader
import uvicorn

app = FastAPI(
    title="Vigil AI Detection API",
    description="Real-time deepfake and AI voice fraud detection",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["POST"],
    allow_headers=["*"],
)

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Depends(api_key_header)):
    # Verify against Supabase api_keys table
    # Track usage per key
    if not await check_key_valid(api_key):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/v1/detect/audio")
async def detect_audio(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
):
    """
    Analyze audio for AI-generated voice or cloned speech.
    Accepts: WAV, MP3, OGG, FLAC, M4A
    Returns: verdict, confidence, per-signal breakdown
    """
    audio_bytes = await file.read()
    result, error = analyze_audio(audio_bytes, file.filename)
    if error:
        raise HTTPException(status_code=422, detail=error)
    return {
        "verdict": "synthetic" if result["is_fake"] else "human",
        "confidence": round(result["confidence"], 4),
        "risk_level": "high" if result["confidence"] > 0.75 else (
            "medium" if result["confidence"] > 0.45 else "low"
        ),
        "signals": [
            {
                "name": name,
                "value": round(float(val), 4),
                "flagged": flagged,
                "weight": WEIGHTS.get(name, 1),
            }
            for name, val, _, flagged in result["flags"]
        ],
        "duration_seconds": round(result["duration"], 2),
        "weighted_score": result["weighted_score"],
        "max_score": result["max_weight"],
    }

@app.post("/v1/detect/video")
async def detect_video(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
):
    """
    Analyze video for deepfake face manipulation.
    Accepts: MP4, MOV, AVI, WebM
    Returns: verdict, confidence, per-signal breakdown
    """
    # ... video analysis implementation
    pass

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Example API call:
```bash
curl -X POST "https://api.vigilai.online/v1/detect/audio" \
  -H "X-API-Key: va_your_key_here" \
  -F "file=@suspicious_call.wav"
```
Response:
```json
{
  "verdict": "synthetic",
  "confidence": 0.7647,
  "risk_level": "high",
  "signals": [
    { "name": "MFCC Variance", "value": 1842.3, "flagged": true, "weight": 3 },
    { "name": "Pitch Jitter", "value": 0.0012, "flagged": true, "weight": 3 },
    { "name": "Harmonic Ratio", "value": 7.84, "flagged": true, "weight": 2 },
    { "name": "RMS Consistency", "value": 2.1, "flagged": true, "weight": 2 },
    { "name": "Zero Crossing Rate", "value": 28.4, "flagged": true, "weight": 1 },
    { "name": "Spectral Centroid", "value": 289.2, "flagged": false, "weight": 1 },
    { "name": "Chroma Variation", "value": 0.018, "flagged": false, "weight": 1 },
    { "name": "MFCC Delta Var", "value": 62.1, "flagged": true, "weight": 2 },
    { "name": "Spectral Flatness", "value": 0.0015, "flagged": false, "weight": 1 }
  ],
  "duration_seconds": 8.4,
  "weighted_score": 13,
  "max_score": 17
}
```
The Real-Time Call Detection SDK — Android Architecture
```kotlin
// VigilAICallMonitor.kt
class VigilAICallMonitor(
    private val context: Context,
    private val apiKey: String
) {
    private val SAMPLE_RATE = 16000
    private val CHUNK_DURATION_MS = 3000
    private val BUFFER_SIZE = AudioRecord.getMinBufferSize(
        SAMPLE_RATE,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )
    private var audioRecord: AudioRecord? = null
    private var isMonitoring = false
    private val webSocketClient = buildWebSocketClient()

    fun startMonitoring(onResult: (verdict: String, confidence: Float) -> Unit) {
        isMonitoring = true
        audioRecord = AudioRecord(
            MediaRecorder.AudioSource.VOICE_COMMUNICATION, // Call audio
            SAMPLE_RATE,
            AudioFormat.CHANNEL_IN_MONO,
            AudioFormat.ENCODING_PCM_16BIT,
            BUFFER_SIZE
        )
        audioRecord?.startRecording()
        CoroutineScope(Dispatchers.IO).launch {
            val chunk = ShortArray(SAMPLE_RATE * CHUNK_DURATION_MS / 1000)
            while (isMonitoring) {
                audioRecord?.read(chunk, 0, chunk.size)
                // Convert to WAV bytes and send to API
                val wavBytes = shortArrayToWav(chunk, SAMPLE_RATE)
                // Send via WebSocket for lowest latency
                webSocketClient.send(wavBytes)
                // Receive result
                val result = webSocketClient.receiveResult()
                withContext(Dispatchers.Main) {
                    onResult(result.verdict, result.confidence)
                }
            }
        }
    }

    fun stopMonitoring() {
        isMonitoring = false
        audioRecord?.stop()
        audioRecord?.release()
        audioRecord = null
    }
}
```
Performance and Limitations
Current performance (CPU inference, no ML model):
- Audio analysis: 800ms–2.5s depending on file length
- Video analysis: 3–8s for a 30-second clip
- Real-time call chunk (3s audio): ~200ms per chunk
Known limitations:
- High-quality neural voice synthesis (VALL-E X quality) may pass pitch jitter check if trained on enough data
- Videos with heavy compression (low bitrate) affect FFT and texture analysis
- MediaPipe struggles with non-frontal face angles beyond ~45°
- Threshold values need refinement against a proper benchmark dataset
What would make it significantly better:
- Training a scikit-learn or XGBoost classifier on ASVspoof 2019 (replacing threshold rules)
- Adding speaker verification — comparing the voice against a known reference sample
- GAN artifact detection using a pretrained ResNet (would require GPU inference)
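Replacing the threshold rules with a learned classifier is mostly plumbing once the nine signal values are extracted per clip. This sketch shows the shape of that pipeline with scikit-learn; the training data here is synthetic stand-in features, not ASVspoof, and the class separation is deliberately easy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for per-clip feature vectors: nine signal values per sample,
# with bona fide clips showing higher variance-type features than spoofs.
n = 400
human = rng.normal(loc=1.0, scale=0.3, size=(n, 9))
spoof = rng.normal(loc=0.3, scale=0.3, size=(n, 9))
X = np.vstack([human, spoof])
y = np.array([0] * n + [1] * n)  # 1 = spoofed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 2))  # high on this easy synthetic split

# Unlike fixed thresholds, the model gives a per-clip probability,
# and its coefficients show how each signal contributes to the verdict
print(clf.predict_proba(X_test[:1])[0])
```

On real ASVspoof data the features overlap far more, which is exactly why learned per-signal weights should beat hand-tuned cutoffs.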
What I Need Help With
I am a solo developer. There are things I can build alone and things that would benefit from collaboration:
Research collaborators: If you work in audio signal processing or GAN artifact detection research and want to contribute signal ideas or help with the ASVspoof benchmarking, reach out.
B2B pilots: If you work at a fintech, NBFC, or any company that handles voice or video KYC, I want to give you a free API pilot. No strings attached.
Open source contributions: The core detection library will be open-sourced. If you want to contribute additional signals, dataset evaluations, or language SDKs, watch the GitHub repo (launching soon).
Getting Started
Try the live demo: vigilai.online
Run locally:
```bash
git clone https://github.com/abhishekkumar/vigilai  # coming soon
cd vigilai
pip install -r requirements.txt

# Install ffmpeg (required for audio decoding):
# Mac: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: winget install ffmpeg

streamlit run app.py
```
Requirements:
```text
streamlit>=1.32.0
librosa>=0.10.1
soundfile>=0.12.1
mediapipe>=0.10.9
opencv-python>=4.9.0
scipy>=1.12.0
matplotlib>=3.8.0
plotly>=5.20.0
supabase>=2.4.0
```
The Bigger Picture
Deepfake technology is not going away. The models will get better. The voices will get more convincing. The videos will become indistinguishable from reality at higher and higher quality levels.
The only viable response is detection infrastructure that scales as fast as the generation technology. Detection that is free, accessible, running in the background of every device, integrated into every KYC flow and every call system.
That is what I am building with Vigil AI. One signal at a time.
Follow the progress:
- Website: vigilai.online
- X: @vigilai_x
- Email: abhishekkumarbecool3@gmail.com
If you found this technical breakdown useful, drop a reaction and share it with anyone building in the AI safety, fraud detection, or identity verification space.
Built with Python, librosa, MediaPipe, OpenCV, FastAPI, Supabase, and a lot of late nights. Founded by Abhishek Kumar, India.
This article was originally published by DEV Community and written by Abhishek Kumar.